Thread: [HACKERS] WIP: Restricting pg_rewind to data/wal dirs

[HACKERS] WIP: Restricting pg_rewind to data/wal dirs

From
Chris Travers
Date:
Hi;

There are still some cleanup bits needed here but I wanted to get feedback on my general direction.

I hope to submit for commit fest soon if the general feedback is good.  Tests are passing (with adjustments intended for change of behaviour in one test script).  I want to note that Crimson Thompson (also with Adjust) has provided some valuable discussion on this code.

The Problem:

pg_rewind makes no real guarantees regarding the final state of non-data files or directories.  It.checks to see if the timeline has incremented (and therefore guarantees that if successful the data directories are on the same timeline) but for non-data files, these are clobbered if we rewind and left intact if not.  These other files include postgresql.auto.conf, replication slots, and can include log files.

Copying logs over to the new slave is something one probably never wants to do (same with replication slots), and the behaviours here can mask troubleshooting regarding what a particular master failed, cause wal segments to build up, automatic settings changes, and other undesirable behaviours.  Together these make pg_rewind very difficult to use properly and push tasks to replication management tooling that the management tools are not in a good position to handle correctly.

Designing the Fix:

Two proposed fixes have been considered and one selected:  Either we whitelist directories and only rewind those.  The other was to allow shell globs to be provided that could be used to exclude files.  The whitelisting solution was chosen for a number of reasons.

When pg_rewind "succeeds" but not quite correctly the result is usually a corrupted installation which requires a base backup to replace it anyway.  In a recovery situation, sometimes pressures occur which render human judgment less effective, and therefore glob exclusion strikes me as something which would probably do more harm than good, but maybe I don't understand the use case (comments as to why some people look at the other solution as preferable would be welcome).

In going with the whitelisting solution, we chose to include all directories with WAL-related information.    This allows more predicable interaction with other parts of the replication chain.  Consequently not only do we copy pg_wal and pg_xact but also commit timestamps and so forth.

The Solution:

The solution is a whitelist of directories specified which are the only ones which are synchronised.  The relevant part of this patch is:

+/* List of directories to synchronize:

+ * base data dirs (and ablespaces)

+ * wal/transaction data

+ * and that is it.

+ *

+ * This array is null-terminated to make

+ * it easy to expand

+ */

+

+const char *rewind_dirs[] = {

+    "base",

+    "global",

+    "pg_commit_ts",

+    "pg_logical",

+    "pg_multixact",

+    "pg_serial",

+    "pg_subtrans",

+    "pg_tblspc",

+    "pg_twophase",

+    "pg_wal",

+    "pg_xact",

+    NULL

+};


From there we iterate over this array for directory-based approaches in copy_fetch.c and with one query per directory in libpq_fetch.c.  This also means shifting from the basic interface from PQexec to PQexecParams.  It would be possible to move to binary formats too, but this was not done currently in order to simplify code review (that could be a separate independent patch at a later time).


Testing Done:

The extra files tests correctly test this change in behaviour.  The tests have been modified, and diagnostics in cases of failures expanded, in this case.  The other tests provide good coverage of whether pg_rewind does what it is supposed to do.

Cleanup still required:

I accidentally left Carp::Always in the PM in this perl module.  It will be fixed.

I renamed one of the functions used to have a more descriptive name but currently did not remove the old function yet. 


Feedback is very welcome.  pg_rewind is a very nice piece of software.  I am hoping that these sorts of changes will help ensure that it is easier to use and provides more predictable results.
--
Best Regards,
Chris Travers
Database Administrator

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin

Attachment

Re: [HACKERS] WIP: Restricting pg_rewind to data/wal dirs

From
Robert Haas
Date:
On Sat, Oct 28, 2017 at 4:52 PM, Chris Travers <chris.travers@adjust.com> wrote:
> There are still some cleanup bits needed here but I wanted to get feedback
> on my general direction.
>
> I hope to submit for commit fest soon if the general feedback is good.

I think you should submit to the CommitFest regardless, to increase
your chances of getting feedback.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Restricting pg_rewind to data/wal dirs

From
Michael Paquier
Date:
On Sat, Oct 28, 2017 at 4:22 AM, Chris Travers <chris.travers@adjust.com> wrote:
> The Solution:
> The solution is a whitelist of directories specified which are the only ones
> which are synchronised.  The relevant part of this patch is:
>
> +/* List of directories to synchronize:
> + * base data dirs (and ablespaces)
> + * wal/transaction data
> + * and that is it.
> + *
> + * This array is null-terminated to make
> + * it easy to expand
> + */
>
> +const char *rewind_dirs[] = {
> +    "base",
> +    "global",
> +    "pg_commit_ts",
> +    "pg_logical",
> +    "pg_multixact",
> +    "pg_serial",
> +    "pg_subtrans",
> +    "pg_tblspc",
> +    "pg_twophase",
> +    "pg_wal",
> +    "pg_xact",
> +    NULL
> +};

The problem with a white list, which is one reason why I do not favor
it, is in the case where a feature adds in the data folder a new path
for its data, particularly in the case where this is critical for a
base backup. If this folder is not added in this list, then a rewind
may be silently corrupted as well. There are also a set of directories
in this list that we do not care about, pg_serial being one.
pg_subtrans is a second, as it gets zeroed on startup.

And if you think about it, pg_rewind is actually the *same* processing
as a base backup taken using the replication protocol. So instead of
this patch I would recommend using excludeFiles and excludeDirContents
by moving this list to a common header where frontend and backend can
refer to. Note that basebackup.h includes replnodes.h, so my thoughts
is that you should split the current header with something like
basebackup_internal.h which is backend-only, and have pg_rewind fetch
the list of directories to automatically ignore as those ones. You can
also simplify a bit the code of pg_rewind a bit so as things like
postmaster.opts. On top of that, there is visibly no need to tweak the
SQL query fetching the directory list (which is useful for debugging
as shaped now actually), but just change process_source_file so as its
content is not added in the file map.

Then there is a second case where you do not want a set of folders to
be included, but those can be useful by default, an example here being
pg_wal where one might want to have an empty path to begin with. A
third case is a set of folders that you do not care about, but depends
on the deployment, being here "log" for example. For those you could
just add an --exclude-path option which simply piles up paths into a
linked list when called multiple times. This could happen with a
second patch.

> From there we iterate over this array for directory-based approaches in
> copy_fetch.c and with one query per directory in libpq_fetch.c.  This also
> means shifting from the basic interface from PQexec to PQexecParams.  It
> would be possible to move to binary formats too, but this was not done
> currently in order to simplify code review (that could be a separate
> independent patch at a later time).

-    res = PQexec(conn, sql);
+    for (p = 0; rewind_dirs[p] != NULL; ++p)
+    {
+        const char *paths[1];
+        paths[0] = rewind_dirs[p];
+        res = PQexecParams(conn, sql,  1, NULL, paths, NULL, NULL, 0);
Calling multiple times the query to list all directories is messy IMO.
And this is N-costly processing if there are many files to look at,
say many relation files.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Restricting pg_rewind to data/wal dirs

From
Chris Travers
Date:
First, thanks for your thoughts on this, and I am interested in probing them more.

On Mon, Oct 30, 2017 at 9:04 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Sat, Oct 28, 2017 at 4:22 AM, Chris Travers <chris.travers@adjust.com> wrote:
> The Solution:
> The solution is a whitelist of directories specified which are the only ones
> which are synchronised.  The relevant part of this patch is:
>
> +/* List of directories to synchronize:
> + * base data dirs (and ablespaces)
> + * wal/transaction data
> + * and that is it.
> + *
> + * This array is null-terminated to make
> + * it easy to expand
> + */
>
> +const char *rewind_dirs[] = {
> +    "base",
> +    "global",
> +    "pg_commit_ts",
> +    "pg_logical",
> +    "pg_multixact",
> +    "pg_serial",
> +    "pg_subtrans",
> +    "pg_tblspc",
> +    "pg_twophase",
> +    "pg_wal",
> +    "pg_xact",
> +    NULL
> +};

The problem with a white list, which is one reason why I do not favour
it, is in the case where a feature adds in the data folder a new path
for its data, particularly in the case where this is critical for a
base backup. If this folder is not added in this list, then a rewind
may be silently corrupted as well. There are also a set of directories
in this list that we do not care about, pg_serial being one.
pg_subtrans is a second, as it gets zeroed on startup.

The problem with an exclude list is that we cannot safely exclude configuration files or logs (because these could be named anything), right?

Are there any cases right now where you have features added by extensions that write to directories which are required for a rewind?  I am asking because I would like to see if this is the minimum working change or if this change is fundamentally broken currently and must be extended to allow custom paths to be sync'd as well.

And if you think about it, pg_rewind is actually the *same* processing
as a base backup taken using the replication protocol. So instead of
this patch I would recommend using excludeFiles and excludeDirContents
by moving this list to a common header where frontend and backend can
refer to. Note that basebackup.h includes replnodes.h, so my thoughts
is that you should split the current header with something like
basebackup_internal.h which is backend-only, and have pg_rewind fetch
the list of directories to automatically ignore as those ones. You can
also simplify a bit the code of pg_rewind a bit so as things like
postmaster.opts. On top of that, there is visibly no need to tweak the
SQL query fetching the directory list (which is useful for debugging
as shaped now actually), but just change process_source_file so as its
content is not added in the file map.

I am not sure it *should* be the same, however.  In a backup we probably want to backup the postgresql.auto.conf, but on a failover, I don't think we want to clobber configuration.  We certainly do not want to sometimes clobber configuration but not other times (which is what happens right now in some cases).  And under no circumstances do we want to clobber logs on a failed server with logs on a working server.  That's asking for serious problems in my view. 

If you think about it, there's a huge difference in use case in backing up a database cluster (Including replication slots, configs in the dir etc) and re-syncing the data so that replication can resume, and I think there are some dangers that come up when assuming these should be closely tied together.

Then there is a second case where you do not want a set of folders to
be included, but those can be useful by default, an example here being
pg_wal where one might want to have an empty path to begin with. A
third case is a set of folders that you do not care about, but depends
on the deployment, being here "log" for example. For those you could
just add an --exclude-path option which simply piles up paths into a
linked list when called multiple times. This could happen with a
second patch.

Agreed.  And one could add an "--include-path" option to allow for unusual cases where you want extra directories, such as replication slots, or the like.

I think another patch could also specifically empty and perhaps create a replication slot allowing for one to bring tings back up via streaming replication more safely.

> From there we iterate over this array for directory-based approaches in
> copy_fetch.c and with one query per directory in libpq_fetch.c.  This also
> means shifting from the basic interface from PQexec to PQexecParams.  It
> would be possible to move to binary formats too, but this was not done
> currently in order to simplify code review (that could be a separate
> independent patch at a later time).

-    res = PQexec(conn, sql);
+    for (p = 0; rewind_dirs[p] != NULL; ++p)
+    {
+        const char *paths[1];
+        paths[0] = rewind_dirs[p];
+        res = PQexecParams(conn, sql,  1, NULL, paths, NULL, NULL, 0);
Calling multiple times the query to list all directories is messy IMO.
And this is N-costly processing if there are many files to look at,
say many relation files.

We wouldn't recurse into the relation files dir more than once since this is filtered out on the initial query in the recursive CTE before recursion.  In other words, the query filters out the top-level query before it recurses into any directory so there is a smalll cost (planning, the fact that the root of the pgdata is queried once per) but this seems pretty minor given that we need at least one query per file that we sync anyway.

I am open to changing it to an array, but one thing to think about is that if we want to add an ability to add extra directories, this makes it much easier to do that.
 
--
Michael



--
Best Regards,
Chris Travers
Database Administrator

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin

Re: [HACKERS] WIP: Restricting pg_rewind to data/wal dirs

From
Michael Paquier
Date:
On Mon, Oct 30, 2017 at 9:43 AM, Chris Travers <chris.travers@adjust.com> wrote:
> Are there any cases right now where you have features added by extensions that write to directories which are
requiredfor a rewind?
 

In some of the stuff I maintain, I actually have one case now of a
configuration file included with include_if_exists by postgresql.conf
and is expected to be generated by a component that my code doing the
rewind has no direct access on... I can control how pg_rewind kicks
in, but I think that you would actually break silently the current
code, which is scary especially if I don't control the code where
pg_rewind is called when Postgres 11 gets integrated into the product
I am thinking about, and even more scary if there is no way to include
something.

> The problem with an exclude list is that we cannot safely exclude
> configuration files or logs (because these could be named anything), right?

You have the exact same problem with base backups now, and any
configuration files created by extensions are a per-case problem...
The pattern that base backups now use is an exclude list. In the
future I would rather see base backups and pg_rewind using the same
infrastructure for both things:
- pg_rewind should use the replication protocol with BASE_BACKUP.
Having it rely on root access now is crazy.
- BASE_BACKUP should have an option where it is possible to exclude
custom paths.
What you are proposing here would make both diverge more, which is in
my opinion not a good approach.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Restricting pg_rewind to data/wal dirs

From
Chris Travers
Date:


On Mon, Oct 30, 2017 at 10:57 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Mon, Oct 30, 2017 at 9:43 AM, Chris Travers <chris.travers@adjust.com> wrote:
> Are there any cases right now where you have features added by extensions that write to directories which are required for a rewind?

In some of the stuff I maintain, I actually have one case now of a
configuration file included with include_if_exists by postgresql.conf
and is expected to be generated by a component that my code doing the
rewind has no direct access on... I can control how pg_rewind kicks
in, but I think that you would actually break silently the current
code, which is scary especially if I don't control the code where
pg_rewind is called when Postgres 11 gets integrated into the product
I am thinking about, and even more scary if there is no way to include
something.

Ok, so there is an argument that there needs to be a way to include additional paths in this patch.  One important question I would have in these cases is if you expect these to be set only on the master.  If not, then is this a problem and if not then how do you handle the fail-back problem etc?

This also brings up a fairly major concern more generally about control by the way.  A lot of cases where pg_rewind is called, the user doesn't necessarily have much control on how it is called.  Moreover in many of these cases, the user is probably not in a position to understand the internals well enough to grasp what to check after.

> The problem with an exclude list is that we cannot safely exclude
> configuration files or logs (because these could be named anything), right?

You have the exact same problem with base backups now, and any
configuration files created by extensions are a per-case problem...

Right, but there is a use case difference between "I am taking a backup of a server" and "I need to get the server into  rejoin the replication as a standby."

A really good example of where this is a big problem is with replication slots.  On a backup I would expect you want replication slots to be copied in.  However when setting yourself up as a slave you most certainly do not want to just copy these over from the master unless you have infinite disk space.  I would argue that under *no* circumstance should pg_rewind *ever* copy replication slots.  But pg_base_backup really *should* (and when provisioning a new slave you should clear these as soon as it is set up).

The pattern that base backups now use is an exclude list. In the
future I would rather see base backups and pg_rewind using the same
infrastructure for both things:
- pg_rewind should use the replication protocol with BASE_BACKUP.
Having it rely on root access now is crazy.
- BASE_BACKUP should have an option where it is possible to exclude
custom paths.
What you are proposing here would make both diverge more, which is in
my opinion not a good approach.

How does rep mgr or other programs using pg_rewind know what to exclude?

Again I am not convinced setting up a replica and taking a backup for disaster recovery are the same use case or have the same requirements.

--
Michael



--
Best Regards,
Chris Travers
Database Administrator

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin

Re: [HACKERS] WIP: Restricting pg_rewind to data/wal dirs

From
Michael Paquier
Date:
On Mon, Oct 30, 2017 at 10:15 AM, Chris Travers
<chris.travers@adjust.com> wrote:
> This also brings up a fairly major concern more generally about control by
> the way.  A lot of cases where pg_rewind is called, the user doesn't
> necessarily have much control on how it is called.  Moreover in many of
> these cases, the user is probably not in a position to understand the
> internals well enough to grasp what to check after.

Likely they are not.

> Right, but there is a use case difference between "I am taking a backup of a
> server" and "I need to get the server into  rejoin the replication as a
> standby."

The intersection of the first and second categories is not empty. Some
take backups and use them to deploy standbys.

> A really good example of where this is a big problem is with replication
> slots.  On a backup I would expect you want replication slots to be copied
> in.

I would actually expect the contrary, and please note that replication
slots are not taken in a base backup, which is what the documentation
says as well:
https://www.postgresql.org/docs/10/static/protocol-replication.html
"pg_dynshmem, pg_notify, pg_replslot, pg_serial, pg_snapshots,
pg_stat_tmp, and pg_subtrans are copied as empty directories (even if
they are symbolic links)."

Some code I have with 9.6's pg_bsaebackup removes manually replication
slots as this logic is new in 10 ;)

>> The pattern that base backups now use is an exclude list. In the
>> future I would rather see base backups and pg_rewind using the same
>> infrastructure for both things:
>> - pg_rewind should use the replication protocol with BASE_BACKUP.
>> Having it rely on root access now is crazy.
>> - BASE_BACKUP should have an option where it is possible to exclude
>> custom paths.
>> What you are proposing here would make both diverge more, which is in
>> my opinion not a good approach.
>
> How does rep mgr or other programs using pg_rewind know what to exclude?

Good question. Answers could come from folks such as David Steele
(pgBackRest) or Marco (barman) whom I am attaching in CC.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Restricting pg_rewind to data/wal dirs

From
Chris Travers
Date:


On Mon, Oct 30, 2017 at 11:36 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Mon, Oct 30, 2017 at 10:15 AM, Chris Travers
<chris.travers@adjust.com> wrote:
> This also brings up a fairly major concern more generally about control by
> the way.  A lot of cases where pg_rewind is called, the user doesn't
> necessarily have much control on how it is called.  Moreover in many of
> these cases, the user is probably not in a position to understand the
> internals well enough to grasp what to check after.

Likely they are not.

So currently I am submitting the current patch to commit fest.  I am open to adding a new 
--Include-path option in this patch, but I am worried about atomicity concerns, and I am still
not really sure what the impact is for you (I haven't heard whether you expect this file to be
there before the rewind, i.e. whether it would be on both master and slave, or just on the slave). 

Sp there is the question of under what specific circumstances this would break for you.

> Right, but there is a use case difference between "I am taking a backup of a
> server" and "I need to get the server into  rejoin the replication as a
> standby."

The intersection of the first and second categories is not empty. Some
take backups and use them to deploy standbys.

Sure, but it is not complete either. 

> A really good example of where this is a big problem is with replication
> slots.  On a backup I would expect you want replication slots to be copied
> in.

I would actually expect the contrary, and please note that replication
slots are not taken in a base backup, which is what the documentation
says as well:
https://www.postgresql.org/docs/10/static/protocol-replication.html
"pg_dynshmem, pg_notify, pg_replslot, pg_serial, pg_snapshots,
pg_stat_tmp, and pg_subtrans are copied as empty directories (even if
they are symbolic links)."

Many of these are emptied on server restart but repslot is not.  What is the logic
for excluding it from backups other than to avoid problems with replication provisioning?

I mean if I have a replication slot for taking backups with barman (and I am not actually doing replication) why would I *not* want that in my base backup if I might have to restore to a new machine somewhere?

Some code I have with 9.6's pg_bsaebackup removes manually replication
slots as this logic is new in 10 ;)

>> The pattern that base backups now use is an exclude list. In the
>> future I would rather see base backups and pg_rewind using the same
>> infrastructure for both things:
>> - pg_rewind should use the replication protocol with BASE_BACKUP.
>> Having it rely on root access now is crazy.
>> - BASE_BACKUP should have an option where it is possible to exclude
>> custom paths.
>> What you are proposing here would make both diverge more, which is in
>> my opinion not a good approach.
>
> How does rep mgr or other programs using pg_rewind know what to exclude?

Good question. Answers could come from folks such as David Steele
(pgBackRest) or Marco (barman) whom I am attaching in CC.

Two points that further occur to me:

Shell globs seem to me to be foot guns in this area, I think special paths should be one path per invocation of the option not "--exclude=pg_w*" since this is just asking for problems as time goes on and things get renamed.

It also seems to me that adding specific paths is far safer than removing specific paths. 
--
Michael



--
Best Regards,
Chris Travers
Database Administrator

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin

Re: [HACKERS] WIP: Restricting pg_rewind to data/wal dirs

From
David Steele
Date:
On 10/30/17 6:36 AM, Michael Paquier wrote:
> On Mon, Oct 30, 2017 at 10:15 AM, Chris Travers
>>
>> How does rep mgr or other programs using pg_rewind know what to exclude?
> 
> Good question. Answers could come from folks such as David Steele
> (pgBackRest) or Marco (barman) whom I am attaching in CC.

pgBackRest does not use pg_rewind.  However, pgBackRest does use the
same exclusion list for backups as pg_basebackup.

-- 
-David
david@pgmasters.net


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: Restricting pg_rewind to data/wal dirs

From
Chris Travers
Date:
Attached is the patch as submitted for commitfest.

Please note, I am not adverse to adding an additional --Include-path directive if that avoids backwards-compatibility problems.  However the patch is complex enough I would really prefer review on the rest of it to start first.  This doesn't strike me as perfectly trivial and I think it is easier to review pieces separately.  I have already started on the --Include-path directive and could probably get it attached to a later version of the patch very soon.

I would also like to address a couple of important points here:

1.  I think default restrictions plus additional paths is the best, safest way forward.  Excluding shell-globs doesn't solve the "I need this particular config file" very well particularly if we want to support this outside of an internal environment.  Shell globs also tend to be overly inclusive and so if you exclude based on them, you run into a chance that your rewind is corrupt for being overly exclusive.

2.  I would propose any need for an additional paths be specified using an --Include-path directive.  This could be specified multiple times and could point to a file or directory which would be added to the paths rewound.  I have a patch for this mostly done but I am concerned that these sorts of changes result in a combination of changes that are easier to review separately than together.  So if needed, I can add it or in a separate patch following.

3.  I think it would be a mistake to tie backup solutions in non-replicated environments to replication use cases, and vice versa.  Things like replication slots (used for streaming backups) have different considerations in different environments.  Rather than build the same infrastructure first, I think it is better to support different use cases well and then build common infrastructure to support the different cases.  I am not against building common infrastructure for pg_rewind and pg_basebackup.  I am very much against having the core guarantees being the exact same.

Best Wishes,
Chris Travers

On Sat, Oct 28, 2017 at 1:22 PM, Chris Travers <chris.travers@adjust.com> wrote:
Hi;

There are still some cleanup bits needed here but I wanted to get feedback on my general direction.

I hope to submit for commit fest soon if the general feedback is good.  Tests are passing (with adjustments intended for change of behaviour in one test script).  I want to note that Crimson Thompson (also with Adjust) has provided some valuable discussion on this code.

The Problem:

pg_rewind makes no real guarantees regarding the final state of non-data files or directories.  It.checks to see if the timeline has incremented (and therefore guarantees that if successful the data directories are on the same timeline) but for non-data files, these are clobbered if we rewind and left intact if not.  These other files include postgresql.auto.conf, replication slots, and can include log files.

Copying logs over to the new slave is something one probably never wants to do (same with replication slots), and the behaviours here can mask troubleshooting regarding what a particular master failed, cause wal segments to build up, automatic settings changes, and other undesirable behaviours.  Together these make pg_rewind very difficult to use properly and push tasks to replication management tooling that the management tools are not in a good position to handle correctly.

Designing the Fix:

Two proposed fixes have been considered and one selected:  Either we whitelist directories and only rewind those.  The other was to allow shell globs to be provided that could be used to exclude files.  The whitelisting solution was chosen for a number of reasons.

When pg_rewind "succeeds" but not quite correctly the result is usually a corrupted installation which requires a base backup to replace it anyway.  In a recovery situation, sometimes pressures occur which render human judgment less effective, and therefore glob exclusion strikes me as something which would probably do more harm than good, but maybe I don't understand the use case (comments as to why some people look at the other solution as preferable would be welcome).

In going with the whitelisting solution, we chose to include all directories with WAL-related information.    This allows more predicable interaction with other parts of the replication chain.  Consequently not only do we copy pg_wal and pg_xact but also commit timestamps and so forth.

The Solution:

The solution is a whitelist of directories specified which are the only ones which are synchronised.  The relevant part of this patch is:

+/* List of directories to synchronize:

+ * base data dirs (and ablespaces)

+ * wal/transaction data

+ * and that is it.

+ *

+ * This array is null-terminated to make

+ * it easy to expand

+ */

+

+const char *rewind_dirs[] = {

+    "base",

+    "global",

+    "pg_commit_ts",

+    "pg_logical",

+    "pg_multixact",

+    "pg_serial",

+    "pg_subtrans",

+    "pg_tblspc",

+    "pg_twophase",

+    "pg_wal",

+    "pg_xact",

+    NULL

+};


From there we iterate over this array for directory-based approaches in copy_fetch.c and with one query per directory in libpq_fetch.c.  This also means shifting from the basic interface from PQexec to PQexecParams.  It would be possible to move to binary formats too, but this was not done currently in order to simplify code review (that could be a separate independent patch at a later time).


Testing Done:

The extra files tests correctly test this change in behaviour.  The tests have been modified, and diagnostics in cases of failures expanded, in this case.  The other tests provide good coverage of whether pg_rewind does what it is supposed to do.

Cleanup still required:

I accidentally left Carp::Always in the PM in this perl module.  It will be fixed.

I renamed one of the functions used to have a more descriptive name but currently did not remove the old function yet. 


Feedback is very welcome.  pg_rewind is a very nice piece of software.  I am hoping that these sorts of changes will help ensure that it is easier to use and provides more predictable results.
--
Best Regards,
Chris Travers
Database Administrator

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin




--
Best Regards,
Chris Travers
Database Administrator

Tel: +49 162 9037 210 | Skype: einhverfr | www.adjust.com 
Saarbrücker Straße 37a, 10405 Berlin

Attachment

Re: [HACKERS] WIP: Restricting pg_rewind to data/wal dirs

From
Michael Paquier
Date:
On Wed, Nov 1, 2017 at 5:58 PM, Chris Travers <chris.travers@adjust.com> wrote:
> I would also like to address a couple of important points here:
>
> 1.  I think default restrictions plus additional paths is the best, safest
> way forward.  Excluding shell-globs doesn't solve the "I need this
> particular config file" very well particularly if we want to support this
> outside of an internal environment.  Shell globs also tend to be overly
> inclusive and so if you exclude based on them, you run into a chance that
> your rewind is corrupt for being overly exclusive.
>
> 2.  I would propose any need for an additional paths be specified using an
> --Include-path directive.  This could be specified multiple times and could
> point to a file or directory which would be added to the paths rewound.  I
> have a patch for this mostly done but I am concerned that these sorts of
> changes result in a combination of changes that are easier to review
> separately than together.  So if needed, I can add it or in a separate patch
> following.
>
> 3.  I think it would be a mistake to tie backup solutions in non-replicated
> environments to replication use cases, and vice versa.  Things like
> replication slots (used for streaming backups) have different considerations
> in different environments.  Rather than build the same infrastructure first,
> I think it is better to support different use cases well and then build
> common infrastructure to support the different cases.  I am not against
> building common infrastructure for pg_rewind and pg_basebackup.  I am very
> much against having the core guarantees being the exact same.

+const char *rewind_dirs[] = {
+    "base",         // Default tablespace
+    "global",       // global tablespace
+    "pg_commit_ts", // In case we need to do PITR before up to sync
+    "pg_logical",   // WAL related and no good reason to exclude
+    "pg_multixact", // WAL related and may need for vacuum-related reasons
+    "pg_tblspc",    // Pther tablespaces
+    "pg_twophase",  // mostly to *clear*
+    "pg_wal",       // WAL
+    "pg_xact",      // Commits of transactions
+    NULL
+};
Incorrect comment format here.

+   in full. The advantage of <application>pg_rewind</> over taking a new base
+   backup, or tools like <application>rsync</>, is that
<application>pg_rewind</>
This generates warnings on HEAD when compiling the documentation.

Please note that I am still -1 for using a methodology different than
what is used for base backups with an inclusive method, and would much
prefer an exclusive method by reusing the existing entries in
basebackup.c. Still, I am the only one who expressed an opinion about
this patch, so moved to next CF with waiting on author as status.
-- 
Michael


Re: [HACKERS] WIP: Restricting pg_rewind to data/wal dirs

From
David Steele
Date:
On 11/29/17 12:46 AM, Michael Paquier wrote:> On Wed, Nov 1, 2017 at
5:58 PM, Chris Travers <chris.travers@adjust.com> wrote:
>
> Please note that I am still -1 for using a methodology different than
> what is used for base backups with an inclusive method, and would much
> prefer an exclusive method by reusing the existing entries in
> basebackup.c. Still, I am the only one who expressed an opinion about
> this patch, so moved to next CF with waiting on author as status.

I'm also -1 on the inclusive methodology.  Forgetting something in the
exclusion list just makes the process less efficient while forgetting
something in the inclusion list may mean breakage.  Furthermore,
maintaining two lists does not sound like a good idea.

I worry that extensions using generic WAL might be writing in places we
don't expect and don't think manually adding inclusions is a good
solution as the requirement will not be obvious to the user.

Regards,
-- 
-David
david@pgmasters.net


Re: [HACKERS] WIP: Restricting pg_rewind to data/wal dirs

From
Stephen Frost
Date:
Chris,

* Chris Travers (chris.travers@adjust.com) wrote:
> Attached is the patch as submitted for commitfest.

This has been stuck in Waiting for Author since before the commitfest
started.  I'll try to kick-start it but it seems like it's stuck because
there's a fundamental disagreement about if we should be using include
or exclude for pg_rewind (and, possibly, pg_basebackup, et al).

> Please note, I am not adverse to adding an additional --Include-path
> directive if that avoids backwards-compatibility problems.  However the
> patch is complex enough I would really prefer review on the rest of it to
> start first.  This doesn't strike me as perfectly trivial and I think it is
> easier to review pieces separately.  I have already started on the
> --Include-path directive and could probably get it attached to a later
> version of the patch very soon.
>
> I would also like to address a couple of important points here:
>
> 1.  I think default restrictions plus additional paths is the best, safest
> way forward.  Excluding shell-globs doesn't solve the "I need this
> particular config file" very well particularly if we want to support this
> outside of an internal environment.  Shell globs also tend to be overly
> inclusive and so if you exclude based on them, you run into a chance that
> your rewind is corrupt for being overly exclusive.

There's been a number of strong points made as to why pg_basebackup uses
an exclude list but you seem to keep coming back to a concern around
shell globs when considering exclusion lists.

Having an exclude list doesn't mean that shell globs have to be used to
in the exclude list and, indeed, there are no such globs used in the
exclude list for pg_basebackup today- every single file or directory
excluded by pg_basebackup is an explicit file or directory (compared
using a full strcmp()).

Further, the specific concerns raised are about things which are very
clearly able to be dealt with using specific paths
(postgresql.auto.conf, pg_log) and which an administrator is likely to
be familiar with (such as pg_log, in particular).

Personally, I'm a big advocate of *not* having PG's logs in the data
directory and part of it is because, really, the data directory is PG's
purview while logs are for the administrator.  I wasn't ever a fan of
postgresql.auto.conf and I'm not surprised that we're having this
discussion about if it should be pulled over by pg_rewind or not.

> 3.  I think it would be a mistake to tie backup solutions in non-replicated
> environments to replication use cases, and vice versa.  Things like
> replication slots (used for streaming backups) have different
> considerations in different environments.  Rather than build the same
> infrastructure first, I think it is better to support different use cases
> well and then build common infrastructure to support the different cases.
> I am not against building common infrastructure for pg_rewind and
> pg_basebackup.  I am very much against having the core guarantees being the
> exact same.

Backup solutions are already involved in understanding of replicated
environments as they can be used to back up from replicas rather than
the primary (or perhaps using both in some cases), so I'm not really
sure why you're argueing that backups are somehow different between
non-replicated and replicated environments.

As it relates to the question about if the core guarantees between
pg_basebackup and pg_rewind being the same or not, I've not see much
argument for why they'd be different.  The intent of pg_rewind is to do
what pg_basebackup would do, but in a more efficient manner, as
discussed in the documentation for pg_rewind.

I would think the next step here, as Michael suggested very early on in
this thread, would be to bring the exclude list and perhaps logic for
pg_basebackup into the common code and have pg_rewind leverage that
instead of having its own code that largely does the same and then
adding an option to exclude additional items to that.  There's no sense
having pg_rewind operate on files that are going to end up getting wiped
out when recovery starts anyway.  Perhaps there's a use-case for
overriding the exclude list with a 'include' option too, but I'm not
convinced there is.

Thanks!

Stephen

Attachment

Re: [HACKERS] WIP: Restricting pg_rewind to data/wal dirs

From
Michael Paquier
Date:
On Mon, Jan 22, 2018 at 03:12:58PM -0500, Stephen Frost wrote:
> I would think the next step here, as Michael suggested very early on in
> this thread, would be to bring the exclude list and perhaps logic for
> pg_basebackup into the common code and have pg_rewind leverage that
> instead of having its own code that largely does the same and then
> adding an option to exclude additional items to that.  There's no sense
> having pg_rewind operate on files that are going to end up getting wiped
> out when recovery starts anyway.  Perhaps there's a use-case for
> overriding the exclude list with a 'include' option too, but I'm not
> convinced there is.

Me neither.  I'll look into getting something for the next commit fest.
There have been way too many complaints about how pg_rewind copies too
much data for nothing to ignore doing something (the last one about
pg_replslot data).  A good first step would be what you are writing in
the paragraph above, so I intend to do that as I am sure that it would
be a good addition.  For now I have marked the proposed patches as
returned with feedback.
--
Michael

Attachment