Thread: pg_rewind : feature to rewind promoted standby is broken!
I think pg_rewind's feature to rewind the promoted standby as a new standby is broken in 11 STEPS: 1. create master standby setup. Use below script for same. 2. Promote the standby [mithuncy@localhost pgrewmasterbin]$ ./bin/pg_ctl -D standby promote waiting for server to promote.... done server promoted 3. In promoted standby create a database and a table in the new database. [mithuncy@localhost pgrewmasterbin]$ ./bin/psql -p 5433 postgres postgres=# create database db1; CREATE DATABASE postgres=# \c db1 You are now connected to database "db1" as user "mithuncy". db1=# create table t1 (t int); CREATE TABLE 4. try to rewind the newly promoted standby (with old master as source) [mithuncy@localhost pgrewmasterbin]$ ./bin/pg_ctl -D standby stop waiting for server to shut down....... done server stopped [mithuncy@localhost pgrewmasterbin]$ ./bin/pg_rewind -D standby --source-server="host=127.0.0.1 port=5432 user=mithuncy dbname=postgres" servers diverged at WAL location 0/3000060 on timeline 1 rewinding from last common checkpoint at 0/2000060 on timeline 1 could not remove directory "standby/base/16384": Directory not empty Failure, exiting Note: dry run was successful! [mithuncy@localhost pgrewmasterbin]$ ./bin/pg_rewind -D standby --source-server="host=127.0.0.1 port=5432 user=mithuncy dbname=postgres" -n servers diverged at WAL location 0/3000060 on timeline 1 rewinding from last common checkpoint at 0/2000060 on timeline 1 Done! Also I have tested same in version 10 it works fine there. Did below commit has broken this feature? (Thanks to kuntal for identifying same) commit 266b6acb312fc440c1c1a2036aa9da94916beac6 Author: Fujii Masao <fujii@postgresql.org> Date: Thu Mar 29 04:56:52 2018 +0900 Make pg_rewind skip files and directories that are removed during server start. -- Thanks and Regards Mithun Chicklore Yogendra EnterpriseDB: http://www.enterprisedb.com
Attachment
On Tue, Mar 12, 2019 at 01:23:51PM +0530, Mithun Cy wrote: > I think pg_rewind's feature to rewind the promoted standby as a new > standby is broken in 11 Confirmed, it is. > Also I have tested same in version 10 it works fine there. > > Did below commit has broken this feature? (Thanks to kuntal for > identifying same) > commit 266b6acb312fc440c1c1a2036aa9da94916beac6 > Author: Fujii Masao <fujii@postgresql.org> > Date: Thu Mar 29 04:56:52 2018 +0900 > Make pg_rewind skip files and directories that are removed during server start. And you are pointing out to the correct commit. The issue is that process_target_file() has added a call to check_file_excluded(), and this skips all the folders which it thinks can be skipped. One problem though is that we also filter out pg_internal.init, which is present in each database folder, and remains in the target directory marked for deletion. Then, when the deletion happens, the failure happens as the directory is not fully empty. We could consider using rmtree() instead, but it is a nice sanity check to make sure that all the entries in a path have been marked for deletion. Just removing the filter check on the target is fine I think, as we only should have the filters anyway to avoid copying unnecessary files from the source. Attached is a patch. What do you think? (check_file_excluded() could be simplified further but it's also nice to keep some mirroring in this API if we finish by using it for target files at some point.) -- Michael
Attachment
On Tue, Mar 12, 2019 at 01:23:51PM +0530, Mithun Cy wrote: > I think pg_rewind's feature to rewind the promoted standby as a new > standby is broken in 11 Confirmed, it is. > Also I have tested same in version 10 it works fine there. > > Did below commit has broken this feature? (Thanks to kuntal for > identifying same) > commit 266b6acb312fc440c1c1a2036aa9da94916beac6 > Author: Fujii Masao <fujii@postgresql.org> > Date: Thu Mar 29 04:56:52 2018 +0900 > Make pg_rewind skip files and directories that are removed during server start. And you are pointing out to the correct commit. The issue is that process_target_file() has added a call to check_file_excluded(), and this skips all the folders which it thinks can be skipped. One problem though is that we also filter out pg_internal.init, which is present in each database folder, and remains in the target directory marked for deletion. Then, when the deletion happens, the failure happens as the directory is not fully empty. We could consider using rmtree() instead, but it is a nice sanity check to make sure that all the entries in a path have been marked for deletion. Just removing the filter check on the target is fine I think, as we only should have the filters anyway to avoid copying unnecessary files from the source. Attached is a patch. What do you think? (check_file_excluded() could be simplified further but it's also nice to keep some mirroring in this API if we finish by using it for target files at some point.) -- Michael
On Tue, Mar 12, 2019 at 06:23:01PM +0900, Michael Paquier wrote: > And you are pointing out to the correct commit. The issue is that > process_target_file() has added a call to check_file_excluded(), and > this skips all the folders which it thinks can be skipped. One > problem though is that we also filter out pg_internal.init, which is > present in each database folder, and remains in the target directory > marked for deletion. Then, when the deletion happens, the failure > happens as the directory is not fully empty. Okay, here is a refined patch with better comments, the addition of a test case (creating tables in the new databases in 002_databases.pl is enough to trigger the problem). Could you check that it fixes the issue on your side? -- Michael
On Tue, Mar 12, 2019 at 06:23:01PM +0900, Michael Paquier wrote: > And you are pointing out to the correct commit. The issue is that > process_target_file() has added a call to check_file_excluded(), and > this skips all the folders which it thinks can be skipped. One > problem though is that we also filter out pg_internal.init, which is > present in each database folder, and remains in the target directory > marked for deletion. Then, when the deletion happens, the failure > happens as the directory is not fully empty. Okay, here is a refined patch with better comments, the addition of a test case (creating tables in the new databases in 002_databases.pl is enough to trigger the problem). Could you check that it fixes the issue on your side? -- Michael
Attachment
On Wed, Mar 13, 2019 at 1:38 PM Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Mar 12, 2019 at 06:23:01PM +0900, Michael Paquier wrote:
> And you are pointing out to the correct commit. The issue is that
> process_target_file() has added a call to check_file_excluded(), and
> this skips all the folders which it thinks can be skipped. One
> problem though is that we also filter out pg_internal.init, which is
> present in each database folder, and remains in the target directory
> marked for deletion. Then, when the deletion happens, the failure
> happens as the directory is not fully empty.
Okay, here is a refined patch with better comments, the addition of a
test case (creating tables in the new databases in 002_databases.pl is
enough to trigger the problem).
I have not looked into the patch but quick test show it has fixed the above issue.
[mithuncy@localhost pgrewindbin]$ ./bin/pg_rewind -D standby --source-server="host=127.0.0.1 port=5432 user=mithuncy dbname=postgres" -n
servers diverged at WAL location 0/3000000 on timeline 1
rewinding from last common checkpoint at 0/2000060 on timeline 1
Done!
[mithuncy@localhost pgrewindbin]$ ./bin/pg_rewind -D standby --source-server="host=127.0.0.1 port=5432 user=mithuncy dbname=postgres"
servers diverged at WAL location 0/3000000 on timeline 1
rewinding from last common checkpoint at 0/2000060 on timeline 1
Done!
servers diverged at WAL location 0/3000000 on timeline 1
rewinding from last common checkpoint at 0/2000060 on timeline 1
Done!
[mithuncy@localhost pgrewindbin]$ ./bin/pg_rewind -D standby --source-server="host=127.0.0.1 port=5432 user=mithuncy dbname=postgres"
servers diverged at WAL location 0/3000000 on timeline 1
rewinding from last common checkpoint at 0/2000060 on timeline 1
Done!
--
On Wed, Mar 13, 2019 at 1:38 PM Michael Paquier <michael@paquier.xyz> wrote:
On Tue, Mar 12, 2019 at 06:23:01PM +0900, Michael Paquier wrote:
> And you are pointing out to the correct commit. The issue is that
> process_target_file() has added a call to check_file_excluded(), and
> this skips all the folders which it thinks can be skipped. One
> problem though is that we also filter out pg_internal.init, which is
> present in each database folder, and remains in the target directory
> marked for deletion. Then, when the deletion happens, the failure
> happens as the directory is not fully empty.
Okay, here is a refined patch with better comments, the addition of a
test case (creating tables in the new databases in 002_databases.pl is
enough to trigger the problem).
I have not looked into the patch but quick test show it has fixed the above issue.
[mithuncy@localhost pgrewindbin]$ ./bin/pg_rewind -D standby --source-server="host=127.0.0.1 port=5432 user=mithuncy dbname=postgres" -n
servers diverged at WAL location 0/3000000 on timeline 1
rewinding from last common checkpoint at 0/2000060 on timeline 1
Done!
[mithuncy@localhost pgrewindbin]$ ./bin/pg_rewind -D standby --source-server="host=127.0.0.1 port=5432 user=mithuncy dbname=postgres"
servers diverged at WAL location 0/3000000 on timeline 1
rewinding from last common checkpoint at 0/2000060 on timeline 1
Done!
servers diverged at WAL location 0/3000000 on timeline 1
rewinding from last common checkpoint at 0/2000060 on timeline 1
Done!
[mithuncy@localhost pgrewindbin]$ ./bin/pg_rewind -D standby --source-server="host=127.0.0.1 port=5432 user=mithuncy dbname=postgres"
servers diverged at WAL location 0/3000000 on timeline 1
rewinding from last common checkpoint at 0/2000060 on timeline 1
Done!
--
On Thu, Mar 14, 2019 at 12:15:57AM +0530, Mithun Cy wrote: > I have not looked into the patch but quick test show it has fixed the above > issue. Thanks for confirming, Mythun. I'll think about the logic of this patch for a couple of days in the background, then I'll try to commit it likely at the beginning of next week. -- Michael
Attachment
On Thu, Mar 14, 2019 at 12:15:57AM +0530, Mithun Cy wrote: > I have not looked into the patch but quick test show it has fixed the above > issue. Thanks for confirming, Mythun. I'll think about the logic of this patch for a couple of days in the background, then I'll try to commit it likely at the beginning of next week. -- Michael
On Thu, Mar 14, 2019 at 09:04:41AM +0900, Michael Paquier wrote: > Thanks for confirming, Mythun. I'll think about the logic of this > patch for a couple of days in the background, then I'll try to commit > it likely at the beginning of next week. Committed. I have spent extra time polishing the comments to make the filtering rules clearer when processing the source and target files, particularly why they are useful the way they are. -- Michael
Attachment
On Thu, Mar 14, 2019 at 09:04:41AM +0900, Michael Paquier wrote: > Thanks for confirming, Mythun. I'll think about the logic of this > patch for a couple of days in the background, then I'll try to commit > it likely at the beginning of next week. Committed. I have spent extra time polishing the comments to make the filtering rules clearer when processing the source and target files, particularly why they are useful the way they are. -- Michael