pg cluster not cleaning up after failover - Mailing list pgsql-admin

From Peter Brunnengräber
Subject pg cluster not cleaning up after failover
Date
Msg-id 1941931359.108.1468425520834.JavaMail.pbrunnen@Station8.local
List pgsql-admin
Hello all,
  I'm having an issue with a postgresql 9.2 cluster during failover and hope you all can help.  I have been attempting
to follow the guide provided at ClusterLabs(1) but am not having much luck, and I don't quite understand where the issue is.
I'm running on debian wheezy.

  I have my crm_mon output below.  One server is PRI and operating normally after taking over.  I have pg set up to do
the WAL archiving via rsync to the opposite node.  <archive_command = 'rsync -a %p
test-node2:/db/data/postgresql/9.2/pg_archive/%f'> The rsync is working and I do see WAL files going to the other host
appropriately.
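
For readers unfamiliar with the placeholders in the archive_command above: PostgreSQL replaces %p with the path of the WAL file to archive (relative to the data directory), %f with the file name alone, and %% with a literal percent sign. The sketch below only imitates that substitution so the resulting rsync call can be seen; the function name and the example segment path are illustrative assumptions, not part of PostgreSQL or of the poster's setup.

```python
def expand_archive_command(template: str, wal_path: str) -> str:
    """Imitate PostgreSQL's %p / %f / %% substitution for archive_command.

    Illustrative helper only; PostgreSQL does this internally.
    """
    file_name = wal_path.rsplit("/", 1)[-1]  # %f is the file name without directories
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        if ch == "%" and i + 1 < len(template):
            nxt = template[i + 1]
            if nxt == "p":          # full path of the segment to archive
                out.append(wal_path)
                i += 2
                continue
            if nxt == "f":          # file name only
                out.append(file_name)
                i += 2
                continue
            if nxt == "%":          # escaped percent sign
                out.append("%")
                i += 2
                continue
        out.append(ch)
        i += 1
    return "".join(out)


# Template copied from the post; the segment path is an assumed example.
TEMPLATE = "rsync -a %p test-node2:/db/data/postgresql/9.2/pg_archive/%f"
print(expand_archive_command(TEMPLATE, "pg_xlog/000000010000001E000000BA"))
```

With this template each archived segment lands on test-node2 under its own file name, which is the directory node2's restore step reads from in the log further down.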

  Node2 was originally the PRI.  Last night node1, which was previously in HS:sync, was promoted to PRI, and node2 is
now stopped.  The WAL files are arriving from node1 on node2.  I cleaned up the /tmp/PGSQL.lock file and proceeded with
a pg_basebackup restore from node1.  This all went well, without error in the node1 postgresql log.

  After running a crm cleanup on the msPostgresql resource, node2 keeps showing 'LATEST' but gets hung up at HS:alone.
Plus, I don't understand why the xlog-loc of node2 shows 0000001EB9053DD8, which is farther ahead than node1's
master-baseline of 0000001EB2000080.  I saw the 'cannot stat ... 000000010000001E000000BB' error, but that seems to
always happen for the current xlog filename.
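
To make the comparison of the two locations concrete, here is a minimal sketch that decodes them. It assumes the attribute values are the usual xlogid/offset pair written as 16 hex digits without the slash, that segments have the default 16 MB size, and that everything is on timeline 1 (the log below only shows timeline-1 files). The helper names are made up for illustration.

```python
from typing import Tuple

WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumed default segment size


def parse_location(loc: str) -> Tuple[int, int]:
    """Return (xlogid, offset) from '1E/BA000020' or '0000001EB9053DD8'."""
    if "/" in loc:
        high, low = loc.split("/")
    else:
        high, low = loc[:-8], loc[-8:]  # last 8 hex digits are the offset
    return int(high, 16), int(low, 16)


def wal_file_name(timeline: int, loc: str) -> str:
    """Name of the WAL segment file that contains the given location."""
    xlogid, offset = parse_location(loc)
    return "%08X%08X%08X" % (timeline, xlogid, offset // WAL_SEGMENT_SIZE)


baseline = parse_location("0000001EB2000080")  # node1 pgsql-master-baseline
node2 = parse_location("0000001EB9053DD8")     # node2 pgsql-xlog-loc

# Tuples compare by xlogid first, then by offset.
node2_is_ahead = node2 > baseline

# The byte difference is only computed when both share the same xlogid.
delta = node2[1] - baseline[1] if node2[0] == baseline[0] else None

print(node2_is_ahead, delta)
print(wal_file_name(1, "0000001EB2000080"))
print(wal_file_name(1, "0000001EB9053DD8"))
print(wal_file_name(1, "1E/BA000020"))  # redo start reported in the node2 log
```

Under these assumptions node2's location is 117783896 bytes (about seven 16 MB segments) past the baseline. If pgsql-master-baseline records where node1 stood at promotion rather than its current write position, a standby that has replayed later WAL would be expected to show a larger value; that reading of the attribute is an assumption to check against the resource agent, not something the output here confirms.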

  And if I wasn't confused enough, the pg log on node2 says "streaming replication successfully connected to primary",
and the pg_stat_replication query on node1 shows it connected, but ASYNC.


Any ideas?


Very much appreciated!
-With kind regards,
 Peter Brunnengräber



References:
(1) http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster#after_fail-over


###
============
Last updated: Wed Jul 13 14:51:53 2016
Last change: Wed Jul 13 14:49:17 2016 via crmd on test-node2
Stack: openais
Current DC: test-node1 - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, 2 expected votes
4 Resources configured.
============

Online: [ test-node1 test-node2 ]

Full list of resources:

 Resource Group: g_master
     ClusterIP-Net1     (ocf::heartbeat:IPaddr2):       Started test-node1
     ReplicationIP-Net2 (ocf::heartbeat:IPaddr2):       Started test-node1
 Master/Slave Set: msPostgresql [pgsql]
     Masters: [ test-node1 ]
     Slaves: [ test-node2 ]

Node Attributes:
* Node test-node1:
    + master-pgsql:0                    : 1000
    + master-pgsql:1                    : 1000
    + pgsql-data-status                 : LATEST
    + pgsql-master-baseline             : 0000001EB2000080
    + pgsql-status                      : PRI
* Node test-node2:
    + master-pgsql:0                    : -INFINITY
    + master-pgsql:1                    : -INFINITY
    + pgsql-data-status                 : LATEST
    + pgsql-status                      : HS:alone
    + pgsql-xlog-loc                    : 0000001EB9053DD8

Migration summary:
* Node test-node2:
* Node test-node1:


#### Node2
2016-07-13 14:55:09 UTC LOG:  database system was interrupted; last known up at 2016-07-13 14:54:27 UTC
2016-07-13 14:55:09 UTC LOG:  creating missing WAL directory "pg_xlog/archive_status"
cp: cannot stat `/db/data/postgresql/9.2/pg_archive/00000002.history': No such file or directory
2016-07-13 14:55:09 UTC LOG:  entering standby mode
2016-07-13 14:55:09 UTC LOG:  restored log file "000000010000001E000000BA" from archive
2016-07-13 14:55:09 UTC FATAL:  the database system is starting up
2016-07-13 14:55:09 UTC LOG:  redo starts at 1E/BA000020
2016-07-13 14:55:09 UTC LOG:  consistent recovery state reached at 1E/BA05FED8
2016-07-13 14:55:09 UTC LOG:  database system is ready to accept read only connections
cp: cannot stat `/db/data/postgresql/9.2/pg_archive/000000010000001E000000BB': No such file or directory
cp: cannot stat `/db/data/postgresql/9.2/pg_archive/00000002.history': No such file or directory
2016-07-13 14:55:09 UTC LOG:  streaming replication successfully connected to primary


#### Node1
postgres=# select application_name,upper(state),upper(sync_state) from pg_stat_replication;
+------------------+-----------+-------+
| application_name |   upper   | upper |
+------------------+-----------+-------+
| test-node2       | STREAMING | ASYNC |
+------------------+-----------+-------+
(1 row)


