Thread: Postgres PAF setup
I am having issues with my PAF setup. I am new to Postgres and have set up the cluster as seen below.
I am getting this error when trying to start my cluster resources.
Master/Slave Set: pgsql-ha [pgsqld]
pgsqld (ocf::heartbeat:pgsqlms): FAILED dcmilphlum224 (unmanaged)
pgsqld (ocf::heartbeat:pgsqlms): FAILED dcmilphlum223 (unmanaged)
pgsql-master-ip (ocf::heartbeat:IPaddr2): Started dcmilphlum223
Failed Actions:
* pgsqld_stop_0 on dcmilphlum224 'unknown error' (1): call=239, status=complete, exitreason='Unexpected state for instance "pgsqld" (returned 1)',
last-rc-change='Mon Apr 23 13:11:17 2018', queued=0ms, exec=95ms
* pgsqld_stop_0 on dcmilphlum223 'unknown error' (1): call=248, status=complete, exitreason='Unexpected state for instance "pgsqld" (returned 1)',
last-rc-change='Mon Apr 23 13:11:17 2018', queued=0ms, exec=89ms
cleanup and clear are not fixing any issues and I am not seeing anything in the logs. Any help would be greatly appreciated.
My cluster config
root@dcmilphlum223:/usr/lib/ocf/resource.d/heartbeat# crm config
crm(live)configure# show
node 1: dcmilphlum223
node 2: dcmilphlum224 \
attributes pgsqld-data-status=LATEST
primitive pgsql-master-ip IPaddr2 \
params ip=10.125.75.188 cidr_netmask=23 nic=bond0.283 \
op monitor interval=10s \
meta target-role=Started
primitive pgsqld pgsqlms \
params pgdata="/pgsql/data/pg7000" bindir="/usr/local/pgsql/bin" pgport=7000 start_opts="-c config_file=/pgsql/data/pg7000/postgresql.conf" recovery_template="/pgsql/data/pg7000/recovery.conf.pcmk" \
op start interval=0 timeout=60s \
op stop interval=0 timeout=60s \
op promote interval=0 timeout=30s \
op demote interval=0 timeout=120s \
op monitor enabled=true interval=15s role=Master timeout=10s \
op monitor enabled=true interval=16s role=Slave timeout=10s \
op notify interval=0 timeout=60s \
meta
ms pgsql-ha pgsqld \
meta notify=true target-role=Stopped
property cib-bootstrap-options: \
have-watchdog=false \
dc-version=1.1.14-70404b0 \
cluster-infrastructure=corosync \
cluster-name=pgsql_cluster \
stonith-enabled=false \
no-quorum-policy=ignore \
migration-threshold=1 \
last-lrm-refresh=1524503476
rsc_defaults rsc_defaults-options: \
migration-threshold=5 \
resource-stickiness=10
crm(live)configure#
My pcs Config
Corosync Nodes:
dcmilphlum223 dcmilphlum224
Pacemaker Nodes:
dcmilphlum223 dcmilphlum224
Resources:
Master: pgsql-ha
Meta Attrs: notify=true target-role=Stopped
Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
Attributes: pgdata=/pgsql/data/pg7000 bindir=/usr/local/pgsql/bin pgport=7000 start_opts="-c config_file=/pgsql/data/pg7000/postgresql.conf" recovery_template=/pgsql/data/pg7000/recovery.conf.pcmk
Operations: start interval=0 timeout=60s (pgsqld-start-0)
stop interval=0 timeout=60s (pgsqld-stop-0)
promote interval=0 timeout=30s (pgsqld-promote-0)
demote interval=0 timeout=120s (pgsqld-demote-0)
monitor role=Master timeout=10s interval=15s enabled=true (pgsqld-monitor-interval-15s)
monitor role=Slave timeout=10s interval=16s enabled=true (pgsqld-monitor-interval-16s)
notify interval=0 timeout=60s (pgsqld-notify-0)
Resource: pgsql-master-ip (class=ocf provider=heartbeat type=IPaddr2)
Attributes: ip=10.125.75.188 cidr_netmask=23 nic=bond0.283
Meta Attrs: target-role=Started
Operations: monitor interval=10s (pgsql-master-ip-monitor-10s)
Stonith Devices:
Fencing Levels:
Location Constraints:
Ordering Constraints:
Colocation Constraints:
Resources Defaults:
migration-threshold: 5
resource-stickiness: 10
Operations Defaults:
No defaults set
Cluster Properties:
cluster-infrastructure: corosync
cluster-name: pgsql_cluster
dc-version: 1.1.14-70404b0
have-watchdog: false
last-lrm-refresh: 1524503476
migration-threshold: 1
no-quorum-policy: ignore
stonith-enabled: false
Node Attributes:
dcmilphlum224: pgsqld-data-status=LATEST
Andrew A Edenburn
General Motors
Hyperscale Computing & Core Engineering
Mobile Phone: +01-810-410-6008
30009 Van Dyke Ave
Warren, MI. 48090-9026
Cube: 2w05-21
andrew.edenburn@gm.com
Web Connect SoftPhone 586-986-4864
On 04/23/2018 08:09 PM, Andrew Edenburn wrote:
> I am having issues with my PAF setup. I am new to Postgres and have set up
> the cluster as seen below.
>
> I am getting this error when trying to start my cluster resources.
> [...]
> cleanup and clear are not fixing any issues and I am not seeing anything in
> the logs. Any help would be greatly appreciated.

Hello Andrew,

Could you enable debug logs in Pacemaker? On CentOS you have to edit the
PCMK_debug variable in /etc/sysconfig/pacemaker:

  PCMK_debug=crmd,pengine,lrmd

This should give you more information in the logs. The monitor action in PAF
should report why the cluster doesn't start:
https://github.com/ClusterLabs/PAF/blob/master/script/pgsqlms#L1525

Regards,

--
Adrien NAYRAT
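(For reference, a minimal sketch of enabling the debug logging suggested
above on a systemd-based node. The sysconfig path is the CentOS/RHEL one from
the reply; the restart step and the log file location are assumptions and may
differ on your distribution or corosync.conf logging setup.)

  # /etc/sysconfig/pacemaker -- verbose logging for the listed daemons
  PCMK_debug=crmd,pengine,lrmd

  # pick up the new setting (run on each node, one at a time)
  systemctl restart pacemaker

  # then watch the output while reproducing the failure
  # (path is an assumption; check the "logfile" directive in corosync.conf)
  tail -f /var/log/cluster/corosync.log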
On Mon, 23 Apr 2018 18:09:43 +0000 Andrew Edenburn <andrew.edenburn@gm.com> wrote:
> I am having issues with my PAF setup. I am new to Postgres and have set up
> the cluster as seen below. I am getting this error when trying to start my
> cluster resources.
> [...]
> cleanup and clear are not fixing any issues and I am not seeing anything in
> the logs. Any help would be greatly appreciated.

This lacks a lot of information.

According to the PAF resource agent, your instances are in an "unexpected
state" on both nodes while PAF was actually trying to stop them. Pacemaker
might decide to stop a resource if the start operation fails. Stopping it
when the start failed gives the resource agent a chance to stop the resource
gracefully if that is still possible.

I suspect you have some setup mistake on both nodes, maybe the exact same
one... You should probably provide your full logs from Pacemaker/Corosync
with timing information so we can check all the messages coming from PAF from
the very beginning of the startup attempt.

> have-watchdog=false \

You should probably consider setting up a watchdog in your cluster.

> stonith-enabled=false \

This is really bad. Your cluster will NOT work as expected. PAF **requires**
STONITH to be enabled and working properly. Without it, sooner or later you
will experience some unexpected reaction from the cluster (freezing all
actions, etc).

> no-quorum-policy=ignore \

You should not ignore quorum, even in a two-node cluster. See the "two_node"
parameter in the corosync.conf manual.

> migration-threshold=1 \
> rsc_defaults rsc_defaults-options: \
> migration-threshold=5 \

The latter is the supported way to set migration-threshold. Your
"migration-threshold=1" should not be a cluster property but a default
resource option.

> My pcs Config
> Corosync Nodes:
> dcmilphlum223 dcmilphlum224
> Pacemaker Nodes:
> dcmilphlum223 dcmilphlum224
>
> Resources:
> Master: pgsql-ha
> Meta Attrs: notify=true target-role=Stopped

This target-role might have been set by the cluster because it can not fence
nodes (which might be easier to deal with in your situation, btw). It means
the cluster will keep this resource down because of previous errors.

> recovery_template=/pgsql/data/pg7000/recovery.conf.pcmk

You should probably not put your recovery.conf.pcmk inside your PGDATA. The
file is different on each node, and since you might have to rebuild the
standby or the old master after some failures, you would have to correct it
each time. Keep it outside of the PGDATA to avoid this useless step.

> dcmilphlum224: pgsqld-data-status=LATEST

I suppose this comes from the "pgsql" resource agent, definitely not from
PAF...

Regards,
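(For reference, rough sketches of two of the changes suggested above,
assuming a corosync 2.x votequorum stack and the pcs version shown later in
this thread; the exact syntax may differ on other releases.)

  # corosync.conf -- let votequorum handle the two-node special case
  # instead of no-quorum-policy=ignore
  quorum {
      provider: corosync_votequorum
      two_node: 1
  }

  # move migration-threshold from a cluster property to a resource default
  # (STONITH still needs a real fence agent configured before setting
  # stonith-enabled=true)
  pcs property unset migration-threshold
  pcs resource defaults migration-threshold=5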
I have met a similar issue when Postgres was not stopped normally.

You could run pg_controldata to check whether your Postgres cluster state is
"shut down" / "shut down in recovery".

I changed /usr/lib/ocf/resource.d/heartbeat/pgsqlms to avoid this problem:

    elsif ( $pgisready_rc == 2 ) {
        # The instance is not listening.
        # We check the process status using pg_ctl status and check
        # if it was propertly shut down using pg_controldata.
        ocf_log( 'debug', 'pgsql_monitor: instance "%s" is not listening',
            $OCF_RESOURCE_INSTANCE );
        # return _confirm_stopped(); # remove this line
        return $OCF_NOT_RUNNING;
    }
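(For reference, a minimal sketch of the pg_controldata check mentioned above,
using the bindir and PGDATA from the original poster's configuration; the
output line is illustrative. The state you want to see before a clean start
is "shut down" or "shut down in recovery".)

  $ /usr/local/pgsql/bin/pg_controldata /pgsql/data/pg7000 | grep 'cluster state'
  Database cluster state:               shut down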
On 04/25/2018 02:31 AM, 范国腾 wrote:
> I have met a similar issue when Postgres was not stopped normally.
>
> You could run pg_controldata to check whether your Postgres cluster state is
> "shut down" / "shut down in recovery".
>
> I changed /usr/lib/ocf/resource.d/heartbeat/pgsqlms to avoid this problem:
> [...]
>         # return _confirm_stopped(); # remove this line
>         return $OCF_NOT_RUNNING;

Hello,

It is a bad idea. The goal of _confirm_stopped is to check whether the
instance was properly stopped. If it wasn't, you could corrupt your instance.

_confirm_stopped returns $OCF_NOT_RUNNING only if the instance was properly
shut down:

    elsif ( $controldata_rc == $OCF_NOT_RUNNING ) {
        # The controldata state is consistent, the instance was probably
        # propertly shut down.
        ocf_log( 'debug',
            '_confirm_stopped: instance "%s" controldata indicates that the instance was propertly shut down',
            $OCF_RESOURCE_INSTANCE );
        return $OCF_NOT_RUNNING;
    }

Regards,

--
Adrien NAYRAT
Adrien,

Is there any way to make the cluster recover if Postgres was not properly
stopped, such as after a lab power-off or an OS reboot?

Thanks
You should definitely not patch the PAF source code without opening an issue
on GitHub and discussing your changes. As Adrien explained, your changes
could very well end up causing instance corruption or data loss.

On Wed, 25 Apr 2018 07:45:55 +0000 范国腾 <fanguoteng@highgo.com> wrote:
...
> Is there any way to make the cluster recover if Postgres was not properly
> stopped, such as after a lab power-off or an OS reboot?

An OS graceful reboot is supposed to be a clean shutdown; you should not have
trouble with PostgreSQL there. For real failure scenarios, you must:

* put your cluster in maintenance mode
* fix your PostgreSQL setup and replication
* make sure PostgreSQL is replicating correctly
* make sure to keep the master where the cluster expects it, if needed
* make sure to stop your PostgreSQL instances if the cluster considers them
  stopped
* switch off maintenance mode
* start your resource again if the cluster kept it stopped

You can find documentation here and in the other pages around it:
https://clusterlabs.github.io/PAF/administration.html

--
Jehan-Guillaume de Rorthais
Dalibo
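(For reference, a rough sketch of what that recovery procedure could look
like with the pcs tool used elsewhere in this thread. It assumes
dcmilphlum223 holds the latest data and dcmilphlum224 is the standby to
rebuild; the port, paths, and resource names come from the original
configuration, everything else is an assumption to adapt to your situation.)

  # 1. stop cluster management of the resources
  pcs property set maintenance-mode=true

  # 2. on the failed standby (dcmilphlum224): rebuild from the current master
  #    (wipe or move aside the old data directory first)
  /usr/local/pgsql/bin/pg_basebackup -h dcmilphlum223 -p 7000 \
      -D /pgsql/data/pg7000 -X stream -P
  #    then put the recovery.conf expected by PAF in place and start the
  #    instance manually to confirm it follows the master

  # 3. on the master: confirm the standby is streaming
  /usr/local/pgsql/bin/psql -p 7000 -c 'SELECT * FROM pg_stat_replication;'

  # 4. hand control back to the cluster and clear the old failures
  pcs property set maintenance-mode=false
  pcs resource cleanup pgsqld

  # 5. if the ms resource was left with target-role=Stopped, start it again
  pcs resource enable pgsql-ha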
Sorry for the delay. Here is my corosync.log file. I have tried making the
changes that you requested but still no good. I know that when I configured
the cluster to use pgsqld instead of pgsqlms I could at least get the cluster
to start, but it was starting the cluster as a master on both nodes.

Thanks for your help...

Andrew A Edenburn
General Motors

-----Original Message-----
From: Jehan-Guillaume (ioguix) de Rorthais [mailto:ioguix@free.fr]
Sent: Tuesday, April 24, 2018 11:09 AM
To: Andrew Edenburn <andrew.edenburn@gm.com>
Cc: pgsql-general@postgresql.org; users@clusterlabs.org
Subject: [EXTERNAL] Re: Postgres PAF setup

> [...]