Thread: Postgres PAF setup

Postgres PAF setup

From
Andrew Edenburn
Date:

I am having issues with my PAF setup.  I am new to Postgres and have set up the cluster as seen below.

I am getting this error when trying to start my cluster resources.

 

Master/Slave Set: pgsql-ha [pgsqld]
     pgsqld     (ocf::heartbeat:pgsqlms):       FAILED dcmilphlum224 (unmanaged)
     pgsqld     (ocf::heartbeat:pgsqlms):       FAILED dcmilphlum223 (unmanaged)
pgsql-master-ip        (ocf::heartbeat:IPaddr2):       Started dcmilphlum223

Failed Actions:
* pgsqld_stop_0 on dcmilphlum224 'unknown error' (1): call=239, status=complete, exitreason='Unexpected state for instance "pgsqld" (returned 1)',
    last-rc-change='Mon Apr 23 13:11:17 2018', queued=0ms, exec=95ms
* pgsqld_stop_0 on dcmilphlum223 'unknown error' (1): call=248, status=complete, exitreason='Unexpected state for instance "pgsqld" (returned 1)',
    last-rc-change='Mon Apr 23 13:11:17 2018', queued=0ms, exec=89ms

 

Cleanup and clear are not fixing any issues and I am not seeing anything in the logs.  Any help would be greatly appreciated.

My cluster config:

root@dcmilphlum223:/usr/lib/ocf/resource.d/heartbeat# crm config
crm(live)configure# show
node 1: dcmilphlum223
node 2: dcmilphlum224 \
        attributes pgsqld-data-status=LATEST
primitive pgsql-master-ip IPaddr2 \
        params ip=10.125.75.188 cidr_netmask=23 nic=bond0.283 \
        op monitor interval=10s \
        meta target-role=Started
primitive pgsqld pgsqlms \
        params pgdata="/pgsql/data/pg7000" bindir="/usr/local/pgsql/bin" pgport=7000 start_opts="-c config_file=/pgsql/data/pg7000/postgresql.conf" recovery_template="/pgsql/data/pg7000/recovery.conf.pcmk" \
        op start interval=0 timeout=60s \
        op stop interval=0 timeout=60s \
        op promote interval=0 timeout=30s \
        op demote interval=0 timeout=120s \
        op monitor enabled=true interval=15s role=Master timeout=10s \
        op monitor enabled=true interval=16s role=Slave timeout=10s \
        op notify interval=0 timeout=60s \
        meta
ms pgsql-ha pgsqld \
        meta notify=true target-role=Stopped
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.14-70404b0 \
        cluster-infrastructure=corosync \
        cluster-name=pgsql_cluster \
        stonith-enabled=false \
        no-quorum-policy=ignore \
        migration-threshold=1 \
        last-lrm-refresh=1524503476
rsc_defaults rsc_defaults-options: \
        migration-threshold=5 \
        resource-stickiness=10
crm(live)configure#

My pcs Config:

Corosync Nodes:
 dcmilphlum223 dcmilphlum224
Pacemaker Nodes:
 dcmilphlum223 dcmilphlum224

Resources:
 Master: pgsql-ha
  Meta Attrs: notify=true target-role=Stopped
  Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
   Attributes: pgdata=/pgsql/data/pg7000 bindir=/usr/local/pgsql/bin pgport=7000 start_opts="-c config_file=/pgsql/data/pg7000/postgresql.conf" recovery_template=/pgsql/data/pg7000/recovery.conf.pcmk
   Operations: start interval=0 timeout=60s (pgsqld-start-0)
               stop interval=0 timeout=60s (pgsqld-stop-0)
               promote interval=0 timeout=30s (pgsqld-promote-0)
               demote interval=0 timeout=120s (pgsqld-demote-0)
               monitor role=Master timeout=10s interval=15s enabled=true (pgsqld-monitor-interval-15s)
               monitor role=Slave timeout=10s interval=16s enabled=true (pgsqld-monitor-interval-16s)
               notify interval=0 timeout=60s (pgsqld-notify-0)
 Resource: pgsql-master-ip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=10.125.75.188 cidr_netmask=23 nic=bond0.283
  Meta Attrs: target-role=Started
  Operations: monitor interval=10s (pgsql-master-ip-monitor-10s)

Stonith Devices:
Fencing Levels:

Location Constraints:
Ordering Constraints:
Colocation Constraints:

Resources Defaults:
 migration-threshold: 5
 resource-stickiness: 10
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: pgsql_cluster
 dc-version: 1.1.14-70404b0
 have-watchdog: false
 last-lrm-refresh: 1524503476
 migration-threshold: 1
 no-quorum-policy: ignore
 stonith-enabled: false
Node Attributes:
 dcmilphlum224: pgsqld-data-status=LATEST

 

 

Andrew A Edenburn
General Motors
Hyperscale Computing & Core Engineering
Mobile Phone: +01-810-410-6008
30009 Van Dyke Ave
Warren, MI 48090-9026
Cube: 2w05-21
mailto:andrew.edenburn@gm.com
Web Connect SoftPhone 586-986-4864

 



Nothing in this message is intended to constitute an electronic signature unless a specific statement to the contrary is included in this message.

Confidentiality Note: This message is intended only for the person or entity to which it is addressed. It may contain confidential and/or privileged material. Any review, transmission, dissemination or other use, or taking of any action in reliance upon this message by persons or entities other than the intended recipient is prohibited and may be unlawful. If you received this message in error, please contact the sender and delete it from your computer.

Re: Postgres PAF setup

From
Adrien Nayrat
Date:
On 04/23/2018 08:09 PM, Andrew Edenburn wrote:
> I am having issues with my PAF setup.  I am new to Postgres and have setup the
> cluster as seen below. 
>
> I am getting this error when trying to start my cluster resources.
>
>  
>
> Master/Slave Set: pgsql-ha [pgsqld]
>
>      pgsqld     (ocf::heartbeat:pgsqlms):       FAILED dcmilphlum224 (unmanaged)
>
>      pgsqld     (ocf::heartbeat:pgsqlms):       FAILED dcmilphlum223 (unmanaged)
>
> pgsql-master-ip        (ocf::heartbeat:IPaddr2):       Started dcmilphlum223
>
>  
>
> Failed Actions:
>
> * pgsqld_stop_0 on dcmilphlum224 'unknown error' (1): call=239, status=complete,
> exitreason='Unexpected state for instance "pgsqld" (returned 1)',
>
>     last-rc-change='Mon Apr 23 13:11:17 2018', queued=0ms, exec=95ms
>
> * pgsqld_stop_0 on dcmilphlum223 'unknown error' (1): call=248, status=complete,
> exitreason='Unexpected state for instance "pgsqld" (returned 1)',
>
>     last-rc-change='Mon Apr 23 13:11:17 2018', queued=0ms, exec=89ms
>
>  
>
> cleanup and clear is not fixing any issues and I am not seeing anything in the
> logs.  Any help would be greatly appreciated.
>
>  

Hello Andrew,

Could you enable debug logs in Pacemaker?

With CentOS you have to edit the PCMK_debug variable in /etc/sysconfig/pacemaker:

PCMK_debug=crmd,pengine,lrmd

This should give you more information in the logs. The monitor action in PAF should
report why the cluster doesn't start:
https://github.com/ClusterLabs/PAF/blob/master/script/pgsqlms#L1525
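As a minimal sketch of that edit (shown here against a scratch copy rather than the real file, assuming the CentOS sysconfig layout):

```shell
# Work on a scratch copy of /etc/sysconfig/pacemaker to illustrate the change.
f=$(mktemp)
printf '# PCMK_debug=no\n' > "$f"

# Uncomment/override PCMK_debug to enable debug output for these daemons.
sed -i 's|^#* *PCMK_debug=.*|PCMK_debug=crmd,pengine,lrmd|' "$f"
cat "$f"

# On a real node you would then restart Pacemaker, e.g.:
#   systemctl restart pacemaker
```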

Regards,

--
Adrien NAYRAT



Re: Postgres PAF setup

From
"Jehan-Guillaume (ioguix) de Rorthais"
Date:
On Mon, 23 Apr 2018 18:09:43 +0000
Andrew Edenburn <andrew.edenburn@gm.com> wrote:

> I am having issues with my PAF setup.  I am new to Postgres and have setup
> the cluster as seen below. I am getting this error when trying to start my
> cluster resources.
> [...]
> 
> cleanup and clear is not fixing any issues and I am not seeing anything in
> the logs.  Any help would be greatly appreciated.

This lacks a lot of information.

According to the PAF resource agent, your instances are in an "unexpected
state" on both nodes while PAF was actually trying to stop them.

Pacemaker might decide to stop a resource if the start operation fails.
Stopping it when the start failed gives the resource agent a chance to
stop the resource gracefully if still possible.

I suspect you have some setup mistake on both nodes, maybe the exact same one...

You should probably provide your full logs from pacemaker/corosync with timing
information so we can check all the messages coming from PAF from the very
beginning of the startup attempt.


>         have-watchdog=false \

You should probably consider setting up a watchdog in your cluster.

>         stonith-enabled=false \

This is really bad. Your cluster will NOT work as expected. PAF **requires**
STONITH to be enabled and properly working. Without it, sooner or later, you
will experience unexpected reactions from the cluster (freezing all
actions, etc.).
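As a sketch of what enabling fencing could look like with IPMI-based fence devices (device names, BMC addresses and credentials below are hypothetical, and parameter names vary between fence-agent versions):

```
pcs stonith create fence223 fence_ipmilan pcmk_host_list=dcmilphlum223 \
    ipaddr=<bmc-ip-of-223> login=<user> passwd=<secret>
pcs stonith create fence224 fence_ipmilan pcmk_host_list=dcmilphlum224 \
    ipaddr=<bmc-ip-of-224> login=<user> passwd=<secret>
# keep each fence device off the node it fences
pcs constraint location fence223 avoids dcmilphlum223
pcs constraint location fence224 avoids dcmilphlum224
pcs property set stonith-enabled=true
```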

>         no-quorum-policy=ignore \

You should not ignore quorum, even in a two-node cluster. See the "two_node"
parameter in the corosync.conf manual.
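For reference, a minimal sketch of the quorum section of corosync.conf for a two-node cluster (illustrative only; adapt to your existing file):

```
quorum {
    provider: corosync_votequorum
    two_node: 1
    # two_node implies wait_for_all: after a cold start, both nodes
    # must be seen once before the cluster starts operating.
}
```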

>         migration-threshold=1 \
> rsc_defaults rsc_defaults-options: \
>         migration-threshold=5 \

The latter is the supported way to set migration-threshold. Your
"migration-threshold=1" should not be a cluster property but a default
resource option.
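A sketch of how this could be corrected with pcs (since a migration-threshold resource default of 5 already exists in this config, only the misplaced cluster property needs to go):

```
# remove the misplaced cluster property
pcs property unset migration-threshold

# if it did not already exist as a resource default, it could be set with:
pcs resource defaults migration-threshold=5
```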

> My pcs Config
> Corosync Nodes:
> dcmilphlum223 dcmilphlum224
> Pacemaker Nodes:
> dcmilphlum223 dcmilphlum224
> 
> Resources:
> Master: pgsql-ha
>   Meta Attrs: notify=true target-role=Stopped

This target-role might have been set by the cluster because it cannot fence
nodes (which might be easier to deal with in your situation, btw). That means
the cluster will keep this resource down because of previous errors.

> recovery_template=/pgsql/data/pg7000/recovery.conf.pcmk

You should probably not put your recovery.conf.pcmk inside your PGDATA. This
file differs between nodes, and as you might want to rebuild the standby or
the old master after some failures, you would have to correct it each time.
Keep it outside the PGDATA to avoid this useless step.
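For example (the target directory here is hypothetical), the template could be moved out of the data directory and the resource updated to match:

```
# on each node, move the template outside the PGDATA
mv /pgsql/data/pg7000/recovery.conf.pcmk /etc/pgsql/recovery.conf.pcmk

# point the resource at the new location
pcs resource update pgsqld recovery_template=/etc/pgsql/recovery.conf.pcmk
```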

> dcmilphlum224: pgsqld-data-status=LATEST

I suppose this comes from the "pgsql" resource agent, definitely not from PAF...

Regards,


Reply: [ClusterLabs] Postgres PAF setup

From
范国腾
Date:
I have met a similar issue when Postgres was not stopped normally.

You can run pg_controldata to check whether your Postgres cluster state is "shut down" or "shut down in recovery".
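A small sketch of that check (the sample line mimics pg_controldata output; on a real node you would feed in the actual command output instead):

```shell
# On a real node:  /usr/local/pgsql/bin/pg_controldata /pgsql/data/pg7000
sample='Database cluster state:               shut down'

# Extract the state field and decide whether the shutdown was clean.
state=$(printf '%s\n' "$sample" | sed -n 's/^Database cluster state: *//p')
case "$state" in
    'shut down'|'shut down in recovery') echo clean ;;
    *) echo unclean ;;
esac
```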

I changed /usr/lib/ocf/resource.d/heartbeat/pgsqlms to avoid this problem:

    elsif ( $pgisready_rc == 2 ) {
        # The instance is not listening.
        # We check the process status using pg_ctl status and check
        # if it was propertly shut down using pg_controldata.
        ocf_log( 'debug', 'pgsql_monitor: instance "%s" is not listening',
            $OCF_RESOURCE_INSTANCE );
        # return _confirm_stopped();       # remove this line
        return $OCF_NOT_RUNNING;
    }


-----Original Message-----
From: Users [mailto:users-bounces@clusterlabs.org] On Behalf Of Adrien Nayrat
Sent: April 24, 2018 16:16
To: Andrew Edenburn <andrew.edenburn@gm.com>; pgsql-general@postgresql.org; users@clusterlabs.org
Subject: Re: [ClusterLabs] Postgres PAF setup

On 04/23/2018 08:09 PM, Andrew Edenburn wrote:
> I am having issues with my PAF setup.  I am new to Postgres and have 
> setup the cluster as seen below.
> 
> I am getting this error when trying to start my cluster resources.
> 
>  
> 
> Master/Slave Set: pgsql-ha [pgsqld]
> 
>      pgsqld     (ocf::heartbeat:pgsqlms):       FAILED dcmilphlum224 
> (unmanaged)
> 
>      pgsqld     (ocf::heartbeat:pgsqlms):       FAILED dcmilphlum223 
> (unmanaged)
> 
> pgsql-master-ip        (ocf::heartbeat:IPaddr2):       Started 
> dcmilphlum223
> 
>  
> 
> Failed Actions:
> 
> * pgsqld_stop_0 on dcmilphlum224 'unknown error' (1): call=239, 
> status=complete, exitreason='Unexpected state for instance "pgsqld" 
> (returned 1)',
> 
>     last-rc-change='Mon Apr 23 13:11:17 2018', queued=0ms, exec=95ms
> 
> * pgsqld_stop_0 on dcmilphlum223 'unknown error' (1): call=248, 
> status=complete, exitreason='Unexpected state for instance "pgsqld" 
> (returned 1)',
> 
>     last-rc-change='Mon Apr 23 13:11:17 2018', queued=0ms, exec=89ms
> 
>  
> 
> cleanup and clear is not fixing any issues and I am not seeing 
> anything in the logs.  Any help would be greatly appreciated.
> 
>  

Hello Andrew,

Could you enable debug logs in Pacemaker?

With Centos you have to edit PCMK_debug variable in /etc/sysconfig/pacemaker :

PCMK_debug=crmd,pengine,lrmd

This should give you more information in logs. Monitor action in PAF should report why the cluster doesn't start :
https://github.com/ClusterLabs/PAF/blob/master/script/pgsqlms#L1525

Regards,

--
Adrien NAYRAT


Re: [ClusterLabs] Reply: Postgres PAF setup

From
Adrien Nayrat
Date:
On 04/25/2018 02:31 AM, 范国腾 wrote:
> I have meet the similar issue when the postgres is not stopped normally.
>
> You could run pg_controldata to check if your postgres status is shutdown/shutdown in recovery.
>
> I change the /usr/lib/ocf/resource.d/heartbeat/pgsqlms to avoid this problem:
>
> elsif ( $pgisready_rc == 2 ) {
> # The instance is not listening.
> # We check the process status using pg_ctl status and check
> # if it was propertly shut down using pg_controldata.
> ocf_log( 'debug', 'pgsql_monitor: instance "%s" is not listening',
> $OCF_RESOURCE_INSTANCE );
> # return _confirm_stopped();       # remove this line
> return $OCF_NOT_RUNNING;
> }

Hello,

It is a bad idea. The goal of _confirm_stopped is to check whether the instance
was properly stopped. If it wasn't, you could corrupt your instance.

_confirm_stopped returns $OCF_NOT_RUNNING only if the instance was properly
shut down:

    elsif ( $controldata_rc == $OCF_NOT_RUNNING ) {
        # The controldata state is consistent, the instance was probably
        # propertly shut down.
        ocf_log( 'debug',
            '_confirm_stopped: instance "%s" controldata indicates that the instance was propertly shut down',
            $OCF_RESOURCE_INSTANCE );
        return $OCF_NOT_RUNNING;
    }

Regards,


--
Adrien NAYRAT




Reply: [ClusterLabs] Reply: Postgres PAF setup

From
范国腾
Date:
Adrien,

Is there any way to make the cluster recover if Postgres was not properly stopped, such as after a lab power-off or an
OS reboot?
 

Thanks

-----Original Message-----
From: Adrien Nayrat [mailto:adrien.nayrat@anayrat.info]
Sent: April 25, 2018 15:29
To: Cluster Labs - All topics related to open-source clustering welcomed <users@clusterlabs.org>; 范国腾
<fanguoteng@highgo.com>; Andrew Edenburn <andrew.edenburn@gm.com>; pgsql-general@postgresql.org
Subject: Re: [ClusterLabs] Reply: Postgres PAF setup

On 04/25/2018 02:31 AM, 范国腾 wrote:
> I have meet the similar issue when the postgres is not stopped normally. 
>  
> You could run pg_controldata to check if your postgres status is shutdown/shutdown in recovery.
> 
> I change the /usr/lib/ocf/resource.d/heartbeat/pgsqlms to avoid this problem:
> 
> elsif ( $pgisready_rc == 2 ) {
> # The instance is not listening.
> # We check the process status using pg_ctl status and check
> # if it was propertly shut down using pg_controldata.
> ocf_log( 'debug', 'pgsql_monitor: instance "%s" is not listening',
> $OCF_RESOURCE_INSTANCE );
> # return _confirm_stopped();       # remove this line
> return $OCF_NOT_RUNNING;
> }

Hello,

It is a bad idea. The goal of _confirm_stopped is to check if the instance was properly stopped. If it wasn't you could
corrupt your instance.
 

_confirm_stopped returns $OCF_NOT_RUNNING only if the instance was properly shut down:

    elsif ( $controldata_rc == $OCF_NOT_RUNNING ) {
        # The controldata state is consistent, the instance was probably
        # propertly shut down.
        ocf_log( 'debug',
            '_confirm_stopped: instance "%s" controldata indicates that the instance was propertly shut down',
            $OCF_RESOURCE_INSTANCE );
        return $OCF_NOT_RUNNING;
    }

Regards,


--
Adrien NAYRAT



Re: [ClusterLabs] Reply: Reply: Postgres PAF setup

From
Jehan-Guillaume de Rorthais
Date:
You should definitely not patch the PAF source code without opening an issue on
GitHub and discussing your changes. As Adrien explained, your changes could
easily end up causing instance corruption or data loss.

On Wed, 25 Apr 2018 07:45:55 +0000
范国腾 <fanguoteng@highgo.com> wrote:
...
> Is there any way to make the cluster recover if the postgres was not properly
> stopped, such as the lab power off or the OS reboot?

A graceful OS reboot is supposed to be a clean shutdown; you should not have
trouble with PostgreSQL.

For real failure scenarios, you must:
* put your cluster in maintenance mode
* fix your PostgreSQL setup and replication
* make sure PostgreSQL is replicating correctly
* make sure the master stays on the node where the cluster expects it, if needed
* make sure to stop your PostgreSQL instances if the cluster considers them stopped
* switch off maintenance mode
* you might need to start your resource if the cluster kept it stopped

You can find documentation here and in other pages around:
https://clusterlabs.github.io/PAF/administration.html
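A sketch of that sequence with pcs (command forms are illustrative and depend on your pcs version; the manual repair in the middle is the important part):

```
# take the cluster's hands off all resources
pcs property set maintenance-mode=true

# ...repair PostgreSQL and replication by hand: check pg_controldata,
# rebuild the standby if needed, verify streaming replication...

# hand control back and clear old failures
pcs property set maintenance-mode=false
pcs resource cleanup pgsql-ha

# if the cluster kept the resource stopped:
pcs resource enable pgsql-ha
```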


> -----Original Message-----
> From: Adrien Nayrat [mailto:adrien.nayrat@anayrat.info]
> Sent: April 25, 2018 15:29
> To: Cluster Labs - All topics related to open-source clustering welcomed
> <users@clusterlabs.org>; 范国腾 <fanguoteng@highgo.com>; Andrew Edenburn
> <andrew.edenburn@gm.com>; pgsql-general@postgresql.org
> Subject: Re: [ClusterLabs] Reply: Postgres PAF setup
>
> On 04/25/2018 02:31 AM, 范国腾 wrote:
> > I have meet the similar issue when the postgres is not stopped normally.
> >
> > You could run pg_controldata to check if your postgres status is
> > shutdown/shutdown in recovery.
> >
> > I change the /usr/lib/ocf/resource.d/heartbeat/pgsqlms to avoid this
> > problem:
> >
> > elsif ( $pgisready_rc == 2 ) {
> > # The instance is not listening.
> > # We check the process status using pg_ctl status and check
> > # if it was propertly shut down using pg_controldata.
> > ocf_log( 'debug', 'pgsql_monitor: instance "%s" is not listening',
> > $OCF_RESOURCE_INSTANCE );
> > # return _confirm_stopped();       # remove this line
> > return $OCF_NOT_RUNNING;
> > }
>
> Hello,
>
> It is a bad idea. The goal of _confirm_stopped is to check if the instance
> was properly stopped. If it wasn't you could corrupt your instance.
>
> _confirm_stopped returns $OCF_NOT_RUNNING only if the instance was properly
> shut down:
>     elsif ( $controldata_rc == $OCF_NOT_RUNNING ) {
>         # The controldata state is consistent, the instance was probably
>         # propertly shut down.
>         ocf_log( 'debug',
>             '_confirm_stopped: instance "%s" controldata indicates that the instance was propertly shut down',
>             $OCF_RESOURCE_INSTANCE );
>         return $OCF_NOT_RUNNING;
>     }
>
> Regards,
>
>
> --
> Adrien NAYRAT
>
>



--
Jehan-Guillaume de Rorthais
Dalibo


RE: [EXTERNAL] Re: Postgres PAF setup

From
Andrew Edenburn
Date:
Sorry for the delay.  Here is my corosync.log file.
I have tried making the changes that you requested but still no good.  I know when I configured the cluster to use
pgsqld instead of pgsqlms I could at least get the cluster to start, but it was starting the cluster as a master on
both nodes.

Thanks for your help...

Andrew A Edenburn
General Motors
Hyperscale Computing & Core Engineering
Mobile Phone: +01-810-410-6008
30009 Van Dyke Ave
Warren, MI. 48090-9026
Cube: 2w05-21
mailto:andrew.edenburn@gm.com
Web Connect SoftPhone 586-986-4864


-----Original Message-----
From: Jehan-Guillaume (ioguix) de Rorthais [mailto:ioguix@free.fr]
Sent: Tuesday, April 24, 2018 11:09 AM
To: Andrew Edenburn <andrew.edenburn@gm.com>
Cc: pgsql-general@postgresql.org; users@clusterlabs.org
Subject: [EXTERNAL] Re: Postgres PAF setup

On Mon, 23 Apr 2018 18:09:43 +0000
Andrew Edenburn <andrew.edenburn@gm.com> wrote:

> I am having issues with my PAF setup.  I am new to Postgres and have
> setup the cluster as seen below. I am getting this error when trying
> to start my cluster resources.
> [...]
>
> cleanup and clear is not fixing any issues and I am not seeing
> anything in the logs.  Any help would be greatly appreciated.

This lacks a lot of information.

According to the PAF resource agent, your instances are in an "unexpected state" on both nodes while PAF was actually
trying to stop them.

Pacemaker might decide to stop a resource if the start operation fails.
Stopping it when the start failed gives the resource agent a chance to stop the resource gracefully if still
possible.

I suspect you have some setup mistake on both nodes, maybe the exact same one...

You should probably provide your full logs from pacemaker/corosync with timing information so we can check all the
messages coming from PAF from the very beginning of the startup attempt.


>         have-watchdog=false \

You should probably consider setting up a watchdog in your cluster.

>         stonith-enabled=false \

This is really bad. Your cluster will NOT work as expected. PAF **requires** STONITH to be enabled and properly
working. Without it, sooner or later, you will experience unexpected reactions from the cluster (freezing all actions,
etc.).

>         no-quorum-policy=ignore \

You should not ignore quorum, even in a two-node cluster. See the "two_node"
parameter in the corosync.conf manual.

>         migration-threshold=1 \
> rsc_defaults rsc_defaults-options: \
>         migration-threshold=5 \

The latter is the supported way to set migration-threshold. Your "migration-threshold=1" should not be a cluster
property but a default resource option.

> My pcs Config
> Corosync Nodes:
> dcmilphlum223 dcmilphlum224
> Pacemaker Nodes:
> dcmilphlum223 dcmilphlum224
>
> Resources:
> Master: pgsql-ha
>   Meta Attrs: notify=true target-role=Stopped

This target-role might have been set by the cluster because it cannot fence nodes (which might be easier to deal with
in your situation, btw). That means the cluster will keep this resource down because of previous errors.

> recovery_template=/pgsql/data/pg7000/recovery.conf.pcmk

You should probably not put your recovery.conf.pcmk inside your PGDATA. This file differs between nodes, and as you
might want to rebuild the standby or the old master after some failures, you would have to correct it each time. Keep it
outside the PGDATA to avoid this useless step.

> dcmilphlum224: pgsqld-data-status=LATEST

I suppose this comes from the "pgsql" resource agent, definitely not from PAF...

Regards,


