Re: [HACKERS] Determine state of cluster (HA) - Mailing list pgsql-hackers

From Jehan-Guillaume de Rorthais
Subject Re: [HACKERS] Determine state of cluster (HA)
Date
Msg-id 20171016141025.6bc022b8@firost
In response to Re: [HACKERS] Determine state of cluster (HA)  (Craig Ringer <craig@2ndquadrant.com>)
List pgsql-hackers
On Mon, 16 Oct 2017 10:39:16 +0800
Craig Ringer <craig@2ndquadrant.com> wrote:

> On 13 October 2017 at 08:50, Joshua D. Drake <jd@commandprompt.com> wrote:

> > I had a long call with a firm developing front end proxy/cache/HA for
> > Postgres today. Essentially the software is a replacement for PGPool in
> > entirety but also supports analytics etc... When I was asking them about
> > pain points they talked about the below and I was wondering if this is a
> > problem we would like to solve.  
> 
> IMO: no one node knows the full state of the system, or can know it.

+1

> I'd love PostgreSQL to help users more with scaling, HA, etc. But I
> think it's a big job. We'd need:
> 
> - a node topology of some kind, communicated between nodes
> - heartbeat and monitoring
> - failover coordination
> - pooling/proxying
> - STONITH/fencing
> - etc.

And some of the items on this list cannot be in core. However, there are some
things PostgreSQL can do to make HA easier to deal with.

> That said, I do think it'd be very desirable for us to introduce a
> greater link from a standby to master:
> 
> - Get info about master. We should finish merging recovery.conf into
> postgresql.conf.

Agreed. +1.

To make things easier for the "cluster manager" piece outside of PostgreSQL, I
would add:

* being able to "demote" a master to a standby without a restart;
* being able to check the status of each node without eating a backend
  connection (to avoid hitting the "max_connections" limit); see the sketch
  after this list for what such a check costs today;
* being able to monitor each step of a switchover (or "controlled failover":
  standby/master role swapping between two nodes).
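
To illustrate the second point, here is a minimal sketch of what an external
cluster manager typically has to run today just to learn the role of a node.
It needs a full, authenticated backend connection, which is exactly what we
would like to avoid:

  -- run against each node, from the cluster manager
  SELECT pg_is_in_recovery();  -- false on the master, true on a standby

Multiply this by the number of nodes and the polling frequency, and these
connections quickly eat into the "max_connections" budget.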

> > b. Attempt to connect to the host directly, if not...
> > c. use the slave and use the hostname via dblink to connect to the master,
> > as the hostname , i.e. select * from dblink('" + connInfo + "
> > dbname=postgres', 'select inet_server_addr()') AS t(inet_server_addr inet).
> > This is necessary in the event the hostname used in the recovery.conf file
> > is not resolvable from the outside.  
> 
> OK, so "connect directly" here means from some 3rd party, the one
> doing the querying of the replica.

It seems to me the failover process is the one issuing all the required
commands to move the master role to another available standby. The knowledge
of the orchestration and of the final status (if everything went well) lives
in that piece of software. If you want to know where your master is in an
exotic or complex setup, ask whoever was responsible for promoting it.

HA should stay as simple as possible. The more complex the architecture, the
more failure scenarios you will have.

> > 1.  The dblink call doesn't have a way to specify a timeout, so we have to
> > use Java futures to control how long this may take to a reasonable amount of
> > time;  
> 
> statement_timeout doesn't work?
> 
> If not, that sounds like a sensible, separate feature to add. Patches welcome!
> 
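
For what it's worth, until such a feature exists, the timeouts can be pushed
into the dblink connection string itself. This is only a sketch relying on the
standard libpq "connect_timeout" and "options" parameters; the host and the
values are placeholders:

  -- bound both the connection attempt and the remote query
  SELECT *
  FROM dblink('host=10.0.0.1 dbname=postgres connect_timeout=2 options=''-c statement_timeout=2000''',
              'SELECT inet_server_addr()') AS t(inet_server_addr inet);

The remote backend then enforces statement_timeout itself, so the caller is
not stuck waiting forever on the remote side.
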
> > 2.  NAT mapping may result in us detecting IP ranges that are not accessible
> > to the application nodes.  
> 
> PostgreSQL can't do anything about this one.

You could get the master IP address from the "pg_stat_wal_receiver" view, but
this is still not enough. You might have dedicated, separate networks for the
applications and for the pgsql replication. If you want a standby to tell the
application where to connect to the master, then you will have to put this
information somewhere yourself, accessible from the application nodes.
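
For the record, this is the kind of query I have in mind on the standby side
(a sketch; the view and its "conninfo" column appeared in 9.6, if I remember
correctly):

  -- on a standby: how the walreceiver reaches its upstream
  -- (conninfo may be hidden from unprivileged roles)
  SELECT status, conninfo FROM pg_stat_wal_receiver;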

> > 3.  there is no easy way to monitor for state changes as they happen,
> > allowing faster failovers, everything has to be polled based on events;  

In the corosync world (the clustering piece of the Pacemaker ecosystem), node
failures are detected really fast, in about one second.
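
This detection speed is mostly governed by the totem token timeout in
corosync.conf. A sketch, as a rough idea only, since defaults vary between
distributions and corosync versions:

  totem {
      version: 2
      token: 1000    # time (ms) to wait for the token before declaring it lost
  }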

Considering application failure (pgsql here), this will be polling, yes. But I
fail to see how a dying application could warn the cluster before dying. Not
only about crashing (systemd could help there), but e.g. about entering an
infinite dummy loop or a resource-exhausting one.

> It'd be pretty simple to write a function that sleeps in the backend
> until it's promoted. I don't know off the top of my head if we set all
> proc latches when we promote, but if we don't it's probably harmless
> and somewhat useful to do so.

As soon as the cluster manager has promoted a new master, it can trigger an
event to notify whatever you need.

> Either way, you'd do long-polling. Call the function and let the
> connection block until something interesting happens. Use TCP
> keepalives to make sure you notice if it dies. Have the function
> return when the state changes.

This would still rely on the TCP keepalive frequency, so we are back to
polling :(
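
At best, the detection latency can be bounded by setting aggressive keepalive
parameters on that long-polling connection. A sketch using the standard libpq
keepalive settings (host name and values are placeholders):

  host=standby1 dbname=postgres keepalives=1
  keepalives_idle=2 keepalives_interval=1 keepalives_count=3

A dead peer is then noticed after roughly
keepalives_idle + keepalives_count * keepalives_interval seconds, which is
still polling, only pushed down to the TCP layer.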

Regards,


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
