Thread: [HACKERS] Determine state of cluster (HA)
-Hackers,

I had a long call with a firm developing a front-end proxy/cache/HA product for Postgres today. Essentially the software is a replacement for PGPool in its entirety, but it also supports analytics etc. When I asked them about pain points they described the issues below, and I was wondering whether this is a problem we would like to solve:

Per your request, here is our failover issue.

1. In a modern devops environment, the database should be able to scale and morph over time based on need.

2. Tools that leverage the database should be able to easily discover and potentially control (with permissions) the database. Currently you can discover the master and which nodes are syncing off of it, but after a failure a tool can't easily discover what orchestration has done on the back end to make the cluster whole again, i.e. from the slave you can't discover the master reliably and easily.

The logic our code now uses is to:

1. Find the master.
2. Add replication nodes per the master's configuration.

To find a master, we start with a list of candidate nodes that MAY be a master at any point, and:

1. Issue "SELECT pg_is_in_recovery()" to find out whether it is a slave.
   a. If so, use "SELECT pg_read_file('recovery.conf')" to extract the host.
   b. Attempt to connect to that host directly; if that fails...
   c. ...go through the slave and use that hostname via dblink to connect to the master, i.e. select * from dblink('" + connInfo + " dbname=postgres', 'select inet_server_addr()') AS t(inet_server_addr inet). This is necessary in the event the hostname used in the recovery.conf file is not resolvable from the outside.
   d. Use the dblink connection to identify the master node via select inet_server_addr().
   e. Connect to the IP provided by the master.
   f. Repeat through the nodes until we reach a master.

Issues:

1. The dblink call doesn't have a way to specify a timeout, so we have to use Java futures to limit how long it may take to a reasonable amount of time.
2. NAT mapping may result in us detecting IP ranges that are not accessible from the application nodes.
3. There is no easy way to monitor for state changes as they happen, which would allow faster failovers; everything has to be polled instead of being driven by events.
4. It doesn't support cascading replication very well, although we could augment the logic to map the relationships between nodes.
5. There is no way to connect to a db node with something akin to SQL-Server's "application intent" flags, which would allow a connection to be rejected if we want it to be a read/write connection. This helps detect the state of the node directly, without having to ask it any further questions, and makes it easier to "stall" during connection until a proper connection can be made.
6. The master, on shutdown, will not actually close and disable connections as it shuts down; instead, it issues an error saying that it is shutting down as it does so.

Fundamentally, the biggest issue is that it is very hard to determine the state of the cluster by asking all the nodes, in particular after a failure. Some state information is lost that is necessary to talk to the cluster in a reliable manner going forward.

--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
PostgreSQL Centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://pgconf.us
***** Unless otherwise stated, opinions are my own. *****
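Concretely, the discovery sequence described above boils down to SQL along these lines (a sketch: "upstream-host" and the 10-second timeouts are placeholders, and the connect_timeout/options parameters shown are one possible way to bound the dblink round-trip, which partly addresses issue 1; whether a local statement_timeout can interrupt a dblink call that is blocked while connecting is a separate question):

    -- Step 1: is this candidate node a slave?
    SELECT pg_is_in_recovery();

    -- Step 1a: if so, read recovery.conf to find the configured upstream
    -- (pg_read_file requires superuser rights here).
    SELECT pg_read_file('recovery.conf');

    -- Steps 1c/1d: if that host is not resolvable from the outside, hop
    -- through the slave with dblink and ask the master for its own address.
    SELECT *
      FROM dblink('host=upstream-host dbname=postgres connect_timeout=10 options=''-c statement_timeout=10000''',
                  'SELECT inet_server_addr()') AS t(inet_server_addr inet);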
On 10/12/2017 05:50 PM, Joshua D. Drake wrote:
> -Hackers,
> [...]

Bumping this.
On 13 October 2017 at 08:50, Joshua D. Drake <jd@commandprompt.com> wrote:
> -Hackers,
>
> I had a long call with a firm developing front end proxy/cache/HA for
> Postgres today. Essentially the software is a replacement for PGPool in
> entirety but also supports analytics etc... When I was asking them about
> pain points they talked about the below and I was wondering if this is a
> problem we would like to solve.

IMO: no one node knows the full state of the system, or can know it.

I'd love PostgreSQL to help users more with scaling, HA, etc. But I
think it's a big job. We'd need:

- a node topology of some kind, communicated between nodes
- heartbeat and monitoring
- failover coordination
- pooling/proxying
- STONITH/fencing
- etc.

That said, I do think it'd be very desirable for us to introduce a
greater link from a standby to master:

- Get info about master. We should finish merging recovery.conf into
  postgresql.conf.

> b. Attempt to connect to the host directly, if not...
> c. use the slave and use the hostname via dblink to connect to the master,
> as the hostname, i.e. select * from dblink('" + connInfo + "
> dbname=postgres', 'select inet_server_addr()') AS t(inet_server_addr inet).
> This is necessary in the event the hostname used in the recovery.conf file
> is not resolvable from the outside.

OK, so "connect directly" here means from some 3rd party, the one
doing the querying of the replica.

> 1. The dblink call doesn't have a way to specify a timeout, so we have to
> use Java futures to control how long this may take to a reasonable amount
> of time;

statement_timeout doesn't work?

If not, that sounds like a sensible, separate feature to add. Patches welcome!

> 2. NAT mapping may result in us detecting IP ranges that are not accessible
> to the application nodes.

PostgreSQL can't do anything about this one.

> 3. there is no easy way to monitor for state changes as they happen,
> allowing faster failovers, everything has to be polled based on events;

It'd be pretty simple to write a function that sleeps in the backend
until it's promoted. I don't know off the top of my head if we set all
proc latches when we promote, but if we don't it's probably harmless
and somewhat useful to do so.

Either way, you'd do long-polling. Call the function and let the
connection block until something interesting happens. Use TCP
keepalives to make sure you notice if it dies. Have the function
return when the state changes.

> 4. It doesn't support cascading replication very well, although we could
> augment the logic to allow us to map the relationship between nodes.
> 5. There is no way to connect to a db node with something akin to
> SQL-Server's "application intent" flags, to allow a connection to be
> rejected if we wish it to be a read/write connection. This helps detect the
> state of the node directly without having to ask any further questions of
> the node, and makes it easier to "stall" during connection until a proper
> connection can be made.

That sounds desirable, and a good step toward eventually being able to
transparently re-route read/write queries from replica to master,
which is where I'd like to land up eventually.

Again, that'd be a sensible patch to submit, quite separately to the
other topics.

> 6. The master, on shutdown, will not actually close and disable connections
> as it shuts down, instead, it will issue an error that it is shutting down
> as it does so.

Er, yes? I don't understand what you are getting at here.
Can you describe expected vs actual behaviour in more detail?
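As a rough sketch of the long-polling idea above, a plain polling loop (rather than the latch-based approach described) already gives clients a single blocking call; the function name wait_for_promotion and the poll interval are made up for illustration:

    -- Blocks until this server leaves recovery, i.e. until it is promoted.
    -- A client calls it and simply waits; TCP keepalives catch a dead server.
    CREATE OR REPLACE FUNCTION wait_for_promotion(poll_seconds float DEFAULT 1.0)
    RETURNS void
    LANGUAGE plpgsql AS $$
    BEGIN
        WHILE pg_is_in_recovery() LOOP
            PERFORM pg_sleep(poll_seconds);
        END LOOP;
    END;
    $$;

    -- Client side: returns only once the node has been promoted.
    SELECT wait_for_promotion();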
On Mon, Oct 16, 2017 at 4:39 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 13 October 2017 at 08:50, Joshua D. Drake <jd@commandprompt.com> wrote:
>> 5. There is no way to connect to a db node with something akin to
>> SQL-Server's "application intent" flags, to allow a connection to be
>> rejected if we wish it to be a read/write connection. This helps detect the
>> state of the node directly without having to ask any further questions of
>> the node, and makes it easier to "stall" during connection until a proper
>> connection can be made.
>
> That sounds desirable, and a good step toward eventually being able to
> transparently re-route read/write queries from replica to master.
> Which is where I'd like to land up eventually.

It also sounds a lot like the connection parameter target_session_attrs, does it not? We don't reroute active connections based on it, and we're not smart enough to do anything beyond "try them one by one until you reach the one with the correct attributes", but the basic functionality is there. Basically what we already have fulfills what JD is suggesting, but not what Craig is, if I understand it correctly.
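For illustration, with a v10 libpq the multi-host syntax plus target_session_attrs=read-write does that "try them one by one" walk for any client, e.g. the connection string "host=node1,node2,node3 dbname=postgres target_session_attrs=read-write" (the node names are placeholders). The same kind of string can be used through dblink, assuming dblink is linked against a v10 libpq:

    -- Let libpq walk the candidate list and insist on a read/write session,
    -- then report which node answered, i.e. the current master.
    SELECT *
      FROM dblink('host=node1,node2,node3 dbname=postgres target_session_attrs=read-write',
                  'SELECT inet_server_addr()') AS t(master_addr inet);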
On Mon, 16 Oct 2017 10:39:16 +0800 Craig Ringer <craig@2ndquadrant.com> wrote:
> On 13 October 2017 at 08:50, Joshua D. Drake <jd@commandprompt.com> wrote:
>> I had a long call with a firm developing front end proxy/cache/HA for
>> Postgres today. Essentially the software is a replacement for PGPool in
>> entirety but also supports analytics etc... When I was asking them about
>> pain points they talked about the below and I was wondering if this is a
>> problem we would like to solve.
>
> IMO: no one node knows the full state of the system, or can know it.

+1

> I'd love PostgreSQL to help users more with scaling, HA, etc. But I
> think it's a big job. We'd need:
>
> - a node topology of some kind, communicated between nodes
> - heartbeat and monitoring
> - failover coordination
> - pooling/proxying
> - STONITH/fencing
> - etc.

And some of the items on this list cannot be in core. However, there are some things PostgreSQL can do to make HA easier to deal with.

> That said, I do think it'd be very desirable for us to introduce a
> greater link from a standby to master:
>
> - Get info about master. We should finish merging recovery.conf into
>   postgresql.conf.

Agreed, +1. To make things easier for the "cluster manager" piece outside of PostgreSQL, I would add:

* being able to "demote" a master to a standby without a restart;
* being able to check the status of each node without eating a backend connection (to avoid hitting the "max_connections" limit);
* being able to monitor each step of a switchover (or "controlled failover": standby/master role swapping between two nodes).

>> b. Attempt to connect to the host directly, if not...
>> c. use the slave and use the hostname via dblink to connect to the master,
>> as the hostname, i.e. select * from dblink('" + connInfo + "
>> dbname=postgres', 'select inet_server_addr()') AS t(inet_server_addr inet).
>> This is necessary in the event the hostname used in the recovery.conf file
>> is not resolvable from the outside.
>
> OK, so "connect directly" here means from some 3rd party, the one
> doing the querying of the replica.

It seems to me the failover process is the one issuing all the commands required to move the master role to another available standby. The knowledge of the orchestration and of the final status (if everything went well) lives in that piece of software. If you want to know where your master is in an exotic or complex setup, ask whoever was responsible for promoting it.

HA should stay as simple as possible. The more complex the architecture, the more failure scenarios you will have.

>> 1. The dblink call doesn't have a way to specify a timeout, so we have to
>> use Java futures to control how long this may take to a reasonable amount
>> of time;
>
> statement_timeout doesn't work?
>
> If not, that sounds like a sensible, separate feature to add. Patches welcome!
>
>> 2. NAT mapping may result in us detecting IP ranges that are not accessible
>> to the application nodes.
>
> PostgreSQL can't do anything about this one.

You could get the master IP address from the "pg_stat_wal_receiver" view. But this is still not enough. You might have dedicated, separate networks for the applications and for pgsql replication. If you want a standby to tell the application where to connect to the master, you will have to put this information somewhere accessible from the application nodes yourself.

>> 3. there is no easy way to monitor for state changes as they happen,
>> allowing faster failovers, everything has to be polled based on events;

In the corosync world (the clustering piece of the Pacemaker ecosystem), node failures are detected really, really fast: about 1s. Considering application failure (pgsql here), this will be polling, yes. But I fail to see how a dying application could warn the cluster before dying; not only when crashing (systemd could help there), but e.g. before entering an infinite dummy loop or an exhausting one.

> It'd be pretty simple to write a function that sleeps in the backend
> until it's promoted. I don't know off the top of my head if we set all
> proc latches when we promote, but if we don't it's probably harmless
> and somewhat useful to do so.

As soon as the cluster manager has promoted a new master, it can trigger an event to notify whatever you need.

> Either way, you'd do long-polling. Call the function and let the
> connection block until something interesting happens. Use TCP
> keepalives to make sure you notice if it dies. Have the function
> return when the state changes.

This would still rely on the TCP keepalive frequency, so back to polling :(

Regards,
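For reference, the pg_stat_wal_receiver route mentioned above looks roughly like this on a streaming standby (9.6 or later); the upstream is buried in the conninfo column, so it still has to be parsed out, and it may well be a hostname or replication-network address the applications cannot reach:

    -- On a standby: ask the WAL receiver where it is streaming from.
    SELECT status, conninfo
      FROM pg_stat_wal_receiver;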
On 10/15/2017 07:39 PM, Craig Ringer wrote:
> On 13 October 2017 at 08:50, Joshua D. Drake <jd@commandprompt.com> wrote:
>> -Hackers,
>>
>> I had a long call with a firm developing front end proxy/cache/HA for
>> Postgres today. Essentially the software is a replacement for PGPool in
>> entirety but also supports analytics etc... When I was asking them about
>> pain points they talked about the below and I was wondering if this is a
>> problem we would like to solve.
>
> IMO: no one node knows the full state of the system, or can know it.

That isn't exactly true. We do know whether our replication state is current, but only from the master, which is part of the problem.

> I'd love PostgreSQL to help users more with scaling, HA, etc. But I
> think it's a big job. We'd need:
>
> - a node topology of some kind, communicated between nodes
> - heartbeat and monitoring
> - failover coordination
> - pooling/proxying
> - STONITH/fencing
> - etc.

I don't think we need all of that. This is more of a request to make it easier for those deploying HA to determine the state of Postgres.

> That said, I do think it'd be very desirable for us to introduce a
> greater link from a standby to master:
>
> - Get info about master. We should finish merging recovery.conf into
>   postgresql.conf.

Definitely.

>> b. Attempt to connect to the host directly, if not...
>> c. use the slave and use the hostname via dblink to connect to the master,
>> as the hostname, i.e. select * from dblink('" + connInfo + "
>> dbname=postgres', 'select inet_server_addr()') AS t(inet_server_addr inet).
>> This is necessary in the event the hostname used in the recovery.conf file
>> is not resolvable from the outside.
>
> OK, so "connect directly" here means from some 3rd party, the one
> doing the querying of the replica.
>
>> 1. The dblink call doesn't have a way to specify a timeout, so we have to
>> use Java futures to control how long this may take to a reasonable amount
>> of time;
>
> statement_timeout doesn't work?

That would be a workaround, definitely, but I think it would be better to say: ALTER SYSTEM SET PROMOTE TIMEOUT '120' (an example, let's not get off into the weeds :P), and if the standby can't receive a ping/ack within 120 it will promote itself.

> PostgreSQL can't do anything about this one.

Yes, that's true.

>> 4. It doesn't support cascading replication very well, although we could
>> augment the logic to allow us to map the relationship between nodes.
>> 5. There is no way to connect to a db node with something akin to
>> SQL-Server's "application intent" flags, to allow a connection to be
>> rejected if we wish it to be a read/write connection. This helps detect the
>> state of the node directly without having to ask any further questions of
>> the node, and makes it easier to "stall" during connection until a proper
>> connection can be made.
>
> That sounds desirable, and a good step toward eventually being able to
> transparently re-route read/write queries from replica to master.
> Which is where I'd like to land up eventually.
>
> Again, that'd be a sensible patch to submit, quite separately to the
> other topics.

Great.

>> 6. The master, on shutdown, will not actually close and disable connections
>> as it shuts down, instead, it will issue an error that it is shutting down
>> as it does so.
>
> Er, yes? I don't understand what you are getting at here.

Yes, I will need to go back to them on this one. I think what they mean is that when a connection is being closed it doesn't return why it is closing; it just throws an error.

> Can you describe expected vs actual behaviour in more detail?

I will need to get back to them on this, but I think the expected behaviour would be to have a return value saying why the connection was closed rather than just throwing an error. Say, "RETURN 66" means someone executed pg_going_to_failover() vs pg_terminate_backend(), which could be for different reasons.

Thanks for responding, I am mostly the intermediary here,

JD
On 10/16/2017 03:55 AM, Magnus Hagander wrote:
> On Mon, Oct 16, 2017 at 4:39 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
>> On 13 October 2017 at 08:50, Joshua D. Drake <jd@commandprompt.com> wrote:
>>> 5. There is no way to connect to a db node with something akin to
>>> SQL-Server's "application intent" flags, to allow a connection to be
>>> rejected if we wish it to be a read/write connection. This helps detect the
>>> state of the node directly without having to ask any further questions of
>>> the node, and makes it easier to "stall" during connection until a proper
>>> connection can be made.
>>
>> That sounds desirable, and a good step toward eventually being able to
>> transparently re-route read/write queries from replica to master.
>> Which is where I'd like to land up eventually.
>
> It also sounds a lot like the connection parameter target_session_attrs

Ahh, this is part of the new libpq failover support, right?

Thanks,

JD
On 17 October 2017 at 01:02, Joshua D. Drake <jd@commandprompt.com> wrote:
> On 10/15/2017 07:39 PM, Craig Ringer wrote:
>> On 13 October 2017 at 08:50, Joshua D. Drake <jd@commandprompt.com> wrote:
>>> -Hackers,
>>>
>>> I had a long call with a firm developing front end proxy/cache/HA for
>>> Postgres today. Essentially the software is a replacement for PGPool in
>>> entirety but also supports analytics etc... When I was asking them about
>>> pain points they talked about the below and I was wondering if this is a
>>> problem we would like to solve.
>>
>> IMO: no one node knows the full state of the system, or can know it.
>
> That isn't exactly true. We do know if our replication state is current but
> only from the master which is part of the problem.

Sure. But unless you have a perfectly-reliable, latency-ignoring wormhole link between master and standby, the standby always has imperfect knowledge of the master. More importantly, it can never know for sure how old its knowledge is.

https://www.postgresql.org/docs/current/static/monitoring-stats.html#PG-STAT-WAL-RECEIVER-VIEW already does about the best we can probably do; in particular last_msg_send_time and last_msg_receipt_time, used in combination with latest_end_lsn and latest_end_time.

>> That said, I do think it'd be very desirable for us to introduce a
>> greater link from a standby to master:
>>
>> - Get info about master. We should finish merging recovery.conf into
>>   postgresql.conf.
>
> Definitely.

There's a patch from Abhijit Menon-Sen you could adopt for PostgreSQL 11 for that.

>>> 1. The dblink call doesn't have a way to specify a timeout, so we have to
>>> use Java futures to control how long this may take to a reasonable amount
>>> of time;
>>
>> statement_timeout doesn't work?
>
> That would be a work around definitely but I think it would be better to
> say: ALTER SYSTEM SET PROMOTE TIMEOUT '120' (Example, let's not get off into
> the weeds :P) and if the standby can't receive a ping/ack within 120 it will
> promote itself.

I'm confused by this. I thought you were talking about timeouts when querying the status of an upstream over dblink, not automatic self-promotion.

I'm really not a fan of Pg standbys self-promoting without working with an external co-ordinator that handles STONITH/fencing. It's a recipe for disaster. That's what I was saying upthread: implementing bits and pieces here can be quite dangerous. This also takes it well outside what you were talking about, improving the ability to detect Pg's state, and into having it become its own co-ordinator for HA actions.

So let's go back to the original question. What's missing that statement_timeout doesn't provide for querying remote servers for their status over dblink?

If you want a nicer way to say "look up whatever your conninfo in recovery.conf is, connect to it, get me some info on it and return it, possibly daisy-chaining up a chain of replicas if you reach the master" ... that's fine. But it's a different thing.

>> Er, yes? I don't understand what you are getting at here.
>
> Yes, I will need to go back to them on this one. I think what they mean is
> that if we have a connection that is getting closed it doesn't return why it
> is closing. It just throws an error.

Yes, we do.
From https://www.postgresql.org/docs/current/static/errcodes-appendix.html:

Class 57 — Operator Intervention

    57000   operator_intervention
    57014   query_canceled
    57P01   admin_shutdown
    57P02   crash_shutdown
    57P03   cannot_connect_now
    57P04   database_dropped

Maybe they want more granularity in terms of what reasons are given and what errors are reported. That's fine. But please provide specifics.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
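Referring back to the pg_stat_wal_receiver columns mentioned earlier in this message, a standby can at least estimate how stale its view of the master is; a sketch:

    -- On a standby: how far have we heard from the master, and how long ago?
    SELECT latest_end_lsn,
           latest_end_time,
           last_msg_send_time,
           last_msg_receipt_time,
           now() - last_msg_receipt_time AS time_since_last_message
      FROM pg_stat_wal_receiver;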
On 10/16/2017 07:31 PM, Craig Ringer wrote:
> On 17 October 2017 at 01:02, Joshua D. Drake <jd@commandprompt.com> wrote:
>> On 10/15/2017 07:39 PM, Craig Ringer wrote:
>>> - Get info about master. We should finish merging recovery.conf into
>>> postgresql.conf.
>>
>> Definitely.
>
> There's a patch from Abhijit Menon-Sen you could adopt for PostgreSQL
> 11 for that.

Do you have a link to this?

Thanks!

JD
On 10/17/2017 08:40 PM, Joshua D. Drake wrote:
> On 10/16/2017 07:31 PM, Craig Ringer wrote:
>> On 17 October 2017 at 01:02, Joshua D. Drake <jd@commandprompt.com> wrote:
>>> On 10/15/2017 07:39 PM, Craig Ringer wrote:
>>>> - Get info about master. We should finish merging recovery.conf into
>>>> postgresql.conf.
>>>
>>> Definitely.
>>
>> There's a patch from Abhijit Menon-Sen you could adopt for PostgreSQL
>> 11 for that.
>
> Do you have a link to this?

https://commitfest.postgresql.org/search/?searchterm=recovery.conf

--
Vik Fearing                        +33 6 46 75 15 36
http://2ndQuadrant.fr              PostgreSQL : Expertise, Formation et Support