Thread: failover vs. read only queries
Hi,

When the trigger file is created while the recovery keeps waiting for the
release of the lock by read only queries, it might take a very long time
for the standby to become the master. The recovery cannot go ahead until
those read only queries have gone away. This would increase the downtime
at the failover, and degrade the high availability.

To fix the problem, when the trigger file is found, I think that we should
cancel all the running read only queries immediately (or forcibly use -1
as the max_standby_delay since that point) and make the recovery go ahead.
If some people prefer queries over failover even when they create the
trigger file, we can make the trigger behavior selectable in response to
the content of the trigger file like pg_standby does.

This problem looks like a bug, so I'd like to fix that for 9.0. But the
amount of code change might not be small. Thoughts?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
> When the trigger file is created while the recovery keeps
> waiting for the release of the lock by read only queries,
> it might take a very long time for the standby to become
> the master. The recovery cannot go ahead until those read
> only queries have gone away. This would increase the downtime
> at the failover, and degrade the high availability.
>
> To fix the problem, when the trigger file is found, I think
> that we should cancel all the running read only queries
> immediately (or forcibly use -1 as the max_standby_delay
> since that point) and make the recovery go ahead. If some
> people prefer queries over failover even when they create the
> trigger file, we can make the trigger behavior selectable in
> response to the content of the trigger file like pg_standby
> does.
>
> This problem looks like a bug, so I'd like to fix that for
> 9.0. But the amount of code change might not be small.
> Thoughts?

+1. Downtime of an HA system is really important for HA users.

--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
Fujii Masao <masao.fujii@gmail.com> wrote:
> To fix the problem, when the trigger file is found, I think
> that we should cancel all the running read only queries
> immediately (or forcibly use -1 as the max_standby_delay
> since that point) and make the recovery go ahead.

Hmmm, does the following sequence work as you expect instead of the change?
It requires text-file manipulation in step 1, but seems to be more flexible.

1. Reset max_standby_delay = 0 in postgresql.conf
2. pg_ctl reload
3. Create a trigger file

BTW, I hope we will have "pg_ctl failover --timeout=N" in 9.1 instead of
the trigger file based management.

Regards,
---
Takahiro Itagaki
NTT Open Source Software Center
On Wed, Jun 9, 2010 at 5:47 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> To fix the problem, when the trigger file is found, I think
> that we should cancel all the running read only queries
> immediately (or forcibly use -1 as the max_standby_delay
> since that point) and make the recovery go ahead.

Oops! I made an error. I meant 0 instead of -1 as the max_standby_delay value.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Wed, Jun 9, 2010 at 6:13 PM, Takahiro Itagaki
<itagaki.takahiro@oss.ntt.co.jp> wrote:
>> To fix the problem, when the trigger file is found, I think
>> that we should cancel all the running read only queries
>> immediately (or forcibly use -1 as the max_standby_delay
>> since that point) and make the recovery go ahead.
>
> Hmmm, does the following sequence work as you expect instead of the change?
> It requires text-file manipulation in step 1, but seems to be more flexible.
>
> 1. Reset max_standby_delay = 0 in postgresql.conf
> 2. pg_ctl reload
> 3. Create a trigger file

As far as I read the HS code, SIGHUP is not checked while a recovery is
waiting for queries :( So pg_ctl reload would have no effect on the
conflicting queries.

Independently from the problem I raised, I think that we should call
HandleStartupProcInterrupts() in that sleep loop.

> BTW, I hope we will have "pg_ctl failover --timeout=N" in 9.1
> instead of the trigger file based management.

Please feel free to try that ;)

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii Masao <masao.fujii@gmail.com> writes:
> When the trigger file is created while the recovery keeps
> waiting for the release of the lock by read only queries,
> it might take a very long time for the standby to become
> the master. The recovery cannot go ahead until those read
> only queries have gone away. This would increase the downtime
> at the failover, and degrade the high availability.

> To fix the problem, when the trigger file is found, I think
> that we should cancel all the running read only queries
> immediately (or forcibly use -1 as the max_standby_delay
> since that point) and make the recovery go ahead. If some
> people prefer queries over failover even when they create the
> trigger file, we can make the trigger behavior selectable in
> response to the content of the trigger file like pg_standby
> does.

> This problem looks like a bug, so I'd like to fix that for
> 9.0. But the amount of code change might not be small.
> Thoughts?

-1. This looks like 9.1 material to me, and besides I'm not even convinced
that what you propose is a good solution.

			regards, tom lane
> To fix the problem, when the trigger file is found, I think
> that we should cancel all the running read only queries
> immediately (or forcibly use -1 as the max_standby_delay
> since that point) and make the recovery go ahead. If some
> people prefer queries over failover even when they create the
> trigger file, we can make the trigger behavior selectable in
> response to the content of the trigger file like pg_standby
> does.

Well, the question is: are there users who would prefer not to have slave
queries cancelled and are willing to wait for failover? If so, behavior of
failover should really be slaved to max_standby_delay. If not, there
should be new behavior (i.e. "when the trigger file is found, cancel all
running queries"). One could argue that there are no users of the first
case.

The fact that failover currently does *not* terminate existing queries and
transactions was regarded as a feature by the audience, rather than a bug,
when I did demos of HS/SR. Of course, they might not have been thinking of
the delay for writes.

If there were an easy way to make the trigger file cancel all running
queries, apply remaining logs and come up, then I'd vote for that for 9.0.
I think it's the more desired behavior by most users. However, I'm opposed
to any complex solutions which might delay 9.0 release.

--
-- Josh Berkus
   PostgreSQL Experts Inc.
   http://www.pgexperts.com
On Wed, Jun 9, 2010 at 3:22 PM, Josh Berkus <josh@agliodbs.com> wrote:
>
>> To fix the problem, when the trigger file is found, I think
>> that we should cancel all the running read only queries
>> immediately (or forcibly use -1 as the max_standby_delay
>> since that point) and make the recovery go ahead. If some
>> people prefer queries over failover even when they create the
>> trigger file, we can make the trigger behavior selectable in
>> response to the content of the trigger file like pg_standby
>> does.
>
> Well, the question is: are there users who would prefer not to have
> slave queries cancelled and are willing to wait for failover? If so,
> behavior of failover should really be slaved to max_standby_delay. If
> not, there should be new behavior (i.e. "when the trigger file is found,
> cancel all running queries"). One could argue that there are no users
> of the first case.
>
> The fact that failover currently does *not* terminate existing queries and
> transactions was regarded as a feature by the audience, rather than a
> bug, when I did demos of HS/SR. Of course, they might not have been
> thinking of the delay for writes.
>
> If there were an easy way to make the trigger file cancel all running
> queries, apply remaining logs and come up, then I'd vote for that for
> 9.0. I think it's the more desired behavior by most users. However,
> I'm opposed to any complex solutions which might delay 9.0 release.

One complication here is that, at least as I understand it, Tom is planning
to overhaul max_standby_delay. So it might be premature to try to figure
out how this should work until the dust settles. But my intuition is
similar to yours, overall.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
Josh Berkus <josh@agliodbs.com> writes:
> The fact that failover currently does *not* terminate existing queries and
> transactions was regarded as a feature by the audience, rather than a
> bug, when I did demos of HS/SR. Of course, they might not have been
> thinking of the delay for writes.

> If there were an easy way to make the trigger file cancel all running
> queries, apply remaining logs and come up, then I'd vote for that for
> 9.0. I think it's the more desired behavior by most users. However,
> I'm opposed to any complex solutions which might delay 9.0 release.

My feeling about it is that if you want fast failover you should not have
your failover target server configured as hot standby at all, let alone
hot standby with a long max_standby_delay. Such a slave could be very far
behind on applying WAL when the crunch comes, and no amount of query
killing will save you from that. Put your long-running standby queries on
a different slave instead.

We should consider whether we can improve the situation in 9.1, but it is
not a must-fix for 9.0; especially when the correct behavior isn't
immediately obvious.

			regards, tom lane
On Wed, 2010-06-09 at 12:22 -0700, Josh Berkus wrote:
> > To fix the problem, when the trigger file is found, I think
> > that we should cancel all the running read only queries
> > immediately (or forcibly use -1 as the max_standby_delay
> > since that point) and make the recovery go ahead. If some
> > people prefer queries over failover even when they create the
> > trigger file, we can make the trigger behavior selectable in
> > response to the content of the trigger file like pg_standby
> > does.
>
> Well, the question is: are there users who would prefer not to have
> slave queries cancelled and are willing to wait for failover? If so,
> behavior of failover should really be slaved to max_standby_delay. If
> not, there should be new behavior (i.e. "when the trigger file is found,
> cancel all running queries"). One could argue that there are no users
> of the first case.
>
> The fact that failover currently does *not* terminate existing queries and
> transactions was regarded as a feature by the audience, rather than a
> bug, when I did demos of HS/SR. Of course, they might not have been
> thinking of the delay for writes.

+1

Just to add: there is only a delay in triggering *if* the standby is
waiting on a query at or after triggering. If there is a wait, it is never
more than max_standby_delay, which is what the user said they would be
happy to accept.

> If there were an easy way to make the trigger file cancel all running
> queries, apply remaining logs and come up, then I'd vote for that for
> 9.0. I think it's the more desired behavior by most users. However,
> I'm opposed to any complex solutions which might delay 9.0 release.

In 8.4 you could specify "fast" failover or "smart" failover. In 9.0,
AFAICS we have only implemented "smart" failover, which means it will
continue until the end of the WAL stream before triggering. So under heavy
streaming load or with considerable lag the trigger won't cause failover
for some time. So there is less functionality in 9.0 than was available in
8.4. If that removal was intended, it wasn't discussed.

--
Simon Riggs
www.2ndQuadrant.com
Fujii Masao <masao.fujii@gmail.com> wrote:
> > 1. Reset max_standby_delay = 0 in postgresql.conf
> > 2. pg_ctl reload
> > 3. Create a trigger file
>
> As far as I read the HS code, SIGHUP is not checked while a recovery
> is waiting for queries :( So pg_ctl reload would have no effect on
> the conflicting queries.
>
> Independently from the problem I raised, I think that we should call
> HandleStartupProcInterrupts() in that sleep loop.

Hmmm, if reload doesn't work, can we write a query like below?

SELECT pg_terminate_backend(pid)
FROM pg_locks
WHERE conflicted-with-recovery-process;

Regards,
---
Takahiro Itagaki
NTT Open Source Software Center
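One possible concrete form of the "conflicted-with-recovery-process"
placeholder above, as a sketch only: terminate the sessions holding a
relation lock that some other process is waiting for. Whether the startup
process's ungranted lock request is actually visible in pg_locks while
recovery waits is part of the open question here, so the self-join below
(column names as in 9.0's pg_locks) is illustrative rather than a tested
recipe:

-- Sketch only: assumes the waiting (ungranted) request shows up in pg_locks.
SELECT pg_terminate_backend(held.pid)
FROM pg_locks held
JOIN pg_locks waiting
  ON waiting.locktype = held.locktype
 AND waiting.database IS NOT DISTINCT FROM held.database
 AND waiting.relation IS NOT DISTINCT FROM held.relation
 AND NOT waiting.granted
WHERE held.granted
  AND held.pid IS DISTINCT FROM waiting.pid;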
> The fact that failover currently does *not* terminate existing queries and
> transactions was regarded as a feature by the audience, rather than a
> bug, when I did demos of HS/SR. Of course, they might not have been
> thinking of the delay for writes.

You would probably hear a different response from serious users who want a
usable HA system. I have a number of customers who are using our HA systems
(they use several technologies such as commercial HA solutions, pgpool-II
and Slony-I). One of the top 3 questions I get when we propose our HA
solution is, "how long will it take to do failover when the master DB
crashes?"

--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
On 10/06/10 14:07, Tatsuo Ishii wrote:
>
> One of the top 3 questions I get when we propose our HA solution is,
> "how long will it take to do failover when the master DB crashes?"
>

Same here, +1.
On Thu, Jun 10, 2010 at 5:06 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> The fact that failover currently does *not* terminate existing queries and
>> transactions was regarded as a feature by the audience, rather than a
>> bug, when I did demos of HS/SR. Of course, they might not have been
>> thinking of the delay for writes.
>
>> If there were an easy way to make the trigger file cancel all running
>> queries, apply remaining logs and come up, then I'd vote for that for
>> 9.0. I think it's the more desired behavior by most users. However,
>> I'm opposed to any complex solutions which might delay 9.0 release.
>
> My feeling about it is that if you want fast failover you should not
> have your failover target server configured as hot standby at all, let
> alone hot standby with a long max_standby_delay. Such a slave could be
> very far behind on applying WAL when the crunch comes, and no amount of
> query killing will save you from that. Put your long-running standby
> queries on a different slave instead.
>
> We should consider whether we can improve the situation in 9.1, but it
> is not a must-fix for 9.0; especially when the correct behavior isn't
> immediately obvious.

OK. Let's revisit in 9.1.

I attached the proposal patch for 9.1. The patch treats max_standby_delay
as zero (i.e., cancels all the conflicting queries immediately) from the
moment the trigger file is created. So we can cause a recovery to end
without waiting for any lock held by queries, and minimize the failover
time. OTOH, queries which don't conflict with a recovery survive the
failover.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
On Thu, Jun 10, 2010 at 9:58 AM, Takahiro Itagaki
<itagaki.takahiro@oss.ntt.co.jp> wrote:
>
> Fujii Masao <masao.fujii@gmail.com> wrote:
>
>> > 1. Reset max_standby_delay = 0 in postgresql.conf
>> > 2. pg_ctl reload
>> > 3. Create a trigger file
>>
>> As far as I read the HS code, SIGHUP is not checked while a recovery
>> is waiting for queries :( So pg_ctl reload would have no effect on
>> the conflicting queries.
>>
>> Independently from the problem I raised, I think that we should call
>> HandleStartupProcInterrupts() in that sleep loop.
>
> Hmmm, if reload doesn't work, can we write a query like below?
>
> SELECT pg_terminate_backend(pid)
> FROM pg_locks
> WHERE conflicted-with-recovery-process;

I'm not sure about that, but as you suggested, we can minimize the failover
time by using the following procedure even in 9.0:

1. Reset max_standby_delay = 0 in postgresql.conf
2. pg_ctl reload
3. Cancel all the queries, or all the conflicting ones (see the sketch below)
4. Create a trigger file

For now, I'll use the above when building an HA system using 9.0 and
clusterware.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
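Step 3 above, in its bluntest form, can be a single query. A sketch,
assuming 9.0's catalogs (where the backend PID column of pg_stat_activity
is still named procpid); this cancels every other session's current query
rather than only the ones that actually conflict with recovery:

-- Blunt variant of step 3: cancel every other session's running query.
-- (In 9.0, pg_stat_activity's PID column is named procpid.)
SELECT pg_cancel_backend(procpid)
FROM pg_stat_activity
WHERE procpid <> pg_backend_pid();

Using pg_terminate_backend() instead of pg_cancel_backend() would drop the
sessions entirely rather than just cancelling their queries.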
On 06/09/2010 07:36 PM, Mark Kirkwood wrote:
> On 10/06/10 14:07, Tatsuo Ishii wrote:
>>
>> One of the top 3 questions I get when we propose our HA solution is,
>> "how long will it take to do failover when the master DB crashes?"
>>
>
> Same here, +1.

In that case, wouldn't they set max_standby_delay to 0? In which case the
failover problem goes away, no?

--
-- Josh Berkus
   PostgreSQL Experts Inc.
   http://www.pgexperts.com
On Fri, Jun 11, 2010 at 1:48 AM, Josh Berkus <josh@agliodbs.com> wrote:
> On 06/09/2010 07:36 PM, Mark Kirkwood wrote:
>>
>> On 10/06/10 14:07, Tatsuo Ishii wrote:
>>>
>>> One of the top 3 questions I get when we propose our HA solution is,
>>> "how long will it take to do failover when the master DB crashes?"
>>>
>>
>> Same here, +1.
>
> In that case, wouldn't they set max_standby_delay to 0? In which case the
> failover problem goes away, no?

Yes, but I guess they'd also like to run read only queries on the standby.
Setting max_standby_delay to 0 would prevent them from doing that, because
conflicts with the replay of VACUUM or HOT records would often happen.
vacuum_defer_cleanup_age would be helpful for that case, but it seems hard
to tune.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
Fujii Masao wrote:
> On Thu, Jun 10, 2010 at 5:06 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Josh Berkus <josh@agliodbs.com> writes:
> >> The fact that failover currently does *not* terminate existing queries and
> >> transactions was regarded as a feature by the audience, rather than a
> >> bug, when I did demos of HS/SR. Of course, they might not have been
> >> thinking of the delay for writes.
> >
> >> If there were an easy way to make the trigger file cancel all running
> >> queries, apply remaining logs and come up, then I'd vote for that for
> >> 9.0. I think it's the more desired behavior by most users. However,
> >> I'm opposed to any complex solutions which might delay 9.0 release.
> >
> > My feeling about it is that if you want fast failover you should not
> > have your failover target server configured as hot standby at all, let
> > alone hot standby with a long max_standby_delay. Such a slave could be
> > very far behind on applying WAL when the crunch comes, and no amount of
> > query killing will save you from that. Put your long-running standby
> > queries on a different slave instead.
> >
> > We should consider whether we can improve the situation in 9.1, but it
> > is not a must-fix for 9.0; especially when the correct behavior isn't
> > immediately obvious.
>
> OK. Let's revisit in 9.1.
>
> I attached the proposal patch for 9.1. The patch treats max_standby_delay
> as zero (i.e., cancels all the conflicting queries immediately) from the
> moment the trigger file is created. So we can cause a recovery to end
> without waiting for any lock held by queries, and minimize the failover
> time. OTOH, queries which don't conflict with a recovery survive the
> failover.

Should this be added to the first 9.1 commitfest?

--
Bruce Momjian <bruce@momjian.us>  http://momjian.us
EnterpriseDB                      http://enterprisedb.com

+ None of us is going to be here forever. +
Bruce Momjian <bruce@momjian.us> writes:
> Fujii Masao wrote:
>> On Thu, Jun 10, 2010 at 5:06 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> My feeling about it is that if you want fast failover you should not
>>> have your failover target server configured as hot standby at all, let
>>> alone hot standby with a long max_standby_delay. Such a slave could be
>>> very far behind on applying WAL when the crunch comes, and no amount of
>>> query killing will save you from that. Put your long-running standby
>>> queries on a different slave instead.
>>>
>>> We should consider whether we can improve the situation in 9.1, but it
>>> is not a must-fix for 9.0; especially when the correct behavior isn't
>>> immediately obvious.

>> OK. Let's revisit in 9.1.
>>
>> I attached the proposal patch for 9.1. The patch treats max_standby_delay
>> as zero (i.e., cancels all the conflicting queries immediately) from the
>> moment the trigger file is created. So we can cause a recovery to end
>> without waiting for any lock held by queries, and minimize the failover
>> time. OTOH, queries which don't conflict with a recovery survive the
>> failover.

> Should this be added to the first 9.1 commitfest?

Not sure ... it seems like a proof of concept for a pretty dubious concept.
If you want a slave to be ready for fast failover then you should not be
letting it get far behind the master in the first place. I think there's
some missing piece here, but I'm not quite sure what to propose.

			regards, tom lane