Thread: Hot Standby conflict resolution handling
I was trying some simple queries on a Hot Standby with streaming replication.

On standby, I do this:

postgres=# begin transaction isolation level repeatable read;
BEGIN
postgres=# explain verbose select count(b) from test WHERE a > 100000;

On master, I insert a bunch of tuples in the table and run VACUUM ANALYZE:

postgres=# INSERT INTO test VALUES (generate_series(110001,120000), 'foo', 1);
INSERT 0 10000
postgres=# VACUUM ANALYZE test;
VACUUM

After max_standby_streaming_delay, the standby starts cancelling the queries. I get an error like this on the standby:

postgres=# explain verbose select count(b) from test WHERE a > 100000;
FATAL:  terminating connection due to conflict with recovery
DETAIL:  User query might have needed to see row versions that must be removed.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

So I have a couple of questions/concerns here.

1. Why throw a FATAL error here? A plain ERROR should be enough to abort the transaction. There are four places in ProcessInterrupts() where we throw these kinds of errors, and three of them are FATAL.

2911             if (DoingCommandRead)
2912                 ereport(FATAL,
2913                         (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
2914                          errmsg("terminating connection due to conflict with recovery"),
2915                          errdetail_recovery_conflict(),
2916                          errhint("In a moment you should be able to reconnect to the"
2917                                  " database and repeat your command.")));
2918             else
2919                 ereport(ERROR,
2920                         (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
2921                          errmsg("canceling statement due to conflict with recovery"),
2922                          errdetail_recovery_conflict()));

In this particular test case, the backend is DoingCommandRead, and that forces a FATAL error. I'm not sure why that is required. And even if it's necessary, IMHO we should add a comment explaining it. In the other two places where we throw FATAL, one looks legitimate, but I'm not sure about the other.

2836         else if (RecoveryConflictPending && RecoveryConflictRetryable)
2837         {
2838             pgstat_report_recovery_conflict(RecoveryConflictReason);
2839             ereport(FATAL,
2840                     (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
2841                      errmsg("terminating connection due to conflict with recovery"),
2842                      errdetail_recovery_conflict()));
2843         }
2844         else if (RecoveryConflictPending)
2845         {
2846             /* Currently there is only one non-retryable recovery conflict */
2847             Assert(RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_DATABASE);
2848             pgstat_report_recovery_conflict(RecoveryConflictReason);
2849             ereport(FATAL,
2850                     (errcode(ERRCODE_DATABASE_DROPPED),
2851                      errmsg("terminating connection due to conflict with recovery"),
2852                      errdetail_recovery_conflict()));
2853         }

AFAICS the first of these should be ereport(ERROR). Otherwise, irrespective of whether RecoveryConflictRetryable is true or false, we will always ereport(FATAL).

2. For my test, the error message itself looks wrong, because I did not actually remove any rows on the master. VACUUM probably marked a bunch of pages as all-visible, and that should have triggered a conflict on the standby in order to support index-only scans. IMHO we should improve the error message to avoid any confusion. Or we can add a new ProcSignalReason to differentiate between a cancel due to cleanup and one due to a visibilitymap_set() operation.

BTW, my current setup is using 9.2.1, but I don't see any code changes in master, so I assume the issues exist there too.

Thanks,
Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
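(To make that ProcSignalReason suggestion concrete, here is a hypothetical sketch. PROCSIG_RECOVERY_CONFLICT_VISIBILITY and its detail message are invented for illustration and do not exist in PostgreSQL; only the surrounding enum members and errdetail_recovery_conflict() are real.)

    /*
     * Hypothetical sketch only: the VISIBILITY reason and message below
     * are invented; the other members exist in PostgreSQL 9.2.
     */

    /* src/include/storage/procsignal.h */
    typedef enum
    {
        PROCSIG_CATCHUP_INTERRUPT,
        PROCSIG_NOTIFY_INTERRUPT,
        PROCSIG_RECOVERY_CONFLICT_DATABASE,
        PROCSIG_RECOVERY_CONFLICT_TABLESPACE,
        PROCSIG_RECOVERY_CONFLICT_LOCK,
        PROCSIG_RECOVERY_CONFLICT_SNAPSHOT,
        PROCSIG_RECOVERY_CONFLICT_VISIBILITY,   /* hypothetical new reason */
        PROCSIG_RECOVERY_CONFLICT_BUFFERPIN,
        PROCSIG_RECOVERY_CONFLICT_STARTUP_DEADLOCK,

        NUM_PROCSIGNALS             /* Must be last! */
    } ProcSignalReason;

    /* src/backend/tcop/postgres.c, errdetail_recovery_conflict(): a new case */
    case PROCSIG_RECOVERY_CONFLICT_VISIBILITY:
        errdetail("User query might have needed to see tuples on a page "
                  "that the master marked all-visible.");
        break;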
Hi,

On 2012-12-04 12:30:43 +0530, Pavan Deolasee wrote:
> I was trying some simple queries on a Hot Standby with streaming
> replication.
>
> On standby, I do this:
> postgres=# begin transaction isolation level repeatable read;
> BEGIN
> postgres=# explain verbose select count(b) from test WHERE a > 100000;
>
> On master, I insert a bunch of tuples in the table and run VACUUM ANALYZE:
> postgres=# INSERT INTO test VALUES (generate_series(110001,120000), 'foo', 1);
> INSERT 0 10000
> postgres=# VACUUM ANALYZE test;
> VACUUM
>
> After max_standby_streaming_delay, the standby starts cancelling the
> queries. I get an error like this on the standby:
> postgres=# explain verbose select count(b) from test WHERE a > 100000;
> FATAL:  terminating connection due to conflict with recovery
> DETAIL:  User query might have needed to see row versions that must be
> removed.
> HINT:  In a moment you should be able to reconnect to the database and
> repeat your command.
> server closed the connection unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.
> The connection to the server was lost. Attempting reset: Succeeded.
>
> So I have a couple of questions/concerns here.
>
> 1. Why throw a FATAL error here? A plain ERROR should be enough to
> abort the transaction. There are four places in ProcessInterrupts() where
> we throw these kinds of errors, and three of them are FATAL.

The problem here is that we're in IDLE IN TRANSACTION in this case, which currently cannot be cancelled (i.e. pg_cancel_backend() just won't do anything).

There are two problems making this non-trivial. For one, while we're in IDLE IN TXN the client doesn't expect a response on a protocol level, so we can't simply ereport() at that time. For another, when we're in IDLE IN TXN we're potentially inside OpenSSL, so we can't jump out of there anyway because that would quite likely corrupt the internal state of OpenSSL.

I tried to fix this before (c.f. "Idle in transaction cancellation" or similar), but while I had some kind of fix for the first issue (I saved the error and reported it later, when the protocol state allowed it), I missed the jumping-out-of-OpenSSL bit. I think it's not that hard to solve, though. I remember having something preliminary, but I never had the time to finish it. If I remember correctly, the trick was to set OpenSSL into non-blocking mode temporarily and return to the caller inside be-secure.c:my_sock_read. At that location ProcessInterrupts() can run safely, error out silently, and re-raise the error once we're in the right protocol state.

> 2836         else if (RecoveryConflictPending && RecoveryConflictRetryable)
> 2837         {
> 2838             pgstat_report_recovery_conflict(RecoveryConflictReason);
> 2839             ereport(FATAL,
> 2840                     (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
> 2841                      errmsg("terminating connection due to conflict with recovery"),
> 2842                      errdetail_recovery_conflict()));
> 2843         }
> 2844         else if (RecoveryConflictPending)
> 2845         {
> 2846             /* Currently there is only one non-retryable recovery conflict */
> 2847             Assert(RecoveryConflictReason == PROCSIG_RECOVERY_CONFLICT_DATABASE);
> 2848             pgstat_report_recovery_conflict(RecoveryConflictReason);
> 2849             ereport(FATAL,
> 2850                     (errcode(ERRCODE_DATABASE_DROPPED),
> 2851                      errmsg("terminating connection due to conflict with recovery"),
> 2852                      errdetail_recovery_conflict()));
> 2853         }
>
> AFAICS the first of these should be ereport(ERROR). Otherwise, irrespective
> of whether RecoveryConflictRetryable is true or false, we will always
> ereport(FATAL).

Which is fine, because we're below if (ProcDiePending). Note there's a separate path for QueryCancelPending. We go on to kill connections once the normal conflict handling has tried several times.

> 2. For my test, the error message itself looks wrong, because I did not
> actually remove any rows on the master. VACUUM probably marked a bunch of
> pages as all-visible, and that should have triggered a conflict on the
> standby in order to support index-only scans. IMHO we should improve the
> error message to avoid any confusion. Or we can add a new ProcSignalReason
> to differentiate between a cancel due to cleanup and one due to a
> visibilitymap_set() operation.

I think we desperately need to improve *all* of these messages with significantly more detail (cause for cancellation, relation, current xid, conflicting xid, current/last query).

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
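(To make the my_sock_read idea above concrete, here is a minimal sketch under stated assumptions: sketch_sock_read, sketch_secure_read, client_sock, and conflict_pending are invented names, and the real backend registers its read routine through a BIO_METHOD; only the recv(), BIO_*, and SSL_* calls are actual APIs. The point is that, by setting the retry-read flag instead of looping on EINTR, the BIO callback makes SSL_read() return with no OpenSSL frames left on the stack, which is where a saved conflict error can be handled safely.)

    /*
     * Sketch only, not PostgreSQL source.
     */
    #include <errno.h>
    #include <signal.h>
    #include <sys/socket.h>
    #include <openssl/bio.h>
    #include <openssl/ssl.h>

    static volatile sig_atomic_t conflict_pending = 0;
    static int  client_sock;            /* the connection's socket */

    /* Custom BIO read routine, in the spirit of be-secure.c's my_sock_read */
    static int
    sketch_sock_read(BIO *h, char *buf, int size)
    {
        int     res = recv(client_sock, buf, size, 0);

        BIO_clear_retry_flags(h);
        if (res < 0 && errno == EINTR)
        {
            /*
             * A signal (e.g. a recovery-conflict interrupt) broke the
             * recv().  Rather than retrying here, with OpenSSL frames
             * still on the stack, tell OpenSSL to report "want read":
             * SSL_read() then returns control to our caller.
             */
            BIO_set_retry_read(h);
        }
        return res;
    }

    /*
     * Caller, conceptually like secure_read(): once SSL_read() has
     * returned, no OpenSSL state is live on the stack, so a saved
     * conflict error can be raised safely before retrying the read.
     */
    static int
    sketch_secure_read(SSL *ssl, void *ptr, int len)
    {
        for (;;)
        {
            int     n = SSL_read(ssl, ptr, len);

            if (n > 0)
                return n;
            if (SSL_get_error(ssl, n) == SSL_ERROR_WANT_READ)
            {
                if (conflict_pending)
                    return -1;      /* caller re-raises the saved error */
                continue;           /* spurious wakeup; retry the read */
            }
            return -1;              /* EOF or a real error */
        }
    }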
On Tue, Dec 4, 2012 at 1:44 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>
>> After max_standby_streaming_delay, the standby starts cancelling the
>> queries. I get an error like this on the standby:
>> postgres=# explain verbose select count(b) from test WHERE a > 100000;
>> FATAL:  terminating connection due to conflict with recovery
>> DETAIL:  User query might have needed to see row versions that must be
>> removed.
>> HINT:  In a moment you should be able to reconnect to the database and
>> repeat your command.
>> server closed the connection unexpectedly
>>         This probably means the server terminated abnormally
>>         before or while processing the request.
>> The connection to the server was lost. Attempting reset: Succeeded.
>>
>> So I have a couple of questions/concerns here.
>>
>> 1. Why throw a FATAL error here? A plain ERROR should be enough to
>> abort the transaction. There are four places in ProcessInterrupts() where
>> we throw these kinds of errors, and three of them are FATAL.
>
> The problem here is that we're in IDLE IN TRANSACTION in this case, which
> currently cannot be cancelled (i.e. pg_cancel_backend() just won't do
> anything).
>
> There are two problems making this non-trivial. For one, while we're in
> IDLE IN TXN the client doesn't expect a response on a protocol level, so
> we can't simply ereport() at that time.
>
> For another, when we're in IDLE IN TXN we're potentially inside OpenSSL,
> so we can't jump out of there anyway because that would quite likely
> corrupt the internal state of OpenSSL.
>
> I tried to fix this before (c.f. "Idle in transaction cancellation" or
> similar), but while I had some kind of fix for the first issue (I saved
> the error and reported it later, when the protocol state allowed it), I
> missed the jumping-out-of-OpenSSL bit. I think it's not that hard to
> solve, though. I remember having something preliminary, but I never had
> the time to finish it. If I remember correctly, the trick was to set
> OpenSSL into non-blocking mode temporarily and return to the caller
> inside be-secure.c:my_sock_read.

Thanks Andres. I also read the original thread, and I now understand why we are using FATAL here, at least until we have a better solution. Obviously, the connection reset is no good either because, as someone commented in the original discussion, it made me think I was seeing a server crash when it was not.
>> AFAICS the first of these should be ereport(ERROR). Otherwise, irrespective
>> of whether RecoveryConflictRetryable is true or false, we will always
>> ereport(FATAL).
>
> Which is fine, because we're below if (ProcDiePending). Note there's a
> separate path for QueryCancelPending. We go on to kill connections once
> the normal conflict handling has tried several times.

Ok, understood. I now see that every path below if (ProcDiePending) ends in an ereport(FATAL), albeit with different error codes. That explains the current code.
> I think we desperately need to improve *all* of these messages with
> significantly more detail (cause for cancellation, relation, current
> xid, conflicting xid, current/last query).

I agree.
Thanks,
Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
On Tue, Dec 4, 2012 at 6:01 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
> Thanks Andres. I also read the original thread, and I now understand why
> we are using FATAL here, at least until we have a better solution.
> Obviously, the connection reset is no good either because, as someone
> commented in the original discussion, it made me think I was seeing a
> server crash when it was not.

How about the attached comment, to be added at the appropriate place? I've extracted this mostly from Tom's explanation in the original thread.

Thanks,
Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
Pavan,

* Pavan Deolasee (pavan.deolasee@gmail.com) wrote:
> On Tue, Dec 4, 2012 at 6:01 PM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote:
>> Thanks Andres. I also read the original thread, and I now understand why
>> we are using FATAL here, at least until we have a better solution.
>> Obviously, the connection reset is no good either because, as someone
>> commented in the original discussion, it made me think I was seeing a
>> server crash when it was not.
>
> How about the attached comment, to be added at the appropriate place? I've
> extracted this mostly from Tom's explanation in the original thread.

I was hoping to see an update to the actual error messages returned in this patch. I agree that it's good to add the comments, but that doesn't do anything to help the user out in this case.

Regarding the actual comment, here's the wording that I'd use:

-----------------------
If we are in DoingCommandRead state, we can't use ereport(ERROR) because we can't long jumps in this state. If we attempt to longjmps in this state, we not only risk breaking protocol at our level, but also risk leaving openssl in an inconsistent state, either violating the ssl protocol or having its internal state trashed because it was interrupted while in the middle of changing that state.

Currently, the only option is to promote ERROR to FATAL until we figure out a way to handle errors more effectively while in this state.
-----------------------

If you agree with that wording update, can you update the patch?

Thanks,

Stephen
At 2012-12-29 14:23:45 -0500, sfrost@snowman.net wrote:
>
> Regarding the actual comment, here's the wording that I'd use:

Sorry for nitpicking, but "we can't long jumps" made me cringe.

Here's a slightly more condensed version:

    /*
     * We can't use ereport(ERROR) here, because any longjmps
     * in DoingCommandRead state run the risk of violating our
     * protocol or the SSL protocol, by interrupting OpenSSL in
     * the middle of changing its internal state.
     *
     * Currently, the only option is to promote ERROR to FATAL
     * until we figure out a better way to handle errors in this
     * state.
     */

Patch along these lines attached, which also removes trailing whitespace from the original patch.

-- Abhijit
On Thu, Jan 17, 2013 at 9:26 AM, Abhijit Menon-Sen <ams@2ndquadrant.com> wrote:
> At 2012-12-29 14:23:45 -0500, sfrost@snowman.net wrote:
>>
>> Regarding the actual comment, here's the wording that I'd use:
>
> Sorry for nitpicking, but "we can't long jumps" made me cringe.
>
> Here's a slightly more condensed version:
>
>     /*
>      * We can't use ereport(ERROR) here, because any longjmps
>      * in DoingCommandRead state run the risk of violating our
>      * protocol or the SSL protocol, by interrupting OpenSSL in
>      * the middle of changing its internal state.
>      *
>      * Currently, the only option is to promote ERROR to FATAL
>      * until we figure out a better way to handle errors in this
>      * state.
>      */
>
> Patch along these lines attached, which also removes trailing
> whitespace from the original patch.

Thanks Stephen and Abhijit for improving the comments. I like this wording, so +1 from my side. Abhijit, do you want to add the patch and change the CF status appropriately?

Thanks,
Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
Abhijit Menon-Sen <ams@2ndQuadrant.com> writes:
> Sorry for nitpicking, but "we can't long jumps" made me cringe.

Agreed :-(

> Here's a slightly more condensed version:

I find this still not quite right, because where it's placed, it applies to both the DoingCommandRead and !DoingCommandRead branches of the if/else statement. The wording would be okay if placed inside the first if branch, but I think visually that would look ugly. I'm inclined to suggest wording it like

     * If we're in DoingCommandRead state, we can't use ereport(ERROR),
     * because any longjmp would risk interrupting OpenSSL operations
     * and thus confusing that library and/or violating wire protocol.

plus your second para as-is.

But having said that ... are we sure this code is not actually broken? ISTM that if we dare not interrupt for fear of confusing OpenSSL, we cannot safely attempt to send an error message to the client either; but ereport(FATAL) will try exactly that.

            regards, tom lane
On Thu, Jan 17, 2013 at 12:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> But having said that ... are we sure this code is not actually broken?

I'm not.

> ISTM that if we dare not interrupt for fear of confusing OpenSSL, we
> cannot safely attempt to send an error message to the client either;
> but ereport(FATAL) will try exactly that.

I thought that since FATAL will force the backend to exit, we don't care much about corrupted OpenSSL state. I even thought that's why we raise ERROR to FATAL, so that the backend can start in a clean state. But clearly I'm missing a point here, because you don't think that way.

Thanks,
Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
On 2013-01-17 01:38:31 -0500, Tom Lane wrote:
> But having said that ... are we sure this code is not actually broken?
> ISTM that if we dare not interrupt for fear of confusing OpenSSL, we
> cannot safely attempt to send an error message to the client either;
> but ereport(FATAL) will try exactly that.

You're absolutely right. ISTM that to fix it we would have to either provide a COMMERROR-like facility for FATAL errors or just remove the requirement of raising an error exactly there. If I remember what I tried before correctly, the latter involves setting OpenSSL into non-blocking mode, checking for the saved error in the EINTR loop in be-secure.c, and re-raising the error from there.

Do we want to backport either? There hasn't been any report that I could link to this so far, but on the other hand it could possibly cause rather ugly problems (data leakage, segfaults, and code execution aren't all that improbable).

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Pavan Deolasee <pavan.deolasee@gmail.com> writes:
> On Thu, Jan 17, 2013 at 12:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> ISTM that if we dare not interrupt for fear of confusing OpenSSL, we
>> cannot safely attempt to send an error message to the client either;
>> but ereport(FATAL) will try exactly that.

> I thought that since FATAL will force the backend to exit, we don't care
> much about corrupted OpenSSL state. I even thought that's why we raise
> ERROR to FATAL, so that the backend can start in a clean state. But
> clearly I'm missing a point here, because you don't think that way.

If we were to simply exit(1), leaving the kernel to close the client socket, it'd be safe enough, because control would never have returned to OpenSSL. But this code doesn't do that. What we're looking at is that we've interrupted OpenSSL at some arbitrary point, and now we're going to make fresh calls to it to try to pump the FATAL error message out to the client. It seems fairly unlikely that that's safe. I'm not sure I credit Andres' worry of arbitrary code execution, but I do fear that OpenSSL could get confused to the point of freezing up, or, even more likely, that it would transmit garbage to the client, which rather defeats the purpose.

I don't see a nice fix. The COMMERROR approach (ie, don't try to send anything to the client, only to the log) is not nice at all, since the client would get the impression that the server crashed. On the other hand, anything else requires waiting till we get control back from OpenSSL, which might be a long time, and meanwhile we're still holding locks that prevent WAL recovery from proceeding.

            regards, tom lane
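(For reference, COMMERROR is an existing error level in PostgreSQL's elog.h: it is reported to the server log like LOG, is never sent to the client, and returns rather than throwing. A typical call looks like this, where the particular errcode and message are only illustrative:)

    ereport(COMMERROR,
            (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
             errmsg("terminating connection due to conflict with recovery")));

(That is exactly why, as Tom notes, the client would see nothing but a dropped connection.)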
On 2013-01-17 10:19:23 -0500, Tom Lane wrote:
> Pavan Deolasee <pavan.deolasee@gmail.com> writes:
>> I thought that since FATAL will force the backend to exit, we don't care
>> much about corrupted OpenSSL state. I even thought that's why we raise
>> ERROR to FATAL, so that the backend can start in a clean state. But
>> clearly I'm missing a point here, because you don't think that way.
>
> If we were to simply exit(1), leaving the kernel to close the client
> socket, it'd be safe enough, because control would never have returned to
> OpenSSL. But this code doesn't do that. What we're looking at is that
> we've interrupted OpenSSL at some arbitrary point, and now we're going
> to make fresh calls to it to try to pump the FATAL error message out to
> the client. It seems fairly unlikely that that's safe. I'm not sure
> I credit Andres' worry of arbitrary code execution, but I do fear that
> OpenSSL could get confused to the point of freezing up, or, even more
> likely, that it would transmit garbage to the client, which rather
> defeats the purpose.

I don't think it's likely either; I seem to remember it copying around function pointers, though, so it seems possible with some bad luck.

> I don't see a nice fix. The COMMERROR approach (ie, don't try to send
> anything to the client, only to the log) is not nice at all, since the
> client would get the impression that the server crashed. On the other
> hand, anything else requires waiting till we get control back from
> OpenSSL, which might be a long time, and meanwhile we're still holding
> locks that prevent WAL recovery from proceeding.

I think we can make OpenSSL return pretty much immediately if we assume recv() can reliably be interrupted by signals, possibly by setting the socket to non-blocking in the signal handler. We just need to tell OpenSSL not to retry immediately, and we should be fine. Given that quite some people use OpenSSL with non-blocking sockets, that code path should be reasonably safe.

That still requires ugliness around saving the error and re-raising it after returning from OpenSSL, though...

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
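(A minimal sketch of that last suggestion, under stated assumptions: recovery_conflict_handler, client_sock, and conflict_pending are invented names, and this shows only the non-blocking flip, not the save-and-re-raise machinery. fcntl() is on POSIX's list of async-signal-safe functions, so calling it from a handler is permitted.)

    #include <fcntl.h>
    #include <signal.h>

    static volatile sig_atomic_t conflict_pending = 0;
    static int  client_sock = -1;   /* set when the connection is accepted */

    static void
    recovery_conflict_handler(int signo)
    {
        int     flags;

        conflict_pending = 1;

        /*
         * Flip the socket to non-blocking.  Even if this signal races
         * with recv() restarting after EINTR, the next read attempt
         * fails immediately with EAGAIN, so OpenSSL reports
         * SSL_ERROR_WANT_READ and control returns to the caller, where
         * the saved error can be re-raised safely.
         */
        flags = fcntl(client_sock, F_GETFL);
        if (flags >= 0)
            (void) fcntl(client_sock, F_SETFL, flags | O_NONBLOCK);
    }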