Re: Hot Standby introduced problem with query cancel behavior - Mailing list pgsql-hackers

From Andres Freund
Subject Re: Hot Standby introduced problem with query cancel behavior
Msg-id 4B4C4165.9070902@anarazel.de
In response to Re: Hot Standy introduced problem with query cancel behavior  (Simon Riggs <simon@2ndQuadrant.com>)
List pgsql-hackers
On 01/12/10 09:40, Simon Riggs wrote:
> On Tue, 2010-01-12 at 06:30 +0100, Andres Freund wrote:
>> On 01/07/10 22:37, Andres Freund wrote:
>>> On Thursday 07 January 2010 22:28:46 Tom Lane wrote:
>>>> Andres Freund <andres@anarazel.de> writes:
>>>>> I did not want to suggest using Simon's code there. Sorry for the brevity.
>>>>> It should have read as "revert to old code and add two-step killing (that's
>>>>> likely around 10 lines or so)".
>>>>>
>>>>> "two step killing" meaning that we signal ERROR for a few times and if
>>>>> nothing happens that we like, we signal FATAL.
>>>>> As the code already loops around signaling anyway that should be easy to
>>>>> implement.
>>>> Ah.  This loop happens in the process that's trying to send the cancel
>>>> signal, correct, not the one that needs to respond to it?  That sounds
>>>> fairly sane to me.
>>> Yes.
>>>
>>>
>>>> There are some things we could do to make it more likely that a cancel
>>>> of this type is accepted --- for instance, give it a distinct SQLSTATE
>>>> code that *can not* be trapped by plpgsql EXCEPTION blocks --- but there
>>>> is no practical way to guarantee it except elog(FATAL).  I'm not
>>>> entirely convinced that an untrappable error would be a good thing
>>>> anyway; it's hard to argue that that's much better than a FATAL.
>>> Well, a session that is usable after a transaction abort is quite valuable -
>>> quite a bit of software I know handles a failing transaction much more
>>> gracefully than a session abort (e.g. because it has to deal with
>>> serialization failures and such anyway).
>>>
>>> So making it caught in fewer places and degrading to FATAL sounds sensible
>>> and relatively easy to me.
>>> Unless somebody disagrees I will give it a shot.
>> Ok, here is a stab at that:
>>
>> 1. Allow the signal multiplexing facility to transfer one sig_atomic_t
>> worth of data. This is useful e.g. for making operations idempotent
>> or more precise.
>>
>> In this case the LocalBackendId is transported - that should make it
>> impossible to cancel the wrong transaction.
>
> This doesn't remove the possibility of cancelling the wrong transaction;
> it just reduces the chances. You can still cancel a session with
> LocalBackendId == 1 and then have it accidentally cancel the next
> session with the same backendid. That's why I didn't chase this idea
> myself earlier.
Hm. I don't think so. Backend initialization clears those flags. And the 
signal is sent to a specific pid, so you can't send it to a new backend 
when targeting the old one.

So how should that occur?
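
To make the mechanism concrete, this is roughly its shape; the 
pss_signalData field and the two helpers below are the new, invented 
parts, not necessarily what the patch uses verbatim:

    /* Sketch only - an extra payload word per multiplexed signal. */
    typedef struct
    {
        pid_t           pss_pid;
        sig_atomic_t    pss_signalFlags[NUM_PROCSIGNALS];
        sig_atomic_t    pss_signalData[NUM_PROCSIGNALS];    /* new */
    } ProcSignalSlot;

    /*
     * Sender: stamp the target's backend id into the slot before raising
     * the flag.  The slot is additionally tied to a specific pid, so a
     * signal aimed at an exited backend cannot reach its successor.
     */
    static void
    SetProcSignalData(ProcSignalSlot *slot, ProcSignalReason reason,
                      sig_atomic_t data)
    {
        slot->pss_signalData[reason] = data;
        slot->pss_signalFlags[reason] = true;
    }

    /* Receiver: ignore cancels not aimed at the current incarnation. */
    static bool
    ProcSignalDataMatches(ProcSignalSlot *slot, ProcSignalReason reason)
    {
        return slot->pss_signalData[reason] == MyBackendId;
    }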

> Closing that loophole isn't a priority and is best left until the
> other things have been cleaned up.
Yes, maybe. It was just rather easy to fix, and it fixed the problem 
sufficiently that I could no longer reproduce it, in contrast to seeing 
the problem during a regular test run.

>> 2.
>>
>> AbortTransactionAndAnySubtransactions is only used in the main loop's
>> error handler, as it should be unproblematic there.
>>
>> In the current CVS code, ConditionalVirtualXactLockTableWait() in
>> ResolveRecoveryConflictWithVirtualXIDs does the wait for every attempt
>> at cancelling the other transaction.
>>
>> I moved the retry logic into CancelVirtualTransaction(). If signalling
>> ERROR 50 times does nothing, it degrades to FATAL.
>
> It's a good idea but not the right one, IMHO. I'd rather have the
> backend decide whether the instruction results in an ERROR or a FATAL
> based upon its own state, rather than have the startup process bang its
> head against the wall 50 times without knowing why.
I don't think it's that easy to keep enough state there in a safe manner. 
Also, the concept of retrying is not new - I just made it degrade rather 
than re-wait for max_standby_delay (which IMHO is a bug).

In many cases the retries and repeated ERRORs will be enough to get the 
backend out of all PG_TRY/CATCH blocks, so that the next ERROR reaches 
the main loop. So the retrying is quite sensible - and you can't do that 
part in the signalled backend itself.
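
In rough code, the startup-process side I have in mind looks about like 
this (the constant and helper names are made up for illustration):

    /* Sketch of the retry logic in CancelVirtualTransaction(). */
    #define CANCEL_RETRIES_BEFORE_FATAL 50

    for (retries = 0; retries < CANCEL_RETRIES_BEFORE_FATAL; retries++)
    {
        if (!VirtualXactStillRunning(vxid))
            return;             /* conflict resolved, we are done */

        /* First ask politely: a trappable transaction abort. */
        SignalRecoveryConflict(vxid, CONFLICT_MODE_ERROR);
        pg_usleep(100000L);     /* 100ms between attempts */
    }

    /*
     * The backend survived repeated ERRORs, so it is presumably stuck
     * inside PG_TRY/CATCH or an EXCEPTION block.  Escalate to FATAL,
     * which terminates the session and cannot be trapped.
     */
    SignalRecoveryConflict(vxid, CONFLICT_MODE_FATAL);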


>> XXX: I temporarily do not use the skipExistingConflicts argument of
>> GetConflictingVirtualXIDs - I don't understand its purpose, and a bit
>> of infrastructure is missing right now, as the recoveryConflictMode is
>> no longer stored anywhere. It can easily be added back though.
Is that a leftover from additional capabilities (deferred conflict 
handling?)? Currently there will never be a case with two different 
cancellations at the same time. Also, the current logic throws away a 
FATAL error if an ERROR is already there.

>> 3.
>> Add a new error code ERRCODE_QUERY_CANCELED_HS for use with HS,
>> indicating a failure that is more than a plain
>> ERRCODE_QUERY_CANCELED - namely, it should not be caught in
>> various places like savepoints and in PLs.
>>
>> As an example, I check for that error code in plpgsql and make it
>> uncatchable.
>>
>> I am not sure how new error codes get chosen though - and the name
>> is not that nice.
>>
>> Opinions on that?
>
> I don't trust it yet, as a mechanism. Changing only PL/pgSQL seems
> fairly broken also. Again, this seems like something that can be handled
> separately.
Well, that was an exemplary change - it's easy to fix that in other 
places as well. Without an identifier for such an abort, I don't see how 
that could work.

We simply can't break out of nested transactions if we're outside the 
main loop.
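
For illustration, in plpgsql the change amounts to something like this in 
the code that matches a caught error against an EXCEPTION clause (a 
sketch; sqlstate_matches stands in for the existing matching logic):

    /*
     * Sketch: never let an EXCEPTION block trap a Hot Standby conflict
     * cancel, whatever conditions it lists - the error must reach the
     * main loop so the whole (sub)transaction stack gets aborted.
     */
    static bool
    exception_matches_conditions(ErrorData *edata, PLpgSQL_condition *cond)
    {
        if (edata->sqlerrcode == ERRCODE_QUERY_CANCELED_HS)
            return false;

        for (; cond != NULL; cond = cond->next)
        {
            if (sqlstate_matches(edata->sqlerrcode, cond->sqlerrstate))
                return true;
        }
        return false;
    }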

>> I copied quite a bit from Simon's earlier patch...
> It changes a few things around and adds the ideas you've mentioned,
> though seeing them as code doesn't in itself move the discussion
> forwards. There are far too many new ideas in this patch for it to be
> even close to acceptable, to me. Yes, let's discuss these additional
> ideas, but after a basic patch has been applied.
Sure, the "targeted" signalling part can be left of, yes. But the rest 
is mostly necessary I think.

> I do value your interest and input, though racing me to rework my own
> patch, then asking me to review it seems like wasted time for both of
> us. I will post a new patch on ~Friday.
Well, I explicitly asked whether you or somebody else was going after 
this, waited two days, and only then started working on it.
And you seem to have enough on your plate...

> (Also, please make your braces { } follow the normal coding conventions
> for Postgres.)
Sure.

>> Currently the patch does not yet do anything to avoid letting the
>> protocol get out of sync. What do you think about adding a flag for
>> error codes not to communicate with the client (similar to COMMERROR)?
>>
>> So that one could do an elog(ERROR | ERROR_NO_SEND_CLIENT, ...) or such?
> Seems like a fairly important piece.
It's quite a bit of manual work though, so I wanted to be sure that's an 
agreeable approach.
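
To sketch the flag idea (the bit value and the hide_from_client field are 
invented for illustration; the rest follows existing elog.c structure):

    /*
     * Sketch only: a flag bit ORed into the elevel, telling elog.c not
     * to send the report to the client - analogous to how COMMERROR
     * already suppresses client output.
     */
    #define ERROR_NO_SEND_CLIENT    0x100   /* outside the elevel range */

    /* In errstart(): strip the flag and remember it in the ErrorData. */
    edata->hide_from_client = (elevel & ERROR_NO_SEND_CLIENT) != 0;
    elevel &= ~ERROR_NO_SEND_CLIENT;

    /* In EmitErrorReport(): skip the frontend, keep the server log. */
    if (whereToSendOutput == DestRemote && !edata->hide_from_client)
        send_message_to_frontend(edata);

Callers would then write e.g.:

    ereport(ERROR | ERROR_NO_SEND_CLIENT,
            (errcode(ERRCODE_QUERY_CANCELED_HS),
             errmsg("canceling statement due to recovery conflict")));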


Andres

