Re: Hot Standy introduced problem with query cancel behavior - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: Hot Standy introduced problem with query cancel behavior
Msg-id: 4B4C4165.9070902@anarazel.de
In response to: Re: Hot Standy introduced problem with query cancel behavior (Simon Riggs <simon@2ndQuadrant.com>)
List: pgsql-hackers
On 01/12/10 09:40, Simon Riggs wrote:
> On Tue, 2010-01-12 at 06:30 +0100, Andres Freund wrote:
>> On 01/07/10 22:37, Andres Freund wrote:
>>> On Thursday 07 January 2010 22:28:46 Tom Lane wrote:
>>>> Andres Freund <andres@anarazel.de> writes:
>>>>> I did not want to suggest using Simon's code there. Sorry for the
>>>>> brevity. It should have read as "revert to old code and add two-step
>>>>> killing (that's likely around 10 lines or so)".
>>>>>
>>>>> "Two-step killing" meaning that we signal ERROR a few times and, if
>>>>> nothing happens that we like, we signal FATAL.
>>>>> As the code already loops around signalling anyway, that should be
>>>>> easy to implement.
>>>> Ah. This loop happens in the process that's trying to send the cancel
>>>> signal, correct, not the one that needs to respond to it? That sounds
>>>> fairly sane to me.
>>> Yes.
>>>
>>>> There are some things we could do to make it more likely that a cancel
>>>> of this type is accepted --- for instance, give it a distinct SQLSTATE
>>>> code that *can not* be trapped by plpgsql EXCEPTION blocks --- but
>>>> there is no practical way to guarantee it except elog(FATAL). I'm not
>>>> entirely convinced that an untrappable error would be a good thing
>>>> anyway; it's hard to argue that that's much better than a FATAL.
>>> Well, a session which is usable after a transaction abort is quite
>>> sensible - quite some software I know handles a failing transaction
>>> much more gracefully than a session abort (e.g. because it has to deal
>>> with serialization failures and such anyway).
>>>
>>> So making it caught in fewer places and degrading to FATAL sounds
>>> sensible and relatively easy to me.
>>> Unless somebody disagrees I will give it a shot.
>> Ok, here is a stab at that:
>>
>> 1. Allow the signal multiplexing facility to transfer one sig_atomic_t
>> worth of data. This is useful e.g. for making operations idempotent
>> or more precise.
>> In this the LocalBackendId is transported - that should make it
>> impossible to cancel the wrong transaction.

> This doesn't remove the possibility of cancelling the wrong transaction,
> it just reduces the chances. You can still cancel a session with
> LocalBackendId == 1 and then have it accidentally cancel the next
> session with the same backendid. That's why I didn't chase this idea
> myself earlier.

Hm, I don't think so. Backend initialization clears those flags, and the
signal is sent to a specific pid, so you can't send it to a new backend
when targeting the old one. So how should that occur?

> Closing that loophole isn't a priority and is best left until the other
> things have been cleaned up.

Yes, maybe. It was just rather easy to fix, and it fixed the problem
sufficiently that I could no longer reproduce it, in contrast to seeing
the problem during a regular test run.

>> 2.
>>
>> AbortTransactionAndAnySubtransactions is only used in the main loop's
>> error handler, as it should be unproblematic there.
>>
>> In the current CVS code, ConditionalVirtualXactLockTableWait() in
>> ResolveRecoveryConflictWithVirtualXIDs does the wait for every try of
>> cancelling the other transaction.
>>
>> I moved the retry logic into CancelVirtualTransaction(). If 50 ERRORs
>> achieve nothing, it degrades to FATAL.

> It's a good idea but not the right one, IMHO. I'd rather that the
> backend decided whether the instruction results in an ERROR or a FATAL
> based upon its own state, rather than have the Startup process bang its
> head against the wall 50 times without knowing why.

I don't think it's that easy to keep enough state there in a safe manner.
Also, the concept of retrying is not new - I just made it degrade rather
than re-wait for max_standby_delay on every try (which imho is a bug). In
many cases the retries and repeated ERRORs will be enough to free the
backend out of all PG_TRY/CATCH blocks, so that the next ERROR reaches
the main loop.
So the retrying is quite sensible - and you can't do that part in the
signalled backend itself.

>> XXX: I temporarily do not use the skipExistingConflicts argument of
>> GetConflictingVirtualXIDs - I don't understand its purpose, and a bit
>> of infrastructure is missing right now, as the recoveryConflictMode is
>> not stored anywhere anymore. It can easily be added back though.

Is that a leftover from additional capabilities (deferred conflict
handling?)? Because currently there will never be a case with two
different cancellations at the same time. Also, the current logic throws
away a FATAL error if an ERROR is already there.

>> 3.
>> Add a new error code ERRCODE_QUERY_CANCELED_HS for use with HS,
>> indicating a failure that is more than a plain ERRCODE_QUERY_CANCELED -
>> namely, it should not be caught in various places like savepoints and
>> in PLs.
>>
>> As an example, I checked for that error code in plpgsql and made it
>> uncatchable.
>>
>> I am not sure how new error codes get chosen though - and the name is
>> not that nice.
>>
>> Opinions on that?

> I don't trust it yet, as a mechanism. Changing only PL/pgSQL seems
> fairly broken also. Again, this seems like something that can be
> handled separately.

Well, that was an exemplary change - it's easy to do the same at other
places as well. Without some identifier for such an abort I don't see how
this could work at all: we simply can't break out of nested transactions
while we're outside the main loop.

>> I copied quite a bit from Simon's earlier patch...

> It changes a few things around and adds the ideas you've mentioned,
> though seeing them as code doesn't in itself move the discussion
> forwards. There are far too many new ideas in this patch for it to be
> even close to acceptable, to me. Yes, let's discuss these additional
> ideas, but after a basic patch has been applied.

Sure, the "targeted" signalling part can be left off, yes. But the rest
is mostly necessary, I think.
> I do value your interest and input, though racing me to rework my own
> patch, then asking me to review it seems like wasted time for both of
> us. I will post a new patch on ~Friday.

Well, I explicitly asked whether you or somebody else was going after
this, waited two days, and only then started working on it. And you seem
to have enough on your plate...

> (Also, please make your braces { } follow the normal coding conventions
> for Postgres.)

Sure.

>> Currently the patch does not yet do anything to avoid letting the
>> protocol get out of sync. What do you think about adding a flag for
>> errors that should not be communicated to the client (similarly to
>> COMERROR)?
>>
>> So that one could do an elog(ERROR & ERROR_NO_SEND_CLIENT, ...) or
>> such?

> Seems a fairly important piece.

It's quite a bit of manual work though, so I wanted to be sure that's an
agreeable approach.

Andres