Re: RFC: Logging plan of the running query - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: RFC: Logging plan of the running query
Msg-id: CA+TgmobH+Uto9MCD+vWc71bVbOnd7d8zeYjRT8nXaeLe5hsNJQ@mail.gmail.com
In response to: Re: RFC: Logging plan of the running query (torikoshia <torikoshia@oss.nttdata.com>)
List: pgsql-hackers
Hi, I've just been catching up on this thread.

+	if (MyProc->heldLocks)
+	{
+		ereport(LOG_SERVER_ONLY,
+				errmsg("ignored request for logging query plan due to lock conflicts"),
+				errdetail("You can try again in a moment."));
+		return;
+	}

I don't like this for several reasons.

First, I think it's not nice to have a request just get ignored. A user will expect that if we cannot immediately respond to some request, we'll respond later at the first time that it's safe to do so, rather than just ignoring it and telling them to retry.

Second, I don't think that the error message is very good. It talks about lock conflicts, but we don't know that there is any real problem. We know that, if we enter this block, the server is in the middle of trying to acquire some lock, and we also know that we could attempt to acquire a lock as part of generating the EXPLAIN output, and maybe that's an issue. But that's not a lock conflict. That's a re-entrancy problem. I don't know that we want to talk about re-entrancy problems in an error message, and I don't really think this error message should exist in the first place, but if we're going to error out in this case, surely we shouldn't do so with an error message that describes a problem we don't have.

Third, I think that the re-entrancy problems with this patch may extend well beyond lock acquisition. This is one example of how it can be unsafe to do something as complicated as EXPLAIN at any arbitrary CHECK_FOR_INTERRUPTS(). It is not correct to say, as http://postgr.es/m/CAAaqYe9euUZD8bkjXTVcD9e4n5c7kzHzcvuCJXt-xds9X4c7Fw@mail.gmail.com does, that the problems with running EXPLAIN at an arbitrary point are specific to running this code in an aborted transaction. The existence of this code shows that there is at least one hazard even if the current transaction is not aborted, and I see no analysis on this thread indicating whether there are any more such hazards, or how we could go about finding them all.

I think the issue is very general. We have lots of subsystems that both (a) use global variables and (b) contain CHECK_FOR_INTERRUPTS(). If we process an interrupt while that code is in the middle of manipulating its global variables and then re-enter that code, bad things might happen. We could try to rule that out by analyzing all such subsystems and all places where CHECK_FOR_INTERRUPTS() may appear, but that's very difficult.

Suppose we took the alternative approach recommended by Andrey Lepikhov in http://postgr.es/m/b1b110ae-61f6-4fd9-9b94-f967db9b53d4@app.fastmail.com and instead set a flag that gets handled in some suitable place in the executor code, like ExecProcNode(). If we did that, then we would know that we're not in the middle of a syscache lookup, a catcache lookup, a heavyweight lock acquisition, an ereport, or any other low-level subsystem call that might create problems for the EXPLAIN code. In that design, the hack shown above would go away, and we could be much more certain that we don't need any other similar hacks for other subsystems. The only downside is that we might experience a slightly longer delay before a requested EXPLAIN plan actually shows up, but that seems like a pretty small price to pay for being able to reason about the behavior of the system. I don't *think* there are any cases where we run in the executor for a particularly long time without a new call to ExecProcNode().
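Concretely, I'm imagining something along these lines. This is only a rough sketch of the flag-based design, and every name in it (HandleLogQueryPlanInterrupt, LogCurrentQueryPlan, the wrapper itself) is invented for illustration rather than taken from the posted patch:

#include "postgres.h"

#include <signal.h>

#include "executor/executor.h"

/* Hypothetical helper that runs the EXPLAIN machinery and logs the text. */
extern void LogCurrentQueryPlan(PlanState *node);

/* Set from the interrupt path, which itself does nothing complicated. */
static volatile sig_atomic_t LogQueryPlanPending = false;

void
HandleLogQueryPlanInterrupt(void)
{
	LogQueryPlanPending = true;
}

/*
 * Poll the flag between tuples.  At this point we know we're not in the
 * middle of a syscache lookup, a lock acquisition, or an ereport, so the
 * EXPLAIN code's own catalog accesses can't re-enter anything.
 */
static TupleTableSlot *
ExecProcNodeWithPlanLogging(PlanState *node)
{
	if (LogQueryPlanPending)
	{
		LogQueryPlanPending = false;
		LogCurrentQueryPlan(node);
	}
	return node->ExecProcNodeReal(node);
}

The wrapper pattern here is the same one the executor already uses for instrumentation (ExecProcNodeInstr), and when no request is pending the added cost is a single flag test per tuple.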
I think this case is a bit like vacuum_delay_point(). You might think that vacuum_delay_point() could be moved inside of CHECK_FOR_INTERRUPTS(), but we've made the opposite decision: every vacuum_delay_point() calls CHECK_FOR_INTERRUPTS(), but not every CHECK_FOR_INTERRUPTS() calls vacuum_delay_point(). That means that we can allow vacuum_delay_point() only in places where we know it's safe, rather than at every CHECK_FOR_INTERRUPTS(). I think that's a pretty smart decision, even for vacuum_delay_point(), and AFAICS the code we're proposing to run here does things that are substantially more complicated than what vacuum_delay_point() does. That code just does a bit of reading of shared memory, reports a wait event, and sleeps. That's really simple compared to code that could try to do relcache builds, which can trigger syscache lookups, which can in turn trigger heavyweight lock acquisition, lightweight lock acquisition, bringing pages into shared_buffers possibly through physical I/O, processing of invalidation messages, and a bunch of other stuff.
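For reference, vacuum_delay_point() is roughly shaped like this. This is a heavily abridged paraphrase, not the verbatim source; the real function (in vacuum.c) also checks InterruptPending and caps the sleep time:

#include "postgres.h"

#include "miscadmin.h"
#include "pgstat.h"

void
vacuum_delay_point(void)
{
	/* Every delay point is also an interrupt check... */
	CHECK_FOR_INTERRUPTS();

	/* ...but the sleep happens only where vacuum code has opted in. */
	if (VacuumCostActive && VacuumCostBalance >= VacuumCostLimit)
	{
		double		msec = VacuumCostDelay * VacuumCostBalance / VacuumCostLimit;

		pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
		pg_usleep((long) (msec * 1000));
		pgstat_report_wait_end();

		VacuumCostBalance = 0;
	}
}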
It's really hard for me to accept that the heavyweight lock problem for which the patch contains a workaround is the only one that exists. I can't see any reason why that should be true.

...Robert