Thread: EvalPlanQual seems a tad broken
While fooling around with moving FOR UPDATE into a plan node (WIP version
attached), I came across this interesting behavior:

1. Create test tables:

create table t1(f1 int, f2 int);
insert into t1 values (1,1);
insert into t1 values (2,2);
create table t2(f3 int, f4 int);
insert into t2 values (1,111);
insert into t2 values (1,112);
insert into t2 values (2,113);
insert into t2 values (2,114);

2. Execute test query:

select * from t1 join t2 on f1=f3 for update;

 f1 | f2 | f3 | f4
----+----+----+-----
  1 |  1 |  1 | 111
  1 |  1 |  1 | 112
  2 |  2 |  2 | 113
  2 |  2 |  2 | 114
(4 rows)

3. In another session, execute:

begin;
update t1 set f2=42 where f1=2;

4. In first session, repeat test query:

select * from t1 join t2 on f1=f3 for update;

As expected, it blocks waiting for the second session to commit.

5. Now commit in the second session.  First session resumes and prints

 f1 | f2 | f3 | f4
----+----+----+-----
  1 |  1 |  1 | 111
  1 |  1 |  1 | 112
  2 | 42 |  2 | 113
  2 | 42 |  2 | 114
  2 | 42 |  2 | 113
  2 | 42 |  2 | 114
(6 rows)

Of course the expected answer is

 f1 | f2 | f3 | f4
----+----+----+-----
  1 |  1 |  1 | 111
  1 |  1 |  1 | 112
  2 | 42 |  2 | 113
  2 | 42 |  2 | 114
(4 rows)

which is what you'll get if you simply repeat the test query.

This isn't a new bug ... the same behavior can be demonstrated as far back
as 7.0.  The problem is that execMain.c is set up to pull as many rows as
it can from execution of an EvalPlanQual query.  Then, once that's
exhausted, it steps to the next result from the original query.  So if a
row that requires locking joins to more than one row in some other table,
you get duplicated output rows.

The attached patch alleviates some cases of this by having the new
LockRows plan node lock *all* the target rows, not just one, before firing
the EvalPlanQual query.  It doesn't fix the example above because only one
of the rows being joined is seen to need EPQ treatment.  We could improve
that by feeding successfully locked rows into the EPQ machinery as well as
ones that were found to be outdated.  But that would still leave us with
two failure cases:

1. if some of the tables being joined are not selected FOR UPDATE.

2. if the select involves any set-returning functions in the targetlist.

I think we could get around #1 by having *all* tables in the query marked
FOR UPDATE at least in a dummy form, ie give them entries in the rowMarks
list and create junk tlist entries to report their current ctid.  Then
we'd feed all the relevant rows into the EPQ machinery.  We'd just not
lock the ones we weren't asked to lock.

I do not see any very good way around #2.  I'm tempted to propose that we
just forbid SRFs in the targetlist of a FOR UPDATE query.  This could be
justified on the same grounds that we forbid aggregate functions there,
ie, they destroy the one-to-one correspondence between table rows and
SELECT output rows.  If you really had to have it you could do something
like

	select srf(...) from (select ... for update) ss;

Anyone have a better idea?

			regards, tom lane
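[A sketch of failure case #1 against the same t1/t2 test tables; the FOR
UPDATE OF variant is an illustration added here, not taken from the
message above.  Since t2 is not a rowMarks target, its rows are not
frozen during an EPQ recheck, so the recheck of an updated t1 row can
again join to both matching t2 rows and duplicate output:]

-- only t1's rows are locked; t2 is joined but not FOR UPDATE'd
select * from t1 join t2 on f1=f3 for update of t1;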
On Mon, 2009-10-12 at 11:42 -0400, Tom Lane wrote:

> The problem is that execMain.c is set up to pull as many rows as it can
> from execution of an EvalPlanQual query.  Then, once that's exhausted,
> it steps to the next result from the original query.  So if a row that
> requires locking joins to more than one row in some other table, you
> get duplicated output rows.

Surely the SQL Standard recognises such queries as failing test 23b) on
7.12 <query specification> (p379, SQL2008 Foundation).  So the query is
not potentially updateable and should give a runtime error when used with
FOR UPDATE.

Can we add something to check for uniqueness of the join and throw an
error when we detect this situation?  Seems like a better general
solution.

> I do not see any very good way around #2.  I'm tempted to propose
> that we just forbid SRFs in the targetlist of a FOR UPDATE query.
> This could be justified on the same grounds that we forbid aggregate
> functions there, ie, they destroy the one-to-one correspondence between
> table rows and SELECT output rows.  If you really had to have it you
> could do something like

Until we have a way to specify that the return set from an SRF is unique,
that seems the only way.  That would be desirable because it would allow
FunctionScans to be join-removed as well.

-- 
Simon Riggs
www.2ndQuadrant.com
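[One possible shape of the uniqueness check being suggested, sketched as a
hypothetical example; the t2u table is invented here for illustration and
is not part of the thread.  If the join column can be proven unique, each
t1 row matches at most one joined row and FOR UPDATE could be allowed;
otherwise the proposed check would raise a runtime error:]

create table t2u(f3 int unique, f4 int);
insert into t2u values (1,111), (2,113);

select * from t1 join t2u on f1=f3 for update;  -- provably one match per t1 row
select * from t1 join t2  on f1=f3 for update;  -- f3 repeats: check would error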
Simon Riggs <simon@2ndQuadrant.com> writes:
> On Mon, 2009-10-12 at 11:42 -0400, Tom Lane wrote:
>> The problem is that execMain.c is set up to pull as many rows as it can
>> from execution of an EvalPlanQual query.  Then, once that's exhausted,
>> it steps to the next result from the original query.  So if a row that
>> requires locking joins to more than one row in some other table, you
>> get duplicated output rows.

> Surely the SQL Standard recognises such queries as failing test 23b) on
> 7.12 <query specification> (p379, SQL2008 Foundation).  So the query is
> not potentially updateable and should give a runtime error using a FOR
> UPDATE.

You're confusing our implementation of FOR UPDATE with the spec.  As I
said earlier, they are only very loosely related.  In particular, our
reading of FOR UPDATE is to lock the referenced rows, not to enforce that
they are referenced only once.

			regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> 5. Now commit in the second session.  First session resumes and
> prints
>
>  f1 | f2 | f3 | f4
> ----+----+----+-----
>   1 |  1 |  1 | 111
>   1 |  1 |  1 | 112
>   2 | 42 |  2 | 113
>   2 | 42 |  2 | 114
>   2 | 42 |  2 | 113
>   2 | 42 |  2 | 114
> (6 rows)
>
> Of course the expected answer is
>
>  f1 | f2 | f3 | f4
> ----+----+----+-----
>   1 |  1 |  1 | 111
>   1 |  1 |  1 | 112
>   2 | 42 |  2 | 113
>   2 | 42 |  2 | 114
> (4 rows)
>
> which is what you'll get if you simply repeat the test query.

Is this related to issue 4593?  (SELECT FOR UPDATE can return results in a
sequence inconsistent with actual result rows and the ORDER BY clause --
rows are ordered by the pre-UPDATE values, while the results show the
post-UPDATE values.)

http://archives.postgresql.org/pgsql-bugs/2009-01/msg00017.php

On the face of it, it seems very similar.  Will the patch address this
anomaly of SELECT FOR UPDATE, too?

-Kevin
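[A hypothetical sketch of the #4593 anomaly described above, reusing the
t1 table from upthread; the concrete values are assumed, not taken from
that bug report:]

-- session 1
select * from t1 order by f2 for update;
-- blocks if session 2 has run:  update t1 set f2 = 0 where f1 = 2;
-- once session 2 commits, session 1 can print (1,1) before (2,0),
-- i.e. sorted by the pre-update f2 values even though the post-update
-- values are what gets displayed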
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Is this related to issue 4593? No, not directly. Now that locking is done in a separate plan node, we could think about addressing #4593 by switching the order of the LockRows and Sort plan nodes, but that has nothing to do with how well EvalPlanQual works. I was planning to start a separate thread discussing whether to do that, but it's a bit moot until we have a fix for EvalPlanQual --- at least one of the possible answers is to throw out EvalPlanQual altogether and do something else, in which case LockRows might go away again. regards, tom lane
I wrote:
> [ EvalPlanQual does not work well ]

I'm planning to go back to work on this now that we're out of CommitFest.

> We could improve that by feeding successfully locked rows into the EPQ
> machinery as well as ones that were found to be outdated.  But that
> would still leave us with two failure cases:

> 1. if some of the tables being joined are not selected FOR UPDATE.

> 2. if the select involves any set-returning functions in the targetlist.

> I think we could get around #1 by having *all* tables in the query
> marked FOR UPDATE at least in a dummy form, ie give them entries in
> the rowMarks list and create junk tlist entries to report their current
> ctid.  Then we'd feed all the relevant rows into the EPQ machinery.
> We'd just not lock the ones we weren't asked to lock.

On further review it seems that a better way to do this is to make things
happen inside the EPQ machinery.  We need to "freeze" the rows returned by
*all* scan nodes, not only the ones referencing real tables --- for
example, a join against a VALUES scan node would still be a problem if we
don't lock down the VALUES output, since we could end up getting multiple
join rows out.  This means we can't assume that there is a CTID associated
with every scan node that EPQ needs to lock down.  What looks like it
would work instead is to pass through the current scan tuple for every
scan plan node, not only the ones that are FOR UPDATE targets.  I'm
tempted to try to move the responsibility for this into execScan.c instead
of having all the individual scan plan types know about it.

> I do not see any very good way around #2.  I'm tempted to propose
> that we just forbid SRFs in the targetlist of a FOR UPDATE query.
> This could be justified on the same grounds that we forbid aggregate
> functions there, ie, they destroy the one-to-one correspondence between
> table rows and SELECT output rows.  If you really had to have it you
> could do something like

> select srf(...) from (select ... for update) ss;

This still seems like a necessary restriction unfortunately :-(.

			regards, tom lane
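[A minimal sketch of the VALUES case being described, written against the
t1 table from upthread; the example itself is an addition, not from the
message.  The VALUES list has no CTID to report, so unless its output is
frozen, an EPQ recheck of the updated t1 row can again join to both v rows
and duplicate output:]

select *
  from t1
  join (values (2, 'a'), (2, 'b')) as v(f3, tag) on f1 = f3
   for update of t1;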
I wrote:
> On further review it seems that a better way to do this is to make
> things happen inside the EPQ machinery.  We need to "freeze" the rows
> returned by *all* scan nodes, not only the ones referencing real tables
> --- for example, a join against a VALUES scan node would still be a
> problem if we don't lock down the VALUES output, since we could end up
> getting multiple join rows out.  This means we can't assume that there
> is a CTID associated with every scan node that EPQ needs to lock down.
> What looks like it would work instead is to pass through the current
> scan tuple for every scan plan node, not only the ones that are FOR
> UPDATE targets.  I'm tempted to try to move the responsibility for this
> into execScan.c instead of having all the individual scan plan types
> know about it.

What I had been thinking of when I wrote that was to pass down the
ScanTupleSlots from the outer query's scan nodes.  That codes up nicely
but doesn't work at all :-(.  As is obvious in hindsight, the scan nodes
are not necessarily still returning the same tuples that contribute to the
current join tuple --- for instance, if you have a sort-and-mergejoin type
of plan, all the scans will be at EOF by the time the top level sees any
tuples.

So we need to be able to recover the original scan tuples from the join
tuple, even for scans that are not to be locked.  For real tables this
isn't hard: we can pass up the CTID as a junk column, the same as we do
for tables that are to be locked.  It's harder for non-table scans though
(VALUES, functions, etc).  I can see two conceivably workable
alternatives:

1. Pass up the entire row value as a junk whole-row Var.

2. Invent some equivalent to CTID that would allow the row to be recovered
later --- for instance, the row number in a tuplestore.  One problem with
this is that not all those scan types use a tuplestore now, so we'd be
adding significant overhead.  Also, I'm not entirely sure that it can work
for scan nodes that get reset and rescanned repeatedly.  Some of them
clear and refill the tuplestore when they do that, and the refill isn't
necessarily the same row values.

Fortunately, this case probably doesn't arise that much in practice, so
while it needs to work, I doubt that performance is critical.

I'm planning to try alternative #1 next, but I wonder if anyone has a
better idea?

			regards, tom lane
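[Another hypothetical example of a non-table scan in a FOR UPDATE join,
added here for illustration; the function scan has no CTID, so under
alternative #1 its current row would have to be carried up as a junk
whole-row Var for an EPQ recheck to reuse:]

select t1.*, g.n
  from t1
  join generate_series(1, 2) as g(n) on g.n = t1.f1
   for update of t1;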