Thread: why partition pruning doesn't work?
Hi
CREATE TABLE data(a text, vlozeno date) PARTITION BY RANGE(vlozeno);
CREATE TABLE data_2016 PARTITION OF data FOR VALUES FROM ('2016-01-01') TO ('2016-12-31');
CREATE TABLE data_2017 PARTITION OF data FOR VALUES FROM ('2017-01-01') TO ('2017-12-31');
CREATE TABLE data_other PARTITION OF data DEFAULT;
insert into data select 'ahoj', '2016-01-01'::date + (random() * 900)::int from generate_series(1,1000000);
analyze data;
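Note that range partition bounds are inclusive at the lower end and exclusive at the upper end, so as defined above the rows for '2016-12-31' and '2017-12-31' land in the default partition rather than the year partitions. If whole calendar years were intended, bounds like the following would cover them (a sketch only; the plans below reflect the original definitions):

-- Upper bounds are exclusive, so name the first day of the following year
-- to include December 31 (sketch; not the DDL used for the plans below).
CREATE TABLE data_2016 PARTITION OF data
    FOR VALUES FROM ('2016-01-01') TO ('2017-01-01');
CREATE TABLE data_2017 PARTITION OF data
    FOR VALUES FROM ('2017-01-01') TO ('2018-01-01');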
postgres=# explain analyze select * from data where vlozeno > '2018-06-01';
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
╞══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Append (cost=0.00..3519.83 rows=20001 width=9) (actual time=0.042..27.750 rows=19428 loops=1) │
│ -> Seq Scan on data_other (cost=0.00..3419.83 rows=20001 width=9) (actual time=0.040..25.895 rows=19428 loops=1) │
│ Filter: (vlozeno > '2018-06-01'::date) │
│ Rows Removed by Filter: 171518 │
│ Planning Time: 0.766 ms │
│ Execution Time: 28.718 ms │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(6 rows)
postgres=# explain analyze select * from data where vlozeno > current_date;
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Gather (cost=1000.00..17281.36 rows=20080 width=9) (actual time=0.749..95.389 rows=19428 loops=1) │
│ Workers Planned: 2 │
│ Workers Launched: 2 │
│ -> Parallel Append (cost=0.00..14273.36 rows=8367 width=9) (actual time=59.141..89.458 rows=6476 loops=3) │
│ -> Parallel Seq Scan on data_2016 (cost=0.00..5768.69 rows=24 width=9) (actual time=34.847..34.847 rows=0 loops=3) │
│ Filter: (vlozeno > CURRENT_DATE) │
│ Rows Removed by Filter: 135119 │
│ -> Parallel Seq Scan on data_2017 (cost=0.00..5745.02 rows=23 width=9) (actual time=53.269..53.269 rows=0 loops=2) │
│ Filter: (vlozeno > CURRENT_DATE) │
│ Rows Removed by Filter: 201848 │
│ -> Parallel Seq Scan on data_other (cost=0.00..2717.82 rows=11765 width=9) (actual time=0.044..55.502 rows=19428 loops=1) │
│ Filter: (vlozeno > CURRENT_DATE) │
│ Rows Removed by Filter: 171518 │
│ Planning Time: 0.677 ms │
│ Execution Time: 98.349 ms │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(15 rows)
but
postgres=# explain analyze select * from data where vlozeno > (select current_date);
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
╞══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Append (cost=0.01..19574.68 rows=333333 width=9) (actual time=0.095..31.945 rows=19428 loops=1) │
│ InitPlan 1 (returns $0) │
│ -> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.010..0.010 rows=1 loops=1) │
│ -> Seq Scan on data_2016 (cost=0.00..7258.98 rows=135119 width=9) (never executed) │
│ Filter: (vlozeno > $0) │
│ -> Seq Scan on data_2017 (cost=0.00..7229.20 rows=134565 width=9) (never executed) │
│ Filter: (vlozeno > $0) │
│ -> Seq Scan on data_other (cost=0.00..3419.83 rows=63649 width=9) (actual time=0.069..29.856 rows=19428 loops=1) │
│ Filter: (vlozeno > $0) │
│ Rows Removed by Filter: 171518 │
│ Planning Time: 0.418 ms │
│ Execution Time: 33.019 ms │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(12 rows)
Partition pruning is working now.
Is it expected? Tested on fresh master.
The commit message https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=499be013de65242235ebdde06adb08db887f0ea5 says that Append should be supported.
Regards
Pavel
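To make the pruning behaviour easier to spot in plans like the ones above (a hedged aside, based on standard PostgreSQL 11 EXPLAIN conventions): partitions pruned at plan time simply never appear in the plan, partitions pruned at executor startup are summarized as "Subplans Removed: N", and partitions pruned during execution are shown as "(never executed)". For example:

-- COSTS OFF trims the plan so the pruning markers stand out; the
-- "(never executed)" labels in the plan above indicate pruning that
-- happened during execution, driven by the InitPlan's Param.
EXPLAIN (ANALYZE, COSTS OFF)
SELECT * FROM data WHERE vlozeno > (SELECT current_date);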
> On 1 June 2018 at 07:19, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> Partition pruning is working now.
>
> Is it expected? Tested on fresh master.

That's interesting. So there are two cases:

* vlozeno > (select current_date) (pruning works)

* vlozeno > current_date (pruning doesn't work)

In pull_partkey_params, when we need to extract Params matching the partition
key, in the first case everything is fine: we've got an expr of type Param. In
the second case we've got a SQLValueFunction, which is ignored in the code - so
eventually we think that there is nothing matching a partition key and we don't
need to apply pruning.

With the attached hacky patch it would be taken into account (although I assume
in reality SQLValueFunction should be treated somehow differently) and pruning
is happening:

=# explain analyze select * from data where vlozeno > current_date;
                                                              QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..17223.38 rows=19512 width=9) (actual time=0.456..32.952 rows=19340 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Append  (cost=0.00..14272.18 rows=8130 width=9) (actual time=0.042..26.616 rows=6447 loops=3)
         ->  Parallel Seq Scan on data_2016  (cost=0.00..5771.19 rows=24 width=9) (never executed)
               Filter: (vlozeno > CURRENT_DATE)
         ->  Parallel Seq Scan on data_2017  (cost=0.00..5747.65 rows=23 width=9) (never executed)
               Filter: (vlozeno > CURRENT_DATE)
         ->  Parallel Seq Scan on data_other  (cost=0.00..2712.69 rows=11431 width=9) (actual time=0.032..26.031 rows=6447 loops=3)
               Filter: (vlozeno > CURRENT_DATE)
               Rows Removed by Filter: 57084
 Planning Time: 1.256 ms
 Execution Time: 35.327 ms
(13 rows)

Time: 40.291 ms
On Fri, Jun 1, 2018 at 9:47 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>> On 1 June 2018 at 07:19, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>>
>> Partition pruning is working now.
>>
>> Is it expected? Tested on fresh master.
>
> That's interesting. So there are two cases:
>
> * vlozeno > (select current_date) (pruning works)
>
> * vlozeno > current_date (pruning doesn't work)
>
> In pull_partkey_params when we need to extract Params matching partition key in
> the first case everything is fine, we've got an expr of type Param. In the
> second case we've got a SQLValueFunction, which is ignored in the code - so
> eventually we think that there is nothing matching a partition key and we don't
> need to apply pruning.
>
> With the attached hacky patch it would be taken into account (although I assume
> in reality SQLValueFunction should be treated somehow differently) and pruning
> is happening:

I think the patch is right if we were to handle only SQLValueFunction, but the
bigger picture here is that we aren't evaluating stable functions before
run-time partition pruning happens. I was under the impression that stable
functions/expressions get evaluated and folded into constants just before the
execution begins, since a stable function produces the same output for the same
input during one execution invocation. But I am not able to find where we do
that, and probably we don't do that at all. If we could do that, then it's a
matter of using the same methods as plan-time partition pruning to prune the
partitions.

If we go ahead with this patch, we should at least update it to handle stable
functions for the sake of completeness.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
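As an aside on the stable-function folding discussed above: a workaround that sometimes circulates is to wrap CURRENT_DATE in a function falsely declared IMMUTABLE, which the planner pre-evaluates to a constant so that plan-time pruning applies. A hedged sketch, not a recommendation - the volatility label is a lie, and cached plans can return stale dates:

-- Sketch only: current_date is stable, not immutable; declaring the
-- wrapper IMMUTABLE trades correctness (stale dates in cached plans)
-- for plan-time constant folding and therefore plan-time pruning.
CREATE FUNCTION today_assumed_constant() RETURNS date
    LANGUAGE sql IMMUTABLE
    AS $$ SELECT current_date $$;

EXPLAIN SELECT * FROM data WHERE vlozeno > today_assumed_constant();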
Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> writes:
> I think the patch is right if we were to handle only SQLValueFunction,
> but the bigger picture here is that we aren't evaluating stable
> functions before run-time partition pruning happens. I was under the
> impression that the stable functions/expressions get evaluated and
> folded into a constant just before the execution begins since a stable
> function produces the same output for same input during one execution
> invocation. But I am not able to find where we do that and probably we
> don't do that at all.

We don't; there was a patch floating around to make that happen, but
it hasn't been updated lately.

I agree though that it seems strange to special-case SQLValueFunction
rather than any-stable-expression.  As long as the evaluation happens
at executor start (i.e. with the query's run-time snapshot) it should
be reasonable to simplify any stable expression.

It's worth questioning whether this is a bug fix or an improvement.
If the latter, it probably ought to wait for v12.

			regards, tom lane
2018-06-01 17:53 GMT+02:00 Tom Lane <tgl@sss.pgh.pa.us>:
Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> writes:
> I think the patch is right if we were to handle only SQLValueFunction,
> but the bigger picture here is that we aren't evaluating stable
> functions before run-time partition pruning happens. I was under the
> impression that the stable functions/expressions get evaluated and
> folded into a constant just before the execution begins since a stable
> function produces the same output for same input during one execution
> invocation. But I am not able to find where we do that and probably we
> don't do that at all.
We don't; there was a patch floating around to make that happen, but
it hasn't been updated lately.
I agree though that it seems strange to special-case SQLValueFunction
rather than any-stable-expression. As long as the evaluation happens
at executor start (i.e. with the query's run-time snapshot) it should
be reasonable to simplify any stable expression.
It's worth questioning whether this is a bug fix or an improvement.
If the latter, it probably ought to wait for v12.
The result is correct - but it was an unpleasant surprise. I went looking for the simplest demo of this feature, and it doesn't work. Filtering based on CURRENT_DATE is common.
Regards
Pavel
regards, tom lane
On Fri, Jun 1, 2018 at 11:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I agree though that it seems strange to special-case SQLValueFunction
rather than any-stable-expression. As long as the evaluation happens
at executor start (i.e. with the query's run-time snapshot) it should
be reasonable to simplify any stable expression.
It's worth questioning whether this is a bug fix or an improvement.
If the latter, it probably ought to wait for v12.
If explaining the change requires reference to tokens from the source code, rather than something an end user could understand, I'd argue it is a bug fix rather than an improvement.
Cheers,
Jeff
On Sat, Jun 2, 2018 at 5:16 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Fri, Jun 1, 2018 at 11:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> I agree though that it seems strange to special-case SQLValueFunction
>> rather than any-stable-expression. As long as the evaluation happens
>> at executor start (i.e. with the query's run-time snapshot) it should
>> be reasonable to simplify any stable expression.
>>
>> It's worth questioning whether this is a bug fix or an improvement.
>> If the latter, it probably ought to wait for v12.
>
> If explaining the change requires reference to tokens from the source code,
> rather than something an end user could understand, I'd argue it is a bug
> fix rather than an improvement.

If we're going to implement stable expression folding before the actual
execution starts, that's a feature in itself. So, it's V12 material. Partition
pruning will use that feature. I don't think we should make partition pruning
work with stable expressions in some ad-hoc way in V11 and then have some
future release (most likely V12) reimplement it on top of the stable expression
folding feature. So my vote is for making it work in V12.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
Jeff Janes <jeff.janes@gmail.com> writes:
> On Fri, Jun 1, 2018 at 11:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It's worth questioning whether this is a bug fix or an improvement.
>> If the latter, it probably ought to wait for v12.

> If explaining the change requires reference to tokens from the source code,
> rather than something an end user could understand, I'd argue it is a bug
> fix rather than an improvement.

Well, the difference between volatile, stable and immutable functions
is well-documented, so I don't think that's a great argument.  If
there's some aspect of this behavior that's not predictable from
understanding which class the partition key expression falls into,
then I could agree that's a bug.

			regards, tom lane
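The volatility class Tom refers to is recorded per function in the catalogs, so it is easy to check which class the functions in a partition-key comparison fall into. For example (standard pg_proc columns):

-- provolatile is 'i' (immutable), 's' (stable) or 'v' (volatile);
-- now() is stable, random() is volatile.
SELECT proname, provolatile
FROM pg_proc
WHERE proname IN ('now', 'random');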
> On 1 June 2018 at 17:53, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> writes:
>> I think the patch is right if we were to handle only SQLValueFunction,
>> but the bigger picture here is that we aren't evaluating stable
>> functions before run-time partition pruning happens.
>
> I agree though that it seems strange to special-case SQLValueFunction
> rather than any-stable-expression. As long as the evaluation happens
> at executor start (i.e. with the query's run-time snapshot) it should
> be reasonable to simplify any stable expression.

Just to clarify for myself, for evaluating any stable function here would it be
enough to handle all function-like expressions (FuncExpr / OpExpr /
DistinctExpr / NullIfExpr) and check a corresponding function for provolatile,
like in the attached patch?
Dmitry Dolgov <9erthalion6@gmail.com> writes:
> Just to clarify for myself, for evaluating any stable function here would it be
> enough to handle all function-like expressions (FuncExpr / OpExpr /
> DistinctExpr / NullIfExpr) and check a corresponding function for provolatile,
> like in the attached patch?

I think the entire approach is wrong here.  Rather than concerning
yourself with Params, or any other specific expression type, you
should be using !contain_volatile_functions() to decide whether
an expression is run-time-constant.  If it is, use the regular
expression evaluation machinery to extract the value.

			regards, tom lane
> On 3 June 2018 at 19:11, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Dmitry Dolgov <9erthalion6@gmail.com> writes:
>> Just to clarify for myself, for evaluating any stable function here would it be
>> enough to handle all function-like expressions (FuncExpr / OpExpr /
>> DistinctExpr / NullIfExpr) and check a corresponding function for provolatile,
>> like in the attached patch?
>
> I think the entire approach is wrong here. Rather than concerning
> yourself with Params, or any other specific expression type, you
> should be using !contain_volatile_functions() to decide whether
> an expression is run-time-constant. If it is, use the regular
> expression evaluation machinery to extract the value.

Yes, it makes sense. Then, to my understanding, the attached code is close to
what was described above (although I'm not sure about the Const part).
Hi Dmitry,

Thanks for creating the patch. I looked at it and have some comments.

On 2018/06/04 22:30, Dmitry Dolgov wrote:
>> On 3 June 2018 at 19:11, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Dmitry Dolgov <9erthalion6@gmail.com> writes:
>>> Just to clarify for myself, for evaluating any stable function here would it be
>>> enough to handle all function-like expressions (FuncExpr / OpExpr /
>>> DistinctExpr / NullIfExpr) and check a corresponding function for provolatile,
>>> like in the attached patch?
>>
>> I think the entire approach is wrong here. Rather than concerning
>> yourself with Params, or any other specific expression type, you
>> should be using !contain_volatile_functions() to decide whether
>> an expression is run-time-constant. If it is, use the regular
>> expression evaluation machinery to extract the value.
>
> Yes, it makes sense. Then, to my understanding, the attached code is close to
> what was described above (although I'm not sure about the Const part).

This:

@@ -2727,6 +2730,13 @@ pull_partkey_params(PartitionPruneInfo *pinfo, List *steps)
 			}
 			gotone = true;
 		}
+		else if (!IsA(expr, Const))
+		{
+			Param	   *param = (Param *) expr;
+			pinfo->execparams = bms_add_member(pinfo->execparams,
+											   param->paramid);
+			gotone = true;
+		}

doesn't look quite right. What says expr is really a Param? The patch appears
to work because, by setting pinfo->execparams to *something*, it triggers
execution-time pruning to run; its contents aren't necessarily used during
execution pruning. In fact, it would've crashed if the execution-time pruning
code had required execparams to contain *valid* param ids, but currently it
doesn't.

What I think we'd need to do to make this work is to make execution-time
pruning be invoked even if there aren't any Params involved. IOW, let's try to
teach make_partition_pruneinfo that it can go ahead also in the cases where
there are expressions being compared with the partition key that contain (only)
stable functions. Then, go and fix the execution-pruning code to not *always*
expect there to be Params to prune with.

Maybe, David (added to cc) has something to say about all this, especially
whether he considers this a feature and not a bug fix.

Thanks,
Amit
> On 5 June 2018 at 12:31, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
> doesn't look quite right. What says expr is really a Param? The patch
> appears to work because, by setting pinfo->execparams to *something*, it
> triggers execution-time pruning to run; its contents aren't necessarily
> used during execution pruning. In fact, it would've crashed if the
> execution-time pruning code had required execparams to contain *valid*
> param id, but currently it doesn't.
>
> What I think we'd need to do to make this work is to make execution-time
> pruning be invoked even if there aren't any Params involved. IOW, let's
> try to teach make_partition_pruneinfo that it can go ahead also in the
> cases where there are expressions being compared with the partition key
> that contain (only) stable functions. Then, go and fix the
> execution-pruning code to not *always* expect there to be Params to prune
> with.

Yeah, I agree - I copied this approach mindlessly from the original hacky
patch. So, it looks like it's necessary to have something like got_stable_expr
together with gotparam.

And after that the only place where I see Params in use is
partkey_datum_from_expr, where all the stuff is actually evaluated. So
apparently this part about "fix the execution-pruning code to not *always*
expect there to be Params to prune with" will only be about this function - am
I correct, or is there something else that I missed?
On Tue, Jun 5, 2018 at 6:24 PM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>> On 5 June 2018 at 12:31, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>
>> doesn't look quite right. What says expr is really a Param? The patch
>> appears to work because, by setting pinfo->execparams to *something*, it
>> triggers execution-time pruning to run; its contents aren't necessarily
>> used during execution pruning. In fact, it would've crashed if the
>> execution-time pruning code had required execparams to contain *valid*
>> param id, but currently it doesn't.
>>
>> What I think we'd need to do to make this work is to make execution-time
>> pruning be invoked even if there aren't any Params involved. IOW, let's
>> try to teach make_partition_pruneinfo that it can go ahead also in the
>> cases where there are expressions being compared with the partition key
>> that contain (only) stable functions. Then, go and fix the
>> execution-pruning code to not *always* expect there to be Params to prune
>> with.
>
> Yeah, I agree - I copied this approach mindlessly from the original hacky
> patch. So, looks like it's necessary to have something like got_stable_expr
> together with gotparam.

I think the current code is relying heavily on Params being present for
partition pruning, which isn't necessary. Runtime partition pruning is possible
when there are comparison conditions with partition key expressions on one side
and "execution time constant" expressions on the other side. By "execution time
constant" expression, I mean any expression that evaluates to a constant at the
time of execution, like a stable expression (not just a function) or a Param
expression. I can think of only these two at this time, but there can be more.
So, gotparam should be renamed to "gotprunable_cond" to be generic.
pull_partkey_params() should be renamed "pull_partkey_conds" or something
generic. That function would return true if there exists an expression in the
steps which can be evaluated to a constant at runtime, otherwise it returns
false. My guess is there will be false positives which need to be dealt with
later, but there will be no false negatives.

> And after that the only place where I see Params
> are in use is partkey_datum_from_expr where all the stuff is actually
> evaluated. So apparently this part about "fix the execution-pruning code to not
> *always* expect there to be Params to prune with" will be only about this
> function - am I correct or there is something else that I missed?

Yes. But I think trying to evaluate parameters in this function is not good.
The approach of folding constant expressions before or immediately after the
execution starts doesn't require the expressions to be evaluated in
partkey_datum_from_expr and might benefit other places where stable expressions
or params can appear. Another problem with partkey_datum_from_expr() seems to
be that it evaluates only Param nodes but not the expressions involving
parameters which can be folded into constants at runtime. Take for example the
following queries on table t1 with two partitions (0, 100) and (100, 200),
populated using "insert into t1 select i, i from generate_series(0, 199) i;".
There's an index on t1(a).

explain analyze select * from t1 x left join t1 y on x.a = y.b where y.a = 5;
                                                  QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..6.78 rows=1 width=16) (actual time=0.033..0.066 rows=1 loops=1)
   ->  Append  (cost=0.00..2.25 rows=1 width=8) (actual time=0.019..0.035 rows=1 loops=1)
         ->  Seq Scan on t1p1 y  (cost=0.00..2.25 rows=1 width=8) (actual time=0.018..0.035 rows=1 loops=1)
               Filter: (a = 5)
               Rows Removed by Filter: 99
   ->  Append  (cost=0.00..4.51 rows=2 width=8) (actual time=0.011..0.027 rows=1 loops=1)
         ->  Seq Scan on t1p1 x  (cost=0.00..2.25 rows=1 width=8) (actual time=0.006..0.022 rows=1 loops=1)
               Filter: (y.b = a)
               Rows Removed by Filter: 99
         ->  Seq Scan on t1p2 x_1  (cost=0.00..2.25 rows=1 width=8) (never executed)
               Filter: (y.b = a)
 Planning Time: 0.644 ms
 Execution Time: 0.115 ms
(13 rows)

t1p2 x_1 is never scanned, indicating that run-time partition pruning happened.
But then see the following query:

postgres:17889=# explain analyze select * from t1 x left join t1 y on x.a = y.b + 100 where y.a = 5;
                                                   QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..7.28 rows=1 width=16) (actual time=0.055..0.093 rows=1 loops=1)
   ->  Append  (cost=0.00..2.25 rows=1 width=8) (actual time=0.017..0.034 rows=1 loops=1)
         ->  Seq Scan on t1p1 y  (cost=0.00..2.25 rows=1 width=8) (actual time=0.016..0.033 rows=1 loops=1)
               Filter: (a = 5)
               Rows Removed by Filter: 99
   ->  Append  (cost=0.00..5.01 rows=2 width=8) (actual time=0.034..0.054 rows=1 loops=1)
         ->  Seq Scan on t1p1 x  (cost=0.00..2.50 rows=1 width=8) (actual time=0.026..0.026 rows=0 loops=1)
               Filter: ((y.b + 100) = a)
               Rows Removed by Filter: 100
         ->  Seq Scan on t1p2 x_1  (cost=0.00..2.50 rows=1 width=8) (actual time=0.007..0.027 rows=1 loops=1)
               Filter: ((y.b + 100) = a)
               Rows Removed by Filter: 99
 Planning Time: 0.424 ms
 Execution Time: 0.139 ms
(14 rows)

The scan on t1p1 x returns no rows and should have been pruned, since y.b + 100
is constant for a given y.b. But for this to work, folding constant expressions
once doesn't help, since y.b changes with every rescan of t1 x. So maybe we
need some way to constant-fold expressions during ExecutorRewind() as well.
This is a digression from the original report, but it's still within the scope
of "why partition pruning doesn't work?"

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 5 June 2018 at 22:31, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Maybe, David (added to cc) has something to say about all this, especially
> whether he considers this a feature and not a bug fix.

Thanks, Amit. I had missed this thread.

Yeah. I admit if I'd thought about this case when I wrote the code, then I'd
have made any non-volatile Expr work, but I didn't :-(

It was pointed out to me a few months ago in a comment in [1]. I initially
thought that this was v12 material, but it seems there are a few people here
that are pretty unhappy about it.

I was going to describe what such a patch should look like here, but that
seemed like about as much work as writing it, so: Please see the attached
patch. I've only just finished with it and it's not fully done yet as there's
still an XXX comment where I've not quite thought about Exprs with Vars from
higher levels. These might always be converted to Params, so the code might be
okay as is, but I've not checked this yet, hence the comment remains.

I'm slightly leaning towards this being included in v11. Without this people
are forced into hacks like WHERE partkey = (SELECT stablefunc()); to get
pruning working at all. If that SQL remains after this patch then pruning can
only take place during actual execution. With the attached patch the pruning
can take place during the initialization of the executor, which in cases with
many partitions can be significantly faster, providing actual execution is
short. I'd rather people didn't get into bad habits like that if we can avoid
it.

[1] https://blog.2ndquadrant.com/partition-elimination-postgresql-11/

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
2018-06-05 17:07 GMT+02:00 David Rowley <david.rowley@2ndquadrant.com>:
On 5 June 2018 at 22:31, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Maybe, David (added to cc) has something to say about all this, especially
> whether he considers this a feature and not a bug fix.
Thanks, Amit. I had missed this thread.
Yeah. I admit if I'd thought about this case when I wrote the code,
then I'd have made any non-volatile Expr work, but I didn't :-(
It was pointed out to be a few months ago in a comment in [1]. I
initially thought that this was v12 material, but it seems there are a
few people here that are pretty unhappy about it.
I was going to describe what such a patch should look like here, but
that seemed like about as much work as writing it, so:
Please see the attached patch. I've only just finished with it and
it's not fully done yet as there's still an XXX comment where I've not
quite thought about Exprs with Vars from higher levels. These might
always be converted to Params, so the code might be okay as is, but
I've not checked this yet, hence the comment remains.
I'm slightly leaning towards this being included in v11. Without this
people are forced into hacks like WHERE partkey = (SELECT
stablefunc()); to get pruning working at all. If that SQL remains
after this patch then pruning can only take place during actual
execution. With the attached patch the pruning can take place during
the initialization of the executor, which in cases with many
partitions can be significantly faster, providing actual execution is
short. I'd rather people didn't get into bad habits like that if we
can avoid it.
This is really great
Regards
Pavel
[1] https://blog.2ndquadrant.com/partition-elimination-postgresql-11/
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 6 June 2018 at 03:07, David Rowley <david.rowley@2ndquadrant.com> wrote:
> Please see the attached patch. I've only just finished with it and
> it's not fully done yet as there's still an XXX comment where I've not
> quite thought about Exprs with Vars from higher levels. These might
> always be converted to Params, so the code might be okay as is, but
> I've not checked this yet, hence the comment remains.

I looked at this again today and decided that the XXX comment could just be
removed. I also changed contains_only_safeparams into contains_unsafeparams
and reversed the condition.

I then decided that I didn't like the way we need to check which params are in
the Expr each time we call partkey_datum_from_expr. It seems better to prepare
this in advance when building the pruning steps. I started work on that, but
soon realised that I'd need to pass a List of Bitmapsets to the executor. This
is a problem as Bitmapset is not a Node type and cannot be copied with
COPY_NODE_FIELD(). Probably this could be refactored so that instead of
passing 3 Lists in the PartitionPruneStepOp we could invent a new node type
that just has 3 fields and store a single List. Anyway, I didn't do that
because I'm not sure what the fate of this patch is going to be. I'd offer to
change things around to add a new Node type, but I don't really want to work
on that now if this is v12 material.

I've attached a cleaned up version of the patch I posted yesterday.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2018/06/06 14:10, David Rowley wrote:
> I then decided that
> I didn't like the way we need to check which params are in the Expr
> each time we call partkey_datum_from_expr. It seems better to prepare
> this in advance when building the pruning steps. I started work on
> that, but soon realised that I'd need to pass a List of Bitmapsets to
> the executor. This is a problem as Bitmapset is not a Node type and
> cannot be copied with COPY_NODE_FIELD(). Probably this could be
> refactored to instead of passing 3 Lists in the PartitionPruneStepOp
> we could invent a new node type that just has 3 fields and store a
> single List.

I wonder why we need to create those Bitmapsets in the planner? Why not in
ExecSetupPartitionPruneState()? For example, like how context->exprstates is
initialized.

Thanks,
Amit
On 6 June 2018 at 18:05, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/06/06 14:10, David Rowley wrote:
>> I then decided that
>> I didn't like the way we need to check which params are in the Expr
>> each time we call partkey_datum_from_expr. It seems better to prepare
>> this in advance when building the pruning steps. I started work on
>> that, but soon realised that I'd need to pass a List of Bitmapsets to
>> the executor. This is a problem as Bitmapset is not a Node type and
>> cannot be copied with COPY_NODE_FIELD(). Probably this could be
>> refactored to instead of passing 3 Lists in the PartitionPruneStepOp
>> we could invent a new node type that just has 3 fields and store a
>> single List.
>
> I wonder why we need to create those Bitmapsets in the planner? Why not
> in ExecSetupPartitionPruneState()? For example, like how
> context->exprstates is initialized.

That seems like a good idea. Certainly much better than working them out each
time we prune.

v3 patch attached.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2018/06/06 18:52, David Rowley wrote:
> On 6 June 2018 at 18:05, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2018/06/06 14:10, David Rowley wrote:
>>> I then decided that
>>> I didn't like the way we need to check which params are in the Expr
>>> each time we call partkey_datum_from_expr. It seems better to prepare
>>> this in advance when building the pruning steps. I started work on
>>> that, but soon realised that I'd need to pass a List of Bitmapsets to
>>> the executor. This is a problem as Bitmapset is not a Node type and
>>> cannot be copied with COPY_NODE_FIELD(). Probably this could be
>>> refactored to instead of passing 3 Lists in the PartitionPruneStepOp
>>> we could invent a new node type that just has 3 fields and store a
>>> single List.
>>
>> I wonder why we need to create those Bitmapsets in the planner? Why not
>> in ExecSetupPartitionPruneState()? For example, like how
>> context->exprstates is initialized.
>
> That seems like a good idea. Certainly much better than working them
> out each time we prune.
>
> v3 patch attached.

Thanks David. This one looks good. I also like it that hasparamlessexprs is
no longer determined and set in the planner.

I checked what happens with the cases that Ashutosh complained about upthread,
and it seems that the pruning works as expected.

create table t1 (a int, b int) partition by range (a);
create table t1p1 partition of t1 for values from (0) to (100);
create table t1p2 partition of t1 for values from (100) to (200);
create index on t1 (a);
insert into t1 select i, i from generate_series(0, 199) i;

explain (costs off, analyze) select * from t1 x left join t1 y on x.a = y.b + 100 where y.a = 5;
                                           QUERY PLAN
-----------------------------------------------------------------------------------------------
 Nested Loop (actual time=0.294..0.371 rows=1 loops=1)
   ->  Append (actual time=0.067..0.092 rows=1 loops=1)
         ->  Bitmap Heap Scan on t1p1 y (actual time=0.049..0.059 rows=1 loops=1)
               Recheck Cond: (a = 5)
               Heap Blocks: exact=1
               ->  Bitmap Index Scan on t1p1_a_idx (actual time=0.022..0.022 rows=1 loops=1)
                     Index Cond: (a = 5)
   ->  Append (actual time=0.192..0.219 rows=1 loops=1)
         ->  Index Scan using t1p1_a_idx on t1p1 x (never executed)
               Index Cond: (a = (y.b + 100))
         ->  Index Scan using t1p2_a_idx on t1p2 x_1 (actual time=0.134..0.145 rows=1 loops=1)
               Index Cond: (a = (y.b + 100))
 Planning Time: 5.314 ms
 Execution Time: 0.938 ms
(14 rows)

Note that the condition x.a = y.b + 100 is able to prune t1p1, whereas on HEAD
it isn't.

Thanks,
Amit
On 7 June 2018 at 14:51, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Thanks David. This one looks good. I also like it that hasparamlessexprs
> is no longer determined and set in the planner.

Thanks for checking it.

> I checked what happens with the cases that Ashutosh complained about
> upthread and seems that the pruning works as expected.

[...]

> explain (costs off, analyze) select * from t1 x left join t1 y on x.a =
> y.b + 100 where y.a = 5;

Yeah, I added a test to partition_prune.sql that verifies a similar case.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jun 7, 2018 at 8:51 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 7 June 2018 at 14:51, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Thanks David. This one looks good. I also like it that hasparamlessexprs
>> is no longer determined and set in the planner.
>
> Thanks for checking it.
>
>> I checked what happens with the cases that Ashutosh complained about
>> upthread and seems that the pruning works as expected.
>
> [...]
>
>> explain (costs off, analyze) select * from t1 x left join t1 y on x.a =
>> y.b + 100 where y.a = 5;
>
> Yeah, I added a test to partition_prune.sql that verifies a similar case.

Thanks for taking care of that.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
David Rowley <david.rowley@2ndquadrant.com> writes:
> On 6 June 2018 at 18:05, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I wonder why we need to create those Bitmapsets in the planner? Why not
>> in ExecSetupPartitionPruneState()? For example, like how
>> context->exprstates is initialized.

> That seems like a good idea. Certainly much better than working them
> out each time we prune.

> v3 patch attached.

Maybe there's something I'm missing here, but I sort of hoped that this
patch would nuke all the special-case code for Params in this area.
Why is there any need to distinguish them from other stable expressions?

IOW, I was hoping for the code to end up simpler, not more complicated.

			regards, tom lane
On 8 June 2018 at 03:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Maybe there's something I'm missing here, but I sort of hoped that this
> patch would nuke all the special-case code for Params in this area.
> Why is there any need to distinguish them from other stable expressions?
>
> IOW, I was hoping for the code to end up simpler, not more complicated.

We need to know which Params exist in the Expr because, if there are no
Params, or only external Params, then we can run-time prune during startup of
the executor. Otherwise, we must leave the pruning until during execution.

I really don't want to say goodbye to that optimisation as it's a significant
win to save having to initialise the subnodes for all the useless partitions
for OLTP type queries.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
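A hedged illustration of the OLTP case David describes, using the thread's original table: with a prepared statement, the comparison value arrives as an external Param, so once the plan cache settles on a generic plan (typically after five executions), pruning can happen at executor startup instead of being repeated during execution:

-- Sketch: the external Param $1 is known at executor startup, so unneeded
-- partitions can be pruned before their subnodes are initialised
-- (reported as "Subplans Removed: N" in EXPLAIN ANALYZE output).
PREPARE recent_rows(date) AS
    SELECT * FROM data WHERE vlozeno > $1;
EXPLAIN (ANALYZE, COSTS OFF) EXECUTE recent_rows('2018-06-01');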
David Rowley <david.rowley@2ndquadrant.com> writes:
> On 8 June 2018 at 03:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Maybe there's something I'm missing here, but I sort of hoped that this
>> patch would nuke all the special-case code for Params in this area.
>> Why is there any need to distinguish them from other stable expressions?

> We need to know which Params exist in the Expr as if there are no
> Params, or only external Params, then we can run-time prune during
> startup of the executor.

This does not refute my question.  Why doesn't the same logic apply
to any stable expression?  That is, ISTM a Param is a special case
of that.

			regards, tom lane
On Fri, Jun 8, 2018 at 8:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> David Rowley <david.rowley@2ndquadrant.com> writes:
>> On 8 June 2018 at 03:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Maybe there's something I'm missing here, but I sort of hoped that this
>>> patch would nuke all the special-case code for Params in this area.
>>> Why is there any need to distinguish them from other stable expressions?
>
>> We need to know which Params exist in the Expr as if there are no
>> Params, or only external Params, then we can run-time prune during
>> startup of the executor.
>
> This does not refute my question. Why doesn't the same logic apply
> to any stable expression? That is, ISTM a Param is a special case
> of that.

+1. I don't think we need to perform pruning at the start of execution, but we
could fold all the stable expressions to constants at that time. The
PARAM_EXECs cannot be folded into constants at execution start since they are
not assigned any values yet. AFAIU, expressions within a given node that
involve those parameters can be folded into constants (if possible) during
ExecRewind() on that node. We have to perform pruning just before the
(Merge)Append node scan starts.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 8 June 2018 at 15:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> David Rowley <david.rowley@2ndquadrant.com> writes:
>> On 8 June 2018 at 03:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Maybe there's something I'm missing here, but I sort of hoped that this
>>> patch would nuke all the special-case code for Params in this area.
>>> Why is there any need to distinguish them from other stable expressions?
>
>> We need to know which Params exist in the Expr as if there are no
>> Params, or only external Params, then we can run-time prune during
>> startup of the executor.
>
> This does not refute my question. Why doesn't the same logic apply
> to any stable expression? That is, ISTM a Param is a special case
> of that.

Okay, maybe we don't need to know which external params exist, but we need to
know if there are any exec params so that we don't try to evaluate an
expression with any of those during executor startup.

I'll produce a patch which simplifies things in that area.

Thanks for looking at this.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 8 June 2018 at 18:14, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 8 June 2018 at 15:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> David Rowley <david.rowley@2ndquadrant.com> writes:
>>> On 8 June 2018 at 03:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> Maybe there's something I'm missing here, but I sort of hoped that this
>>>> patch would nuke all the special-case code for Params in this area.
>>>> Why is there any need to distinguish them from other stable expressions?
>>
>>> We need to know which Params exist in the Expr as if there are no
>>> Params, or only external Params, then we can run-time prune during
>>> startup of the executor.
>>
>> This does not refute my question. Why doesn't the same logic apply
>> to any stable expression? That is, ISTM a Param is a special case
>> of that.
>
> Okay, maybe we don't need to know which external params exist, but we
> need to know if there are any exec params so that we don't try to
> evaluate an expression with any of those during executor startup.
>
> I'll produce a patch which simplifies things in that area.

Okay, I've gotten rid of the tracking of external params. We now just track
exec params. We still need to know about these so we know if a re-prune is
required during ExecReScanAppend(). Obviously, we don't want to prune on any
random Param change, so I'm fairly sure it's a good idea to keep track of
these.

I've changed the code inside partkey_datum_from_expr so that it's a simple
bool array lookup to decide if we can evaluate the expression or not. This
bool array is populated during planning, which I think is rather nice so we
don't have to go and do it each time the plan is executed.

I also discovered that I was needlessly running the pruning code again during
executor run in some cases where there was no possibility of doing any further
pruning there. I've had to add some new code to set the present_parts inside
ExecFindInitialMatchingSubPlans(). It now properly removes the member of any
sub-partitions which have had all of their partitions pruned. This allows us
just to use 'present_parts' to calculate the subnodes from, rather than going
and calling the pruning code again.

Technically PartitionPruningData does not really need the do_exec_prune field.
A non-empty execparams could indicate this, but I felt it was better to have
the bool so that we have one for each method of run-time pruning. This also
saves a bms_is_empty() call inside find_subplans_for_params_recurse(). This
could be a bit of a hotspot during parameterized nested loops which cause
partition pruning.

I'm really hoping this is what you meant about the special-case code for
Params. Does this look any better?

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
David Rowley <david.rowley@2ndquadrant.com> writes:
> I'm really hoping this is what you meant about the special-case code for Params.
> Does this look any better?

I'm starting to look this over and it seems like generally the right
thing, though I'm finding minor things I don't like much.

One thing I'm wondering about is why in the world are PartitionPruneInfo
and its subsidiary struct types declared in primnodes.h?  They are not
general-purpose expression nodes, or if they are then there are an awful
lot of places that should know about them and do not.  AFAICT they
might belong in plannodes.h.

			regards, tom lane
I wrote:
> One thing I'm wondering about is why in the world are PartitionPruneInfo
> and its subsidiary struct types declared in primnodes.h?

Oh, and while I'm bitching: it seems like there is hardly any part of
the partitioning code in which the comments aren't desperately in need
of a copy-editing pass.  They are just chock-full of misspellings,
grammar that is faulty enough to make the meaning unclear, and/or
errors of fact.  An example of the latter is the repeated claims that
the basic partitioning functions belong to the planner.  Maybe that
was true at some stage of development; but AFAICS the logic in question
now lives in src/backend/partitioning/, which I would not think is
part of the planner.

			regards, tom lane
On 10 June 2018 at 09:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> One thing I'm wondering about is why in the world are PartitionPruneInfo
>> and its subsidiary struct types declared in primnodes.h?

That may have been a legacy thing that accidentally wasn't changed from a
previous version of the patch. I've now moved it to plannodes.h.

> Oh, and while I'm bitching: it seems like there is hardly any part of
> the partitioning code in which the comments aren't desperately in need
> of a copy-editing pass. They are just chock-full of misspellings,
> grammar that is faulty enough to make the meaning unclear, and/or
> errors of fact. An example of the latter is the repeated claims that
> the basic partitioning functions belong to the planner. Maybe that
> was true at some stage of development; but AFAICS the logic in question
> now lives in src/backend/partitioning/, which I would not think is
> part of the planner.

I've made a pass over the execPartition.c and partprune.c code attempting to
resolve these issues. I have hopefully fixed them all, but I apologise if
I've missed any.

I also couldn't resist making a few other improvements to the code.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
David Rowley <david.rowley@2ndquadrant.com> writes:
> I've made a pass over the execPartition.c and partprune.c code
> attempting to resolve these issues. I have hopefully fixed them all,
> but I apologise if I've missed any.
> I also couldn't resist making a few other improvements to the code.

By the time this arrived, I'd already whacked around your v4 patch quite
a bit, so rather than start over I just kept going with what I had, and
then tried to merge the useful bits of this one after the fact.  I
intentionally left out a couple of changes I couldn't get excited about
(such as having find_subplans_for_params_recurse return a count), but
I think I got everything in v5 otherwise.

I'm still fairly unhappy about the state of the comments, though.
It's very unclear for example what the subplan_map and subpart_map
arrays really are, eg what are they indexed by?  I get the impression
that only one of them can have a non-minus-1 value for a given index,
but that's nowhere explained.  Also, we have

 *		partprunedata		Array of PartitionPruningData for the plan's target
 *							partitioned relation. First element contains the
 *							details for the target partitioned table.

And?  What are the other elements, what's the index rule, is there a
specific ordering for the other elements?  For that matter, "target
partitioned table" is content-free.  Do you mean topmost partitioned
table?  I suspect we expect the hierarchy to be flattened such that
ancestors appear before children, but that's not stated --- and if it
were, this bit about the first element would be a consequence of that.

Code-wise, there are some loose ends to be looked at.

* Related to the above, doesn't the loop at execPartition.c:1658 need
to work back-to-front?  Seems like it's trying to propagate info upwards
in the hierarchy; looking at a subpartition's present_parts value when
you still might change it later doesn't look right at all.

* partkey_datum_from_expr and its caller seem pretty brain-dead with
respect to nulls.  It's not even considering the possibility that a
Const has constisnull = true.  Now perhaps such a case can't reach here
because of plan-time constant-folding, but I don't think this code has
any business assuming that.  It's also being very stupid about null
values from expressions, just throwing up its hands and supposing that
nothing can be proven.  In reality, doesn't a null guarantee we can
eliminate all partitions, since the comparison functions are presumed
strict?

* I'm fairly suspicious of the fmgr_info and fmgr_info_copy calls in
perform_pruning_base_step.  Those seem likely to leak memory, and
for sure they destroy any opportunity for the called comparison
function to cache info in fn_extra --- something that's critical
for performance for some comparison functions such as array_eq.
Why is it necessary to suppose that those functions could change
from one execution to the next?

* The business with ExecFindInitialMatchingSubPlans remapping the
subplan indexes seems very dubious to me.  Surely, that is adding
way more complexity and fragility than it's worth.

			regards, tom lane
Thanks for working on and pushing those fixes.

On 11 June 2018 at 10:49, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It's very unclear for example what the subplan_map and subpart_map
> arrays really are, eg what are they indexed by?  I get the impression
> that only one of them can have a non-minus-1 value for a given index,
> but that's nowhere explained.  Also, we have

They're indexed by the partition indexes as returned by the partition pruning
code. I've included a patch to fix this.

> *		partprunedata		Array of PartitionPruningData for the plan's target
> *							partitioned relation. First element contains the
> *							details for the target partitioned table.
>
> And?  What are the other elements, what's the index rule, is there a
> specific ordering for the other elements?  For that matter, "target
> partitioned table" is content-free.  Do you mean topmost partitioned
> table?  I suspect we expect the hierarchy to be flattened such that
> ancestors appear before children, but that's not stated --- and if it
> were, this bit about the first element would be a consequence of that.

I've hopefully improved this comment in the attached patch. Although we may
need to delay this until [1] has been addressed. Only the index of the first
element is important. Looking at add_paths_to_append_rel(), the parents will
always be listed before their children. That's not very well documented
either, but having the top-level parent as the first element is relied upon
for more than runtime partition pruning.

> Code-wise, there are some loose ends to be looked at.
>
> * Related to the above, doesn't the loop at execPartition.c:1658 need
> to work back-to-front?  Seems like it's trying to propagate info upwards
> in the hierarchy; looking at a subpartition's present_parts value when
> you still might change it later doesn't look right at all.

It works just fine front-to-back. That's why I added the 2nd loop. It
propagates the work done in the first loop, so the code, as it is now, works
if the array is in any order. It may be possible to run the first loop
back-to-front and get rid of the 2nd one. I've done this in the attached
patch. I felt less confident doing this earlier as the order of that array
was not defined well.

> * partkey_datum_from_expr and its caller seem pretty brain-dead with
> respect to nulls.  It's not even considering the possibility that a
> Const has constisnull = true.  Now perhaps such a case can't reach
> here because of plan-time constant-folding, but I don't think this code
> has any business assuming that.  It's also being very stupid about
> null values from expressions, just throwing up its hands and supposing
> that nothing can be proven.  In reality, doesn't a null guarantee we
> can eliminate all partitions, since the comparison functions are
> presumed strict?

You are right. I admit to not having thought more about that and just taking
the safe option. It makes sense to fix it. I imagine it could be a bit
annoying that pruning just gives up in such a case. I've added code for this
in the attached patch.

> * I'm fairly suspicious of the fmgr_info and fmgr_info_copy calls in
> perform_pruning_base_step.  Those seem likely to leak memory, and
> for sure they destroy any opportunity for the called comparison
> function to cache info in fn_extra --- something that's critical
> for performance for some comparison functions such as array_eq.
> Why is it necessary to suppose that those functions could change
> from one execution to the next?

IIRC the problem there was that we didn't think of a good place to cache the
FmgrInfo. But rethinking of that now, we could probably use the same method
as I used in run-time pruning to cache the exprstate. Amit, what do you
think?

> * The business with ExecFindInitialMatchingSubPlans remapping the
> subplan indexes seems very dubious to me.  Surely, that is adding
> way more complexity and fragility than it's worth.

It's all contained in a single .c file. It does not seem that hard to get it
right. Nothing outside of execPartition.c has any business changing anything
in these arrays. Nothing else even has to look at them.

I think having the ability to prune during executor initialisation is
enormously important. I think without it, this patch is less than half as
useful. However, if you didn't mean removing the executor initialise pruning,
and just not remapping the arrays, then I'm not sure how we'd do that. I'd
previously tried having NULL subnodes in the Append and I thought it required
too much hacking work in explain.c, so I decided to just collapse the array
instead and do what was required to make it work after having removed the
unneeded subplans. Did you have another idea on how to do this?

[1] https://www.postgresql.org/message-id/CAKJS1f9KG5nnOFhigWm4ND5Ut-yU075iJyA%2BACVyQnZ-ZW1Qtw%40mail.gmail.com

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
David Rowley <david.rowley@2ndquadrant.com> writes:
> On 11 June 2018 at 10:49, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> * The business with ExecFindInitialMatchingSubPlans remapping the
>> subplan indexes seems very dubious to me.  Surely, that is adding
>> way more complexity and fragility than it's worth.

> I think having the ability to prune during executor initialisation is
> enormously important. I think without it, this patch is less than half
> as useful.

Sure.

> However, if you didn't mean removing the executor
> initialise pruning, and just not remapping the arrays, then I'm not
> sure how we'd do that. I'd previously tried having NULL subnodes in
> the Append and I thought it required too much hacking work in
> explain.c, so I decided to just collapse the array instead and do what
> was required to make it work after having removed the unneeded
> subplans. Did you have another idea on how to do this?

No, that was pretty much exactly what I was envisioning.  I have
not looked at the consequences for explain.c but it seemed like
it couldn't be all that bad; and to my mind the remapping business
in the partition code is pretty bad.  "It's all in one file" is not
a suitable justification for unintelligible, overcomplex code.

			regards, tom lane
On 12 June 2018 at 02:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> However, if you didn't mean removing the executor
>> initialise pruning, and just not remapping the arrays, then I'm not
>> sure how we'd do that. I'd previously tried having NULL subnodes in
>> the Append and I thought it required too much hacking work in
>> explain.c, so I decided to just collapse the array instead and do what
>> was required to make it work after having removed the unneeded
>> subplans. Did you have another idea on how to do this?
>
> No, that was pretty much exactly what I was envisioning. I have
> not looked at the consequences for explain.c but it seemed like
> it couldn't be all that bad; and to my mind the remapping business
> in the partition code is pretty bad. "It's all in one file" is not
> a suitable justification for unintelligible, overcomplex code.

By all means, please have a look then. I've been down that route. I didn't
like it. I don't particularly think the memory fragmentation is very good
grounds for a good cache hit ratio either, not to mention the slowdown of
bms_next_member when there are large gaps in the set. Keep in mind that we
may scan the Append many times over when it's on the inside of a nested loop
join.

What you're proposing exchanges logic that fits neatly into one function for
special logic that will be scattered all over explain.c. I really don't think
that's the better of the two evils.

I'd rather just see my last patch applied, which simplifies the re-map code by
removing the 2nd loop. Actually, even updating the present_parts to remove
the unneeded subpartitioned table indexes is optional. We'd just need to give
up on the elog(ERROR, "partition missing from subplans"); error and assume
that case is fine.

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
David Rowley <david.rowley@2ndquadrant.com> writes: > On 12 June 2018 at 02:26, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> ... I'd previously tried having NULL subnodes in >>> the Append and I thought it required too much hacking work in >>> explain.c, >> No, that was pretty much exactly what I was envisioning. > What you're proposing exchanges logic that fits neatly into one > function for special logic that will be scattered all over explain.c. > I really don't think that's the better of the two evils. As far as I can see, it'd involve about three or four lines' worth of rewrite in one place-you-already-made-quite-ugly in explain.c, and an added if-test in planstate_walk_members, and that'd be it. That seems like a pretty cheap price for being able to vastly simplify the invariants for the pruning functions. In fact, I doubt you'd even need two of them anymore; just one with a bool flag for can-use-exec-params. > I'd rather just see my last patch applied which simplifies the re-map > code by removing the 2nd loop. Actually, even updating the > present_parts to remove the unneeded subpartitioned table indexes is > optional. We'd just need to give up on the elog(ERROR, "partition > missing from subplans"); error and assume that case is fine. The fact that you added that loop and only later decided it was unnecessary seems to me to support my point pretty strongly: that code is overcomplicated. regards, tom lane
I wrote: > * I'm fairly suspicious of the fmgr_info and fmgr_info_copy calls in > perform_pruning_base_step. Those seem likely to leak memory, and > for sure they destroy any opportunity for the called comparison > function to cache info in fn_extra --- something that's critical > for performance for some comparison functions such as array_eq. > Why is it necessary to suppose that those functions could change > from one execution to the next? After looking closer, that code isn't just inefficient, it's flat out broken. The reason is that ExecSetupPartitionPruneState thinks it can store some pointers into the target relation's relcache entry in the PartitionPruneContext, and then continue to use those pointers after closing the relcache entry. Nope, you can't do that. The only reason this code has appeared to pass buildfarm testing is that when we do

if (cmpfn != context->partsupfunc[keyno].fn_oid)
	fmgr_info(cmpfn, &partsupfunc[keyno]);
else ...

if the relcache entry that context->partsupfunc is pointing into has been freed (and overwritten thanks to CLOBBER_FREED_MEMORY), then the function OID comparison generally fails so that we do a fresh fmgr_info call. In the field it's quite likely that we'd accept and attempt to use a partially-clobbered FmgrInfo; but it would only happen if a relcache flush had caused the data to get released, so it could be awhile before anybody saw it happen, let alone reproduced it enough to figure it out. It's easy to demonstrate that there's a problem if you instrument this code to log when the OID comparison fails, and then run the regression tests with -DRELCACHE_FORCE_RELEASE: you get lots of reports like

2018-06-11 18:01:28.686 EDT [16734] LOG: replace partsupfunc 2139062143 with 351
2018-06-11 18:01:28.686 EDT [16734] LOG: replace partsupfunc 2139062143 with 351
2018-06-11 18:01:28.686 EDT [16734] LOG: replace partsupfunc 2139062143 with 351
2018-06-11 18:01:28.686 EDT [16734] LOG: replace partsupfunc 2139062143 with 351
2018-06-11 18:01:28.686 EDT [16734] LOG: replace partsupfunc 2139062143 with 351
2018-06-11 18:01:28.707 EDT [16734] LOG: replace partsupfunc 2139062143 with 351
2018-06-11 18:01:28.707 EDT [16734] LOG: replace partsupfunc 2139062143 with 351

showing that context->partsupfunc has been overwritten by CLOBBER_FREED_MEMORY. If we had any buildfarm critters running valgrind on RELCACHE_FORCE_RELEASE or CLOBBER_CACHE_ALWAYS builds, they'd have detected use of uninitialized memory here ... but I don't think we have any. (The combination of valgrind and CCA would probably be too slow to be practical :-(, though maybe somebody with a fast machine could do the other thing.) Not sure about a good fix for this. It seems annoying to copy the rel's whole partkey data structure into query-local storage, but I'm not sure we have any choice. On the bright side, there might be an opportunity to get rid of repeated runtime fmgr_info lookups in cross-type comparison situations. regards, tom lane
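In outline, the broken pattern described above looks like this (a schematic fragment for illustration, not the literal source):

	Relation	rel = relation_open(reloid, NoLock);

	/* points into the relcache entry itself, not a private copy */
	context->partsupfunc = RelationGetPartitionKey(rel)->partsupfunc;

	relation_close(rel, NoLock);

	/*
	 * After a relcache flush (or with RELCACHE_FORCE_RELEASE) the entry
	 * may be freed, so this reads freed memory.  With CLOBBER_FREED_MEMORY
	 * the clobbered fn_oid never matches, so the fresh fmgr_info call
	 * quietly masks the dangling pointer.
	 */
	if (cmpfn != context->partsupfunc[keyno].fn_oid)
		fmgr_info(cmpfn, &partsupfunc[keyno]);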
On 06/11/2018 06:24 PM, Tom Lane wrote: > > If we had any buildfarm critters running valgrind on > RELCACHE_FORCE_RELEASE or CLOBBER_CACHE_ALWAYS builds, they'd have > detected use of uninitialized memory here ... but I don't think we have > any. (The combination of valgrind and CCA would probably be too slow to > be practical :-(, though maybe somebody with a fast machine could do > the other thing.) > > I don't have a fast machine, but I do have a slow machine already running valgrind and not doing much else :-) Let's see how lousyjack does with -DRELCACHE_FORCE_RELEASE cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: > On 06/11/2018 06:24 PM, Tom Lane wrote: >> If we had any buildfarm critters running valgrind on >> RELCACHE_FORCE_RELEASE or CLOBBER_CACHE_ALWAYS builds, they'd have >> detected use of uninitialized memory here ... but I don't think we have >> any. (The combination of valgrind and CCA would probably be too slow to >> be practical :-(, though maybe somebody with a fast machine could do >> the other thing.) > I don't have a fast machine, but I do have a slow machine already > running valgrind and not doing much else :-) Let's see how lousyjack > does with -DRELCACHE_FORCE_RELEASE I just tried the case here, and it doesn't even get as far as any of the partitioning tests: it bombs out in inherit.sql :-(

==00:00:06:55.816 26107== Invalid read of size 4
==00:00:06:55.816 26107== at 0x5F3978: ATExecDropInherit (tablecmds.c:11928)
==00:00:06:55.816 26107== by 0x60212A: ATExecCmd (tablecmds.c:4241)
==00:00:06:55.816 26107== by 0x602CC4: ATController (tablecmds.c:3976)
==00:00:06:55.816 26107== by 0x7910EA: ProcessUtilitySlow (utility.c:1117)
==00:00:06:55.816 26107== by 0x79180F: standard_ProcessUtility (utility.c:920)
==00:00:06:55.816 26107== by 0x78D748: PortalRunUtility (pquery.c:1178)
==00:00:06:55.816 26107== by 0x78E6D0: PortalRunMulti (pquery.c:1331)
==00:00:06:55.816 26107== by 0x78EF8F: PortalRun (pquery.c:799)
==00:00:06:55.816 26107== by 0x78B35C: exec_simple_query (postgres.c:1122)
==00:00:06:55.816 26107== by 0x78C8B3: PostgresMain (postgres.c:4153)
==00:00:06:55.816 26107== by 0x70FBD5: PostmasterMain (postmaster.c:4361)
==00:00:06:55.816 26107== by 0x67AE4F: main (main.c:228)
==00:00:06:55.816 26107== Address 0xe823e90 is 179,504 bytes inside a recently re-allocated block of size 524,288 alloc'd
==00:00:06:55.816 26107== at 0x4A06A2E: malloc (vg_replace_malloc.c:270)
==00:00:06:55.816 26107== by 0x8BBB35: AllocSetAlloc (aset.c:923)
==00:00:06:55.816 26107== by 0x8C4473: palloc (mcxt.c:938)
==00:00:06:55.816 26107== by 0x488DEF: CreateTemplateTupleDesc (tupdesc.c:66)
==00:00:06:55.816 26107== by 0x88D2C0: RelationBuildDesc (relcache.c:416)
==00:00:06:55.816 26107== by 0x8904B5: RelationIdGetRelation (relcache.c:1943)
==00:00:06:55.816 26107== by 0x4C93BF: relation_open (heapam.c:1135)
==00:00:06:55.816 26107== by 0x4D8305: index_open (indexam.c:154)
==00:00:06:55.816 26107== by 0x62D6EB: ExecOpenIndices (execIndexing.c:197)
==00:00:06:55.816 26107== by 0x53B607: CatalogOpenIndexes (indexing.c:49)
==00:00:06:55.816 26107== by 0x556467: recordMultipleDependencies (pg_depend.c:112)
==00:00:06:55.816 26107== by 0x560D44: create_toast_table (toasting.c:385)

That one's pretty obvious when you look at the code:

/* keep our lock on the parent relation until commit */
heap_close(parent_rel, NoLock);

ObjectAddressSet(address, RelationRelationId, RelationGetRelid(parent_rel));

It looks like this might be a fruitful source of creepie-crawlies. regards, tom lane
On 12 June 2018 at 10:24, Tom Lane <tgl@sss.pgh.pa.us> wrote: > After looking closer, that code isn't just inefficient, it's flat > out broken. The reason is that ExecSetupPartitionPruneState thinks > it can store some pointers into the target relation's relcache entry > in the PartitionPruneContext, and then continue to use those pointers > after closing the relcache entry. Nope, you can't do that. I think the best fix is to just have a separate FmgrInfo for each step and partkey comparison. Some FmgrInfos will end up identical, but that's probably a small price to pay. Perhaps they should be separate anyway so that the fn_extra is not shared between different quals comparing to the same partition key? I went with an array of FmgrInfos rather than an array of pointers to FmgrInfos for cache efficiency. This does require that InvalidOid is 0, since I've palloc0'd that memory, and I'm checking if the cache is yet to be populated with: if (!OidIsValid(context->stepcmpfuncs[stateidx].fn_oid)) Patch attached. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Attachment
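The approach reduces to a lazily-populated cache, roughly like this simplified fragment (condensed from the description above; the real code is in the attached patch):

	/*
	 * stepcmpfuncs is a palloc0'd array of FmgrInfo, one slot per
	 * (pruning step, partition key) pair, so fn_oid starts out as
	 * InvalidOid (0) and marks a slot not yet populated.
	 */
	stateidx = PruneCxtStateIdx(context->partnatts, step_id, keyno);

	if (!OidIsValid(context->stepcmpfuncs[stateidx].fn_oid))
		fmgr_info(cmpfn, &context->stepcmpfuncs[stateidx]);	/* fill once */

	/* reuse the cached entry, preserving fn_extra across executions */
	partsupfunc = &context->stepcmpfuncs[stateidx];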
On 06/11/2018 06:41 PM, Andrew Dunstan wrote: > > > On 06/11/2018 06:24 PM, Tom Lane wrote: >> >> If we had any buildfarm critters running valgrind on >> RELCACHE_FORCE_RELEASE or CLOBBER_CACHE_ALWAYS builds, they'd have >> detected use of uninitialized memory here ... but I don't think we have >> any. (The combination of valgrind and CCA would probably be too slow to >> be practical :-(, though maybe somebody with a fast machine could do >> the other thing.) >> >> > > > I don't have a fast machine, but I do have a slow machine already > running valgrind and not doing much else :-) Let's see how lousyjack > does with -DRELCACHE_FORCE_RELEASE > > It added about 20% to the run time. That's tolerable, so I'll leave it on. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Jun 12, 2018 at 3:54 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Not sure about a good fix for this. It seems annoying to copy the > rel's whole partkey data structure into query-local storage, but > I'm not sure we have any choice. On the bright side, there might > be an opportunity to get rid of repeated runtime fmgr_info lookups > in cross-type comparison situations. > We already do that while building part_scheme. So, if we are in the planner, it's readily available through RelOptInfo. If we need it in the executor, we need to pass it down from RelOptInfo into one of the execution states. I haven't looked at the patch closely enough to figure out exactly which of these is true. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company
On Mon, Jun 11, 2018 at 6:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Not sure about a good fix for this. It seems annoying to copy the > rel's whole partkey data structure into query-local storage, but > I'm not sure we have any choice. On the bright side, there might > be an opportunity to get rid of repeated runtime fmgr_info lookups > in cross-type comparison situations. Is this the same issue I raised in https://www.postgresql.org/message-id/flat/CA%2BTgmoYKToP4-adCFFRNrO21OGuH%3Dphx-fiB1dYoqksNYX6YHQ%40mail.gmail.com or a similar issue that creeps up at execution time? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Jun 11, 2018 at 6:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Not sure about a good fix for this. It seems annoying to copy the >> rel's whole partkey data structure into query-local storage, but >> I'm not sure we have any choice. On the bright side, there might >> be an opportunity to get rid of repeated runtime fmgr_info lookups >> in cross-type comparison situations. > Is this the same issue I raised in > https://www.postgresql.org/message-id/flat/CA%2BTgmoYKToP4-adCFFRNrO21OGuH%3Dphx-fiB1dYoqksNYX6YHQ%40mail.gmail.com > or a similar issue that creeps up at execution time? Well, it's related to that: *if* we held the relcache entry open for the duration of the query, and *if* holding such a pin were sufficient to guarantee the contents of the entry's partition data couldn't change or even move, then we could avoid doing so much copying. But as we discussed then, neither condition is true, and I don't think either one is cheap to make true. Certainly there's no logic in the relcache to detect changes of partition data like we do for, say, triggers. regards, tom lane
On Tue, Jun 12, 2018 at 12:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Mon, Jun 11, 2018 at 6:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Not sure about a good fix for this. It seems annoying to copy the >>> rel's whole partkey data structure into query-local storage, but >>> I'm not sure we have any choice. On the bright side, there might >>> be an opportunity to get rid of repeated runtime fmgr_info lookups >>> in cross-type comparison situations. > >> Is this the same issue I raised in >> https://www.postgresql.org/message-id/flat/CA%2BTgmoYKToP4-adCFFRNrO21OGuH%3Dphx-fiB1dYoqksNYX6YHQ%40mail.gmail.com >> or a similar issue that creeps up at execution time? > > Well, it's related to that: *if* we held the relcache entry open for > the duration of the query, and *if* holding such a pin were sufficient > to guarantee the contents of the entry's partition data couldn't change > or even move, then we could avoid doing so much copying. But as we > discussed then, neither condition is true, and I don't think either one is > cheap to make true. Certainly there's no logic in the relcache to detect > changes of partition data like we do for, say, triggers. I think we DO hold relations open for the duration of execution (though not necessarily between planning and execution). And there is code in RelationClearRelation to avoid changing rd_partkey and rd_partdesc if no logical change has occurred. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > I think we DO hold relations open for the duration of execution > (though not necessarily between planning and execution). And there is > code in RelationClearRelation to avoid changing rd_partkey and > rd_partdesc if no logical change has occurred. Testing with valgrind + RELCACHE_FORCE_RELEASE is sufficient to disprove that, cf current results from lousyjack (which match my own testing). The partkey *is* disappearing under us. While I've not looked into the exact reasons for that, my first guess is that the partitioned table is not held open because it's not one of the ones to be scanned. Are you prepared to change something like that at this stage of the release cycle? regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > And there is > code in RelationClearRelation to avoid changing rd_partkey and > rd_partdesc if no logical change has occurred. Oh, and by the way, what's actually in there is keep_partkey = (relation->rd_partkey != NULL); I would be interested to see an explanation of why that isn't utterly broken. regards, tom lane
On Tue, Jun 12, 2018 at 12:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Testing with valgrind + RELCACHE_FORCE_RELEASE is sufficient to disprove > that, cf current results from lousyjack (which match my own testing). > The partkey *is* disappearing under us. > > While I've not looked into the exact reasons for that, my first guess > is that the partitioned table is not held open because it's not one > of the ones to be scanned. Are you prepared to change something like > that at this stage of the release cycle? The partition key is immutable, so it should NOT be able to disappear out from under us. Once you have defined the partitioning strategy for a table and the partitioning keys associated with it, you can't ever change it. The only reason we need keep_partkey at all, as opposed to just assuming that the relevant portion of the relcache entry can't change at all, is that during relation creation we are briefly in a state where the pg_class row exists and the pg_partitioned_table row hasn't been set up yet. So I think your guess that the relation is not kept open is likely to be correct. As for whether to change it at this point in the release cycle, I guess that, to me, depends on how invasive the fix is. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Jun 12, 2018 at 12:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> While I've not looked into the exact reasons for that, my first guess >> is that the partitioned table is not held open because it's not one >> of the ones to be scanned. Are you prepared to change something like >> that at this stage of the release cycle? > The partition key is immutable, so it should NOT be able to disappear > out from under us. Hm. That could be better documented. > As for whether to change it at this point in the release cycle, I > guess that, to me, depends on how invasive the fix is. It seems not to be that bad: we just need a shutdown call for the PartitionPruneState, and then we can remember the open relation there. The attached is based on David's patch from yesterday. I'm still a bit annoyed at the fmgr_info_copy calls in this. It'd be better to use the FmgrInfos in the relcache when applicable. However, mixing those with the cross-type ones would seem to require that we change the API for get_matching_hash_bounds et al from taking "FmgrInfo *" to taking "FmgrInfo **", which looks rather invasive. regards, tom lane diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c index 33513ff..4eeee7c 100644 *** a/src/backend/executor/execPartition.c --- b/src/backend/executor/execPartition.c *************** adjust_partition_tlist(List *tlist, Tupl *** 1357,1367 **** * * Functions: * ! * ExecSetupPartitionPruneState: * Creates the PartitionPruneState required by each of the two pruning * functions. Details stored include how to map the partition index * returned by the partition pruning code into subplan indexes. * * ExecFindInitialMatchingSubPlans: * Returns indexes of matching subplans. Partition pruning is attempted * without any evaluation of expressions containing PARAM_EXEC Params. --- 1357,1370 ---- * * Functions: * ! * ExecCreatePartitionPruneState: * Creates the PartitionPruneState required by each of the two pruning * functions. Details stored include how to map the partition index * returned by the partition pruning code into subplan indexes. * + * ExecDestroyPartitionPruneState: + * Deletes a PartitionPruneState. Must be called during executor shutdown. + * * ExecFindInitialMatchingSubPlans: * Returns indexes of matching subplans. Partition pruning is attempted * without any evaluation of expressions containing PARAM_EXEC Params. *************** adjust_partition_tlist(List *tlist, Tupl *** 1382,1389 **** */ /* ! * ExecSetupPartitionPruneState ! * Set up the data structure required for calling * ExecFindInitialMatchingSubPlans and ExecFindMatchingSubPlans. * * 'planstate' is the parent plan node's execution state. --- 1385,1392 ---- */ /* ! * ExecCreatePartitionPruneState ! * Build the data structure required for calling * ExecFindInitialMatchingSubPlans and ExecFindMatchingSubPlans. * * 'planstate' is the parent plan node's execution state. *************** adjust_partition_tlist(List *tlist, Tupl *** 1395,1401 **** * in each PartitionPruneInfo. */ PartitionPruneState * ! ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo) { PartitionPruneState *prunestate; PartitionPruningData *prunedata; --- 1398,1404 ---- * in each PartitionPruneInfo. */ PartitionPruneState * ! 
ExecCreatePartitionPruneState(PlanState *planstate, List *partitionpruneinfo) { PartitionPruneState *prunestate; PartitionPruningData *prunedata; *************** ExecSetupPartitionPruneState(PlanState * *** 1435,1445 **** PartitionPruningData *pprune = &prunedata[i]; PartitionPruneContext *context = &pprune->context; PartitionDesc partdesc; - Relation rel; PartitionKey partkey; - ListCell *lc2; int partnatts; int n_steps; /* * We must copy the subplan_map rather than pointing directly to the --- 1438,1447 ---- PartitionPruningData *pprune = &prunedata[i]; PartitionPruneContext *context = &pprune->context; PartitionDesc partdesc; PartitionKey partkey; int partnatts; int n_steps; + ListCell *lc2; /* * We must copy the subplan_map rather than pointing directly to the *************** ExecSetupPartitionPruneState(PlanState * *** 1456,1481 **** pprune->present_parts = bms_copy(pinfo->present_parts); /* ! * Grab some info from the table's relcache; lock was already obtained ! * by ExecLockNonLeafAppendTables. */ ! rel = relation_open(pinfo->reloid, NoLock); ! partkey = RelationGetPartitionKey(rel); ! partdesc = RelationGetPartitionDesc(rel); context->strategy = partkey->strategy; context->partnatts = partnatts = partkey->partnatts; ! context->partopfamily = partkey->partopfamily; ! context->partopcintype = partkey->partopcintype; context->partcollation = partkey->partcollation; context->partsupfunc = partkey->partsupfunc; ! context->nparts = pinfo->nparts; ! context->boundinfo = partition_bounds_copy(partdesc->boundinfo, partkey); context->planstate = planstate; /* Initialize expression state for each expression we need */ - n_steps = list_length(pinfo->pruning_steps); context->exprstates = (ExprState **) palloc0(sizeof(ExprState *) * n_steps * partnatts); foreach(lc2, pinfo->pruning_steps) --- 1458,1490 ---- pprune->present_parts = bms_copy(pinfo->present_parts); /* ! * We need to hold a pin on the partitioned table's relcache entry so ! * that we can rely on its copies of the table's partition key and ! * partition descriptor. We need not get a lock though; one should ! * have been acquired already by InitPlan or ! * ExecLockNonLeafAppendTables. */ ! context->partrel = relation_open(pinfo->reloid, NoLock); ! partkey = RelationGetPartitionKey(context->partrel); ! partdesc = RelationGetPartitionDesc(context->partrel); ! n_steps = list_length(pinfo->pruning_steps); context->strategy = partkey->strategy; context->partnatts = partnatts = partkey->partnatts; ! context->nparts = pinfo->nparts; ! context->boundinfo = partdesc->boundinfo; context->partcollation = partkey->partcollation; context->partsupfunc = partkey->partsupfunc; ! ! /* We'll look up type-specific support functions as needed */ ! context->stepcmpfuncs = (FmgrInfo *) ! palloc0(sizeof(FmgrInfo) * n_steps * partnatts); ! ! context->ppccontext = CurrentMemoryContext; context->planstate = planstate; /* Initialize expression state for each expression we need */ context->exprstates = (ExprState **) palloc0(sizeof(ExprState *) * n_steps * partnatts); foreach(lc2, pinfo->pruning_steps) *************** ExecSetupPartitionPruneState(PlanState * *** 1527,1534 **** prunestate->execparamids = bms_add_members(prunestate->execparamids, pinfo->execparamids); - relation_close(rel, NoLock); - i++; } --- 1536,1541 ---- *************** ExecSetupPartitionPruneState(PlanState * *** 1536,1541 **** --- 1543,1568 ---- } /* + * ExecDestroyPartitionPruneState + * Release resources at plan shutdown. 
+ * + * We don't bother to free any memory here, since the whole executor context + * will be going away shortly. We do need to release our relcache pins. + */ + void + ExecDestroyPartitionPruneState(PartitionPruneState *prunestate) + { + int i; + + for (i = 0; i < prunestate->num_partprunedata; i++) + { + PartitionPruningData *pprune = &prunestate->partprunedata[i]; + + relation_close(pprune->context.partrel, NoLock); + } + } + + /* * ExecFindInitialMatchingSubPlans * Identify the set of subplans that cannot be eliminated by initial * pruning (disregarding any pruning constraints involving PARAM_EXEC diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c index 6dd53e9..5ce4fb4 100644 *** a/src/backend/executor/nodeAppend.c --- b/src/backend/executor/nodeAppend.c *************** ExecInitAppend(Append *node, EState *est *** 136,143 **** /* We may need an expression context to evaluate partition exprs */ ExecAssignExprContext(estate, &appendstate->ps); ! prunestate = ExecSetupPartitionPruneState(&appendstate->ps, ! node->part_prune_infos); /* Perform an initial partition prune, if required. */ if (prunestate->do_initial_prune) --- 136,145 ---- /* We may need an expression context to evaluate partition exprs */ ExecAssignExprContext(estate, &appendstate->ps); ! /* Create the working data structure for pruning. */ ! prunestate = ExecCreatePartitionPruneState(&appendstate->ps, ! node->part_prune_infos); ! appendstate->as_prune_state = prunestate; /* Perform an initial partition prune, if required. */ if (prunestate->do_initial_prune) *************** ExecInitAppend(Append *node, EState *est *** 178,185 **** */ if (!prunestate->do_exec_prune) appendstate->as_valid_subplans = bms_add_range(NULL, 0, nplans - 1); - - appendstate->as_prune_state = prunestate; } else { --- 180,185 ---- *************** ExecEndAppend(AppendState *node) *** 330,335 **** --- 330,341 ---- */ for (i = 0; i < nplans; i++) ExecEndNode(appendplans[i]); + + /* + * release any resources associated with run-time pruning + */ + if (node->as_prune_state) + ExecDestroyPartitionPruneState(node->as_prune_state); } void diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c index 856bdd3..ee5826e 100644 *** a/src/backend/partitioning/partprune.c --- b/src/backend/partitioning/partprune.c *************** prune_append_rel_partitions(RelOptInfo * *** 436,449 **** if (contradictory) return NULL; context.strategy = rel->part_scheme->strategy; context.partnatts = rel->part_scheme->partnatts; - context.partopfamily = rel->part_scheme->partopfamily; - context.partopcintype = rel->part_scheme->partopcintype; - context.partcollation = rel->part_scheme->partcollation; - context.partsupfunc = rel->part_scheme->partsupfunc; context.nparts = rel->nparts; context.boundinfo = rel->boundinfo; /* These are not valid when being called from the planner */ context.planstate = NULL; --- 436,453 ---- if (contradictory) return NULL; + /* Set up PartitionPruneContext */ + context.partrel = NULL; context.strategy = rel->part_scheme->strategy; context.partnatts = rel->part_scheme->partnatts; context.nparts = rel->nparts; context.boundinfo = rel->boundinfo; + context.partcollation = rel->part_scheme->partcollation; + context.partsupfunc = rel->part_scheme->partsupfunc; + context.stepcmpfuncs = (FmgrInfo *) palloc0(sizeof(FmgrInfo) * + context.partnatts * + list_length(pruning_steps)); + context.ppccontext = CurrentMemoryContext; /* These are not valid when being called from the planner */ 
context.planstate = NULL; *************** perform_pruning_base_step(PartitionPrune *** 2809,2815 **** int keyno, nvalues; Datum values[PARTITION_MAX_KEYS]; ! FmgrInfo partsupfunc[PARTITION_MAX_KEYS]; /* * There better be the same number of expressions and compare functions. --- 2813,2820 ---- int keyno, nvalues; Datum values[PARTITION_MAX_KEYS]; ! FmgrInfo *partsupfunc; ! int stateidx; /* * There better be the same number of expressions and compare functions. *************** perform_pruning_base_step(PartitionPrune *** 2844,2850 **** if (lc1 != NULL) { Expr *expr; - int stateidx; Datum datum; bool isnull; --- 2849,2854 ---- *************** perform_pruning_base_step(PartitionPrune *** 2873,2891 **** return result; } ! /* ! * If we're going to need a different comparison function than ! * the one cached in the PartitionKey, we'll need to look up ! * the FmgrInfo. ! */ cmpfn = lfirst_oid(lc2); Assert(OidIsValid(cmpfn)); ! if (cmpfn != context->partsupfunc[keyno].fn_oid) ! fmgr_info(cmpfn, &partsupfunc[keyno]); ! else ! fmgr_info_copy(&partsupfunc[keyno], ! &context->partsupfunc[keyno], ! CurrentMemoryContext); values[keyno] = datum; nvalues++; --- 2877,2901 ---- return result; } ! /* Set up the stepcmpfuncs entry, unless we already did */ cmpfn = lfirst_oid(lc2); Assert(OidIsValid(cmpfn)); ! if (cmpfn != context->stepcmpfuncs[stateidx].fn_oid) ! { ! /* ! * If the needed support function is the same one cached ! * in the relation's partition key, copy the cached ! * FmgrInfo. Otherwise (i.e., when we have a cross-type ! * comparison), an actual lookup is required. ! */ ! if (cmpfn == context->partsupfunc[keyno].fn_oid) ! fmgr_info_copy(&context->stepcmpfuncs[stateidx], ! &context->partsupfunc[keyno], ! context->ppccontext); ! else ! fmgr_info_cxt(cmpfn, &context->stepcmpfuncs[stateidx], ! context->ppccontext); ! } values[keyno] = datum; nvalues++; *************** perform_pruning_base_step(PartitionPrune *** 2896,2901 **** --- 2906,2918 ---- } } + /* + * Point partsupfunc to the entry for the 0th key of this step; the + * additional support functions, if any, follow consecutively. + */ + stateidx = PruneCxtStateIdx(context->partnatts, opstep->step.step_id, 0); + partsupfunc = &context->stepcmpfuncs[stateidx]; + switch (context->strategy) { case PARTITION_STRATEGY_HASH: diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c index 3dfb1b8..dc69280 100644 *** a/src/backend/utils/cache/relcache.c --- b/src/backend/utils/cache/relcache.c *************** RelationClearRelation(Relation relation, *** 2401,2406 **** --- 2401,2407 ---- keep_tupdesc = equalTupleDescs(relation->rd_att, newrel->rd_att); keep_rules = equalRuleLocks(relation->rd_rules, newrel->rd_rules); keep_policies = equalRSDesc(relation->rd_rsdesc, newrel->rd_rsdesc); + /* partkey is immutable once set up, so we can always keep it */ keep_partkey = (relation->rd_partkey != NULL); keep_partdesc = equalPartitionDescs(relation->rd_partkey, relation->rd_partdesc, *************** RelationClearRelation(Relation relation, *** 2445,2451 **** SWAPFIELD(Form_pg_class, rd_rel); /* ... but actually, we don't have to update newrel->rd_rel */ memcpy(relation->rd_rel, newrel->rd_rel, CLASS_TUPLE_SIZE); ! /* preserve old tupledesc and rules if no logical change */ if (keep_tupdesc) SWAPFIELD(TupleDesc, rd_att); if (keep_rules) --- 2446,2452 ---- SWAPFIELD(Form_pg_class, rd_rel); /* ... but actually, we don't have to update newrel->rd_rel */ memcpy(relation->rd_rel, newrel->rd_rel, CLASS_TUPLE_SIZE); ! 
/* preserve old tupledesc, rules, policies if no logical change */ if (keep_tupdesc) SWAPFIELD(TupleDesc, rd_att); if (keep_rules) *************** RelationClearRelation(Relation relation, *** 2459,2471 **** SWAPFIELD(Oid, rd_toastoid); /* pgstat_info must be preserved */ SWAPFIELD(struct PgStat_TableStatus *, pgstat_info); ! /* partition key must be preserved, if we have one */ if (keep_partkey) { SWAPFIELD(PartitionKey, rd_partkey); SWAPFIELD(MemoryContext, rd_partkeycxt); } - /* preserve old partdesc if no logical change */ if (keep_partdesc) { SWAPFIELD(PartitionDesc, rd_partdesc); --- 2460,2471 ---- SWAPFIELD(Oid, rd_toastoid); /* pgstat_info must be preserved */ SWAPFIELD(struct PgStat_TableStatus *, pgstat_info); ! /* preserve old partitioning info if no logical change */ if (keep_partkey) { SWAPFIELD(PartitionKey, rd_partkey); SWAPFIELD(MemoryContext, rd_partkeycxt); } if (keep_partdesc) { SWAPFIELD(PartitionDesc, rd_partdesc); diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h index 71d639f..862bf65 100644 *** a/src/include/executor/execPartition.h --- b/src/include/executor/execPartition.h *************** extern HeapTuple ConvertPartitionTupleSl *** 208,215 **** TupleTableSlot **p_my_slot); extern void ExecCleanupTupleRouting(ModifyTableState *mtstate, PartitionTupleRouting *proute); ! extern PartitionPruneState *ExecSetupPartitionPruneState(PlanState *planstate, ! List *partitionpruneinfo); extern Bitmapset *ExecFindMatchingSubPlans(PartitionPruneState *prunestate); extern Bitmapset *ExecFindInitialMatchingSubPlans(PartitionPruneState *prunestate, int nsubplans); --- 208,216 ---- TupleTableSlot **p_my_slot); extern void ExecCleanupTupleRouting(ModifyTableState *mtstate, PartitionTupleRouting *proute); ! extern PartitionPruneState *ExecCreatePartitionPruneState(PlanState *planstate, ! List *partitionpruneinfo); ! extern void ExecDestroyPartitionPruneState(PartitionPruneState *prunestate); extern Bitmapset *ExecFindMatchingSubPlans(PartitionPruneState *prunestate); extern Bitmapset *ExecFindInitialMatchingSubPlans(PartitionPruneState *prunestate, int nsubplans); diff --git a/src/include/partitioning/partprune.h b/src/include/partitioning/partprune.h index e3b3bfb..09147b5 100644 *** a/src/include/partitioning/partprune.h --- b/src/include/partitioning/partprune.h *************** *** 20,68 **** /* * PartitionPruneContext * ! * Information about a partitioned table needed to perform partition pruning. */ typedef struct PartitionPruneContext { ! /* Partition key information */ char strategy; int partnatts; - Oid *partopfamily; - Oid *partopcintype; - Oid *partcollation; - FmgrInfo *partsupfunc; - - /* Number of partitions */ int nparts; - - /* Partition boundary info */ PartitionBoundInfo boundinfo; ! ! /* ! * This will be set when the context is used from the executor, to allow ! * Params to be evaluated. ! */ PlanState *planstate; - - /* - * Array of ExprStates, indexed as per PruneCtxStateIdx; one for each - * partkey in each pruning step. Allocated if planstate is non-NULL, - * otherwise NULL. - */ ExprState **exprstates; - - /* - * Similar array of flags, each true if corresponding 'exprstate' - * expression contains any PARAM_EXEC Params. (Can be NULL if planstate - * is NULL.) 
- */ bool *exprhasexecparam; - - /* true if it's safe to evaluate PARAM_EXEC Params */ bool evalexecparams; } PartitionPruneContext; #define PruneCxtStateIdx(partnatts, step_id, keyno) \ ((partnatts) * (step_id) + (keyno)) --- 20,76 ---- /* * PartitionPruneContext + * Stores information needed at runtime for pruning computations + * related to a single partitioned table. * ! * partrel Relcache pointer for the partitioned table, ! * if we have it open (else NULL). ! * strategy Partition strategy, e.g. LIST, RANGE, HASH. ! * partnatts Number of columns in the partition key. ! * nparts Number of partitions in this partitioned table. ! * boundinfo Partition boundary info for the partitioned table. ! * partcollation Array of partnatts elements, storing the collations of the ! * partition key columns. ! * partsupfunc Array of FmgrInfos for the comparison or hashing functions ! * associated with the partition keys (partnatts elements). ! * (This points into the partrel's partition key, typically.) ! * stepcmpfuncs Array of FmgrInfos for the comparison or hashing function ! * for each pruning step and partition key. ! * ppccontext Memory context holding this PartitionPruneContext's ! * subsidiary data, such as the FmgrInfos. ! * planstate Points to the parent plan node's PlanState when called ! * during execution; NULL when called from the planner. ! * exprstates Array of ExprStates, indexed as per PruneCtxStateIdx; one ! * for each partition key in each pruning step. Allocated if ! * planstate is non-NULL, otherwise NULL. ! * exprhasexecparam Array of bools, each true if corresponding 'exprstate' ! * expression contains any PARAM_EXEC Params. (Can be NULL ! * if planstate is NULL.) ! * evalexecparams True if it's safe to evaluate PARAM_EXEC Params. */ typedef struct PartitionPruneContext { ! Relation partrel; char strategy; int partnatts; int nparts; PartitionBoundInfo boundinfo; ! Oid *partcollation; ! FmgrInfo *partsupfunc; ! FmgrInfo *stepcmpfuncs; ! MemoryContext ppccontext; PlanState *planstate; ExprState **exprstates; bool *exprhasexecparam; bool evalexecparams; } PartitionPruneContext; + /* + * PruneCxtStateIdx() computes the correct index into the stepcmpfuncs[], + * exprstates[] and exprhasexecparam[] arrays for step step_id and + * partition key column keyno. (Note: there is code that assumes the + * entries for a given step are sequential, so this is not chosen freely.) + */ #define PruneCxtStateIdx(partnatts, step_id, keyno) \ ((partnatts) * (step_id) + (keyno))
On 13 June 2018 at 16:15, Tom Lane <tgl@sss.pgh.pa.us> wrote: > It seems not to be that bad: we just need a shutdown call for the > PartitionPruneState, and then we can remember the open relation there. > The attached is based on David's patch from yesterday. > > I'm still a bit annoyed at the fmgr_info_copy calls in this. It'd be > better to use the FmgrInfos in the relcache when applicable. However, > mixing those with the cross-type ones would seem to require that we change > the API for get_matching_hash_bounds et al from taking "FmgrInfo *" to > taking "FmgrInfo **", which looks rather invasive. I've looked over this and it seems better than mine, especially the way you've arranged for the FmgrInfo to be copied into a memory context that's not about to get reset. One small thing is that I'd move the: context.partrel = NULL; to under: /* These are not valid when being called from the planner */ -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jun 13, 2018 at 12:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> As for whether to change it at this point in the release cycle, I >> guess that, to me, depends on how invasive the fix is. > > It seems not to be that bad: we just need a shutdown call for the > PartitionPruneState, and then we can remember the open relation there. > The attached is based on David's patch from yesterday. Seems reasonable. Really, I think we should look for a way to hang onto the relation at the point where it's originally opened and locked instead of reopening it here. But that's probably more invasive than we can really justify right at the moment, and I think this is a step in a good direction. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > Seems reasonable. Really, I think we should look for a way to hang > onto the relation at the point where it's originally opened and locked > instead of reopening it here. But that's probably more invasive than > we can really justify right at the moment, and I think this is a step > in a good direction. The existing coding there makes me itch a bit, because there's only a rather fragile line of reasoning justifying the assumption that there is a pre-existing lock at all. So I'd be in favor of what you suggest just to get rid of the "open(NoLock)" hazard. But I agree that it'd be rather invasive and right now is probably not the time for it. regards, tom lane
David Rowley <david.rowley@2ndquadrant.com> writes: > On 13 June 2018 at 16:15, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> It seems not to be that bad: we just need a shutdown call for the >> PartitionPruneState, and then we can remember the open relation there. >> The attached is based on David's patch from yesterday. > I've looked over this and it seems better than mine. Especially so > that you've done things so that the FmgrInfo is copied into a memory > context that's not about to get reset. Pushed, thanks for looking it over. > One small thing is that I'd move the: > context.partrel = NULL; > to under: > /* These are not valid when being called from the planner */ Judgment call I guess. I tend to initialize struct fields in field order unless there's a good reason to do differently, just so it's easier to confirm that none were overlooked. But I can see the point of your suggestion too, so done that way. There's still one thing I'm a bit confused about here. I noticed that we weren't actually using the partopfamily and partopcintype fields in PartitionPruneContext, so I removed them. But that still leaves both partsupfunc and partcollation as pointers into the relcache that were subject to this hazard. My testing agrees with lousyjack's results that both of those were, in fact, being improperly accessed. The OID comparison effect I mentioned upthread explains why the buildfarm's cache-clobbering members failed to notice any problem with garbage partsupfunc data ... but why did we not see any complaints about invalid collation OIDs? Tis strange. regards, tom lane
On 2018/06/13 23:39, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Seems reasonable. Really, I think we should look for a way to hang >> onto the relation at the point where it's originally opened and locked >> instead of reopening it here. But that's probably more invasive than >> we can really justify right at the moment, and I think this is a step >> in a good direction. > > The existing coding there makes me itch a bit, because there's only a > rather fragile line of reasoning justifying the assumption that there is a > pre-existing lock at all. So I'd be in favor of what you suggest just to > get rid of the "open(NoLock)" hazard. But I agree that it'd be rather > invasive and right now is probably not the time for it. I had sent a patch to try to get rid of the open(NoLock) there a couple of months ago [1]. The idea was to both lock and open the relation in ExecLockNonLeafAppendTables, which is the first time all partitioned tables in a given Append node are locked for execution. Also, the patch makes it a responsibility of ExecEndAppend to release the relcache pins, so the recently added ExecDestroyPartitionPruneState would not be needed. Attached is a rebased version of that patch if there is interest in it. Thanks, Amit [1] https://www.postgresql.org/message-id/0b361a22-f995-e15c-a385-6d1b72dd0d13%40lab.ntt.co.jp
Attachment
On 14 June 2018 at 04:10, Tom Lane <tgl@sss.pgh.pa.us> wrote: > There's still one thing I'm a bit confused about here. I noticed that > we weren't actually using the partopfamily and partopcintype fields in > PartitionPruneContext, so I removed them. But that still leaves both > partsupfunc and partcollation as pointers into the relcache that were > subject to this hazard. My testing agrees with lousyjack's results > that both of those were, in fact, being improperly accessed. The OID > comparison effect I mentioned upthread explains why the buildfarm's > cache-clobbering members failed to notice any problem with garbage > partsupfunc data ... but why did we not see any complaints about invalid > collation OIDs? Tis strange. FWIW, it's not working for me before e23bae82cf3 with CLOBBER_FREED_MEMORY, CLOBBER_CACHE_ALWAYS and RELCACHE_FORCE_RELEASE, and: create table listp (a text) partition by list(a); create table listp1 partition of listp for values in ('1'); select * from listp where a = (select '1'); I get: ERROR: cache lookup failed for collation 2139062143 -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 14 June 2018 at 19:17, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: > I had sent a patch to try to get rid of the open(NoLock) there a couple of > months ago [1]. The idea was to both lock and open the relation in > ExecLockNonLeafAppendTables, which is the first time all partitioned tables in > a given Append node are locked for execution. Also, the patch makes it a > responsibility of ExecEndAppend to release the relcache pins, so the > recently added ExecDestroyPartitionPruneState would not be needed. Robert and I briefly discussed something more complete a while back. Not much detail was talked about, but the basic idea was to store the Relation somewhere in the planner and executor that we could look up by rel index rather than having to relation_open all the time. I just had a very quick look around and thought maybe RangeTblEntry might be a good spot to store the Relation and current lock level that we hold on that relation. This means that we can use PlannerInfo->simple_rte_array in the planner and EState->es_range_table in the executor. The latter of those would be much nicer if we built an array instead of keeping the list (can probably build that during InitPlan()). We could then get hold of a Relation object when needed without having to do relation_open(..., NoLock). Code which currently looks like:

reloid = getrelid(scanrelid, estate->es_range_table);
rel = heap_open(reloid, lockmode);

in places like ExecOpenScanRelation could be replaced with a wrapper function call like rte_open_rel(estate, scanrelid, lockmode). This could also be used to ERROR out if rte_open_rel() was done initially with NoLock. Right now there are so many places that we do relation_open(..., NoLock) and write a comment /* Lock already taken by ... */, which we hope is always true. If the Relation is already populated in the RangeTblEntry then the function would just return that; otherwise, we'd just look inside the RangeTblEntry for the relation Oid and do a heap_open on it, and store the lockmode that's held. We could then also consider getting rid of a bunch of fields like boundinfo and nparts from RelOptInfo. We could also perhaps do a WARNING about lock upgrade hazards in there, at least maybe in debug builds. However, I only spent about 10 mins looking into this, there may be some giant holes in the idea. It would need much more research. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
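A wrapper along these lines might look like the following sketch. The rel and lockmode fields on RangeTblEntry are hypothetical additions that do not exist; rt_fetch and heap_open are the existing APIs:

static Relation
rte_open_rel(EState *estate, Index rti, LOCKMODE lockmode)
{
	RangeTblEntry *rte = rt_fetch(rti, estate->es_range_table);

	if (rte->rel == NULL)
	{
		/* the first open must state a real lock level */
		if (lockmode == NoLock)
			elog(ERROR, "relation %u first opened with NoLock", rte->relid);
		rte->rel = heap_open(rte->relid, lockmode);
		rte->lockmode = lockmode;
	}
	else if (lockmode > rte->lockmode)
	{
		/* crude ordering check; lock modes are numbered by strength */
		elog(WARNING, "lock upgrade hazard on relation %u", rte->relid);
	}

	return rte->rel;
}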
On Thu, Jun 14, 2018 at 7:23 AM, David Rowley <david.rowley@2ndquadrant.com> wrote: > However, I only spent about 10 mins looking into this, there may be > some giant holes in the idea. It would need much more research. It kind of flies in the face of the idea that a RangeTblEntry is just a node that can be freely copied around, serialized and deserialized, etc. I think it would be better to keep the pointer in the RelOptInfo in the planner and in the EState or PlanState in the executor. Those are things we don't think can be copied, serialized, etc. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Jun 14, 2018 at 7:23 AM, David Rowley > <david.rowley@2ndquadrant.com> wrote: >> However, I only spent about 10 mins looking into this, there may be >> some giant holes in the idea. It would need much more research. > It kind of flies in the face of the idea that a RangeTblEntry is just > a node that can be freely copied around, serialized and deserialized, > etc. And also the idea that the Plan tree is read-only to the executor, which is not a good property to give up. > I think it would be better to keep the pointer in the RelOptInfo in > the planner and in the EState or PlanState in the executor. Those are > things we don't think can be copied, serialized, etc. Yeah, RelOptInfo seems like the natural place in the planner; we might need index relcache links in IndexOptInfo, too. I'm less sure what to do in the executor. We already do keep open relation pointers in PlanStates; the problem is just that it's node-type-specific (ss_currentRelation, iss_RelationDesc, etc). Perhaps that's unavoidable and we should just add more such fields as needed. regards, tom lane
On 2018/06/14 23:40, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Thu, Jun 14, 2018 at 7:23 AM, David Rowley >> <david.rowley@2ndquadrant.com> wrote: >>> However, I only spent about 10 mins looking into this, there may be >>> some giant holes in the idea. It would need much more research. > >> It kind of flies in the face of the idea that a RangeTblEntry is just >> a node that can be freely copied around, serialized and deserialized, >> etc. > > And also the idea that the Plan tree is read-only to the executor, > which is not a good property to give up. > >> I think it would be better to keep the pointer in the RelOptInfo in >> the planner and in the EState or PlanState in the executor. Those are >> things we don't think can be copied, serialized, etc. > > Yeah, RelOptInfo seems like the natural place in the planner; we might > need index relcache links in IndexOptInfo, too. > > I'm less sure what to do in the executor. We already do keep open > relation pointers in PlanStates; the problem is just that it's > node-type-specific (ss_currentRelation, iss_RelationDesc, etc). Perhaps > that's unavoidable and we should just add more such fields as needed. The patch I mentioned upthread maintains an array of Relation pointers in AppendState with as many members as there are in the partitioned_rels list that appears in the corresponding Append plan. I revised that patch a bit to rename ExecLockNonLeafAppendTables to ExecOpenAppendPartitionedTables to sound consistent with ExecOpenScanRelation et al. Thanks, Amit
Attachment
On 2018/06/14 20:23, David Rowley wrote: > On 14 June 2018 at 19:17, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote: >> I had sent a patch to try to get rid of the open(NoLock) there a couple of >> months ago [1]. The idea was to both lock and open the relation in >> ExecLockNonLeafAppendTables, which is the first time all partitioned tables in >> a given Append node are locked for execution. Also, the patch makes it a >> responsibility of ExecEndAppend to release the relcache pins, so the >> recently added ExecDestroyPartitionPruneState would not be needed. > > Robert and I briefly discussed something more complete a while back. > Not much detail was talked about, but the basic idea was to store the > Relation somewhere in the planner and executor that we could look up by > rel index rather than having to relation_open all the time. > > I just had a very quick look around and thought maybe RangeTblEntry > might be a good spot to store the Relation and current lock level that > we hold on that relation. This means that we can use > PlannerInfo->simple_rte_array in the planner and > EState->es_range_table in the executor. The latter of those would be > much nicer if we built an array instead of keeping the list (can > probably build that during InitPlan()). We could then get hold of a > Relation object when needed without having to do relation_open(..., > NoLock). > > Code which currently looks like: > > reloid = getrelid(scanrelid, estate->es_range_table); > rel = heap_open(reloid, lockmode); > > in places like ExecOpenScanRelation could be replaced with a > wrapper function call like rte_open_rel(estate, scanrelid, lockmode). > > This could also be used to ERROR out if rte_open_rel() was done > initially with NoLock. Right now there are so many places that we do > relation_open(..., NoLock) and write a comment /* Lock already taken > by ... */, which we hope is always true. > > If the Relation is already populated in the RangeTblEntry then the > function would just return that; otherwise, we'd just look inside the > RangeTblEntry for the relation Oid and do a heap_open on it, and store > the lockmode that's held. We could then also consider getting rid of a > bunch of fields like boundinfo and nparts from RelOptInfo. > > We could also perhaps do a WARNING about lock upgrade hazards in > there, at least maybe in debug builds. > > However, I only spent about 10 mins looking into this, there may be > some giant holes in the idea. It would need much more research. We will also need to consider caching of plans, that is of PlannedStmts, which in turn contain RangeTblEntry nodes. The idea that we can open a relation at the beginning or even before planning and keep using the Relation pointer all the way to the end of execution of a query seems hard to realize. As others pointed out, we need to think of planning and execution as separate phases and will have to open/close relations separately. Thanks, Amit
On 06/12/2018 07:47 AM, Andrew Dunstan wrote: > > > On 06/11/2018 06:41 PM, Andrew Dunstan wrote: >> >> >> On 06/11/2018 06:24 PM, Tom Lane wrote: >>> >>> If we had any buildfarm critters running valgrind on >>> RELCACHE_FORCE_RELEASE or CLOBBER_CACHE_ALWAYS builds, they'd have >>> detected use of uninitialized memory here ... but I don't think we have >>> any. (The combination of valgrind and CCA would probably be too >>> slow to >>> be practical :-(, though maybe somebody with a fast machine could do >>> the other thing.) >>> >>> >> >> >> I don't have a fast machine, but I do have a slow machine already >> running valgrind and not doing much else :-) Let's see how lousyjack >> does with -DRELCACHE_FORCE_RELEASE >> >> > > > It added about 20% to the run time. That's tolerable, so I'll leave it > on. > > OK, lousyjack is back online with this, new and improved. It currently takes 7.5 hours for a run. Should I also add -DCATCACHE_FORCE_RELEASE? cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: > OK, lousyjack is back online with this, new and improved. It currently > takes 7.5 hours for a run. Should I also add -DCATCACHE_FORCE_RELEASE? I did some experimentation yesterday with valgrind plus both RELCACHE_FORCE_RELEASE and CATCACHE_FORCE_RELEASE. I didn't find any new bugs, but adding CATCACHE_FORCE_RELEASE makes things quite a lot slower :-(. Not sure it's worth it. regards, tom lane
On 06/15/2018 10:02 AM, Tom Lane wrote: > Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes: >> OK, lousyjack is back online with this, new and improved. It currently >> takes 7.5 hours for a run. Should I also add -DCATCACHE_FORCE_RELEASE? > I did some experimentation yesterday with valgrind plus both > RELCACHE_FORCE_RELEASE and CATCACHE_FORCE_RELEASE. I didn't find > any new bugs, but adding CATCACHE_FORCE_RELEASE makes things quite > a lot slower :-(. Not sure it's worth it. > > OK. cheers andrew -- Andrew Dunstan https://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: > [ 0001-Open-partitioned-tables-during-Append-initialization.patch ] I took a look at this. While I'm in agreement with the general idea of holding open the partitioned relations' relcache entries throughout the query, I do not like anything about this patch: * It seems to be outright broken for the case that any of the partitioned relations are listed in nonleafResultRelations. If we're going to do it like this, we have to open every one of the partrels regardless of that. (I wonder whether we couldn't lose PlannedStmt.nonleafResultRelations altogether, in favor of merging the related code in InitPlan with this. That existing code is already a mighty ugly wart, and this patch makes it worse by adding new, related warts elsewhere.) * You've got *far* too much intimate knowledge of the possible callers in ExecOpenAppendPartitionedTables. Personally, what I would have this function do is return a List of the opened Relation pointers, and add a matching function to run through such a List and close the entries again, and make the callers responsible for stashing the List pointer in an appropriate field in their planstate. Or maybe what we should do is drop ExecLockNonLeafAppendTables/ ExecOpenAppendPartitionedTables entirely and teach InitPlan to do it. regards, tom lane
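The shape being suggested could look roughly like this sketch (assuming, as the thread does, that suitable locks were already taken upstream; this is an illustration, not the actual patch):

static List *
ExecOpenAppendPartitionedTables(List *partitioned_rels, EState *estate)
{
	List	   *openedrels = NIL;
	ListCell   *lc;

	foreach(lc, partitioned_rels)
	{
		Index		rti = lfirst_int(lc);
		Oid			relid = getrelid(rti, estate->es_range_table);

		/* lock already taken upstream; just pin the relcache entry */
		openedrels = lappend(openedrels, heap_open(relid, NoLock));
	}
	return openedrels;
}

static void
ExecCloseAppendPartitionedTables(List *openedrels)
{
	ListCell   *lc;

	foreach(lc, openedrels)
		heap_close((Relation) lfirst(lc), NoLock);
	list_free(openedrels);
}

The caller would stash the returned List in a field of its planstate and run the matching close function at node shutdown.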
On 2018/06/19 2:05, Tom Lane wrote: > Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes: >> [ 0001-Open-partitioned-tables-during-Append-initialization.patch ] > > I took a look at this. While I'm in agreement with the general idea of > holding open the partitioned relations' relcache entries throughout the > query, I do not like anything about this patch: Thanks for taking a look at it and sorry about the delay in replying. > * It seems to be outright broken for the case that any of the partitioned > relations are listed in nonleafResultRelations. If we're going to do it > like this, we have to open every one of the partrels regardless of that. Yes, that was indeed wrong. > (I wonder whether we couldn't lose PlannedStmt.nonleafResultRelations > altogether, in favor of merging the related code in InitPlan with this. Hmm, PlannedStmt.nonleafResultRelations exists for the same reason that PlannedStmt.resultRelations does, that is:

/*
 * initialize result relation stuff, and open/lock the result rels.
 *
 * We must do this before initializing the plan tree, else we might try to
 * do a lock upgrade if a result rel is also a source rel.
 */

nonleafResultRelations contains members of the partitioned_rels lists of all ModifyTable nodes contained in a plan. > That existing code is already a mighty ugly wart, and this patch makes > it worse by adding new, related warts elsewhere.) I just realized that there is a thinko in the following piece of code in ExecLockNonLeafAppendTables:

/* If this is a result relation, already locked in InitPlan */
foreach(l, stmt->nonleafResultRelations)
{
	if (rti == lfirst_int(l))
	{
		is_result_rel = true;
		break;
	}
}

It should actually be:

/* If this is a result relation, already locked in InitPlan */
foreach(l, stmt->nonleafResultRelations)
{
	Index		nonleaf_rti = lfirst_int(l);
	Oid			nonleaf_relid = getrelid(nonleaf_rti, estate->es_range_table);

	if (relid == nonleaf_relid)
	{
		is_result_rel = true;
		break;
	}
}

RT indexes in, say, Append.partitioned_rels, are distinct from those in PlannedStmt.nonleafResultRelations, so the existing test never succeeds, as is also evident from the coverage report: https://coverage.postgresql.org/src/backend/executor/execUtils.c.gcov.html#864 I'm wondering if we couldn't just get rid of this code. If an input partitioned table is indeed also a result relation, then we would've locked it in InitPlan with RowExclusiveLock, so heap_opening it with a weaker lock (RowShare/AccessShare) wouldn't hurt. > * You've got *far* too much intimate knowledge of the possible callers > in ExecOpenAppendPartitionedTables. > > Personally, what I would have this function do is return a List of > the opened Relation pointers, and add a matching function to run through > such a List and close the entries again, and make the callers responsible > for stashing the List pointer in an appropriate field in their planstate. OK, I rewrote it to work that way. > Or maybe what we should do is drop ExecLockNonLeafAppendTables/ > ExecOpenAppendPartitionedTables entirely and teach InitPlan to do it. Hmm, for InitPlan to do what ExecOpenAppendPartitionedTables does, we'd have to have all the partitioned tables (contained in the partitioned_rels fields of all Append/MergeAppend/ModifyTable nodes of a plan) also listed in a global list like the rootResultRelations and nonleafResultRelations of PlannedStmt. Attached updated patch. Thanks, Amit
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
> On 2018/06/19 2:05, Tom Lane wrote:
>> Or maybe what we should do is drop ExecLockNonLeafAppendTables/
>> ExecOpenAppendPartitionedTables entirely and teach InitPlan to do it.

> Hmm, for InitPlan to do what ExecOpenAppendPartitionedTables does, we'd
> have to have all the partitioned tables (contained in partitioned_rels
> fields of all Append/MergeAppend/ModifyTable nodes of a plan) also listed
> in a global list like rootResultRelations and nonleafResultRelations of
> PlannedStmt.

> Attached updated patch.

I've been looking at this patch, and while it's not unreasonable on its
own terms, I'm growing increasingly distressed at the ad-hoc and rather
duplicative nature of the data structures that have gotten stuck into
plan trees thanks to partitioning (the rootResultRelations and
nonleafResultRelations lists being prime examples).

It struck me this morning that a whole lot of the complication here is
solely due to needing to identify the right type of relation lock to take
during executor startup, and that THAT WORK IS TOTALLY USELESS.  In every
case, we must already hold a suitable lock before we ever get to the
executor; either one acquired during the parse/plan pipeline, or one
re-acquired by AcquireExecutorLocks in the case of a cached plan.
Otherwise it's entirely possible that the plan has been invalidated by
concurrent DDL --- and it's not the executor's job to detect that and
re-plan; that *must* have been done upstream.

Moreover, it's important from a deadlock-avoidance standpoint that these
locks get acquired in the same order as they were acquired during the
initial parse/plan pipeline.  I think it's reasonable to assume they were
acquired in RTE order in that stage, so AcquireExecutorLocks is OK.  But,
if the current logic in the executor gets them in that order, it's both
non-obvious that it does so and horribly fragile if it does, seeing that
the responsibility for this is split across InitPlan,
ExecOpenScanRelation, and ExecLockNonLeafAppendTables.

So I'm thinking that what we really ought to do here is simplify executor
startup to just open all rels with NoLock, and get rid of any supporting
data structures that turn out to have no other use.  (David Rowley's
nearby patch to create a properly hierarchical executor data structure
for partitioning info is likely to tie into this too, by removing some
other vestigial uses of those lists.)

I think that this idea has been discussed in the past, and we felt at
the time that having the executor take its own locks was a good safety
measure, and a relatively cheap one since the lock manager is pretty
good at short-circuiting duplicative lock requests.  But those are
certainly not free.  Moreover, I'm not sure that this is really a
safety measure at all: if the executor were really taking any lock
not already held, it'd be masking a DDL hazard.

To investigate this further, I made the attached not-meant-for-commit
hack to verify whether InitPlan and related executor startup functions
were actually taking any not-previously-held locks.  I could only find
one such case: the parser always opens tables selected FOR UPDATE with
RowShareLock, but if we end up implementing the resulting row mark
with ROW_MARK_COPY, the executor is acquiring just AccessShareLock
(because ExecOpenScanRelation thinks it needs to acquire some lock).
The patch as presented hacks ExecOpenScanRelation to avoid that, and
it passes check-world.
What we'd be better off doing, if we go this route, is to install an
assertion-build-only test that verifies during relation_open(NoLock)
that some kind of lock is already held on the rel.  That would protect
not only the executor, but a boatload of existing places that open
rels with NoLock on the currently-unverified assumption that a lock is
already held.

I'm also rather strongly tempted to add a field to RangeTblEntry
that records the appropriate lock strength to take, so that we don't
have to rely on keeping AcquireExecutorLocks' logic to decide on the
lock type in sync with whatever the parse/plan pipeline does.  (One
could then imagine adding assertions in the executor that this field
shows a lock strength of at least X, in place of actually opening
the rel with X.)

BTW, there'd be a lot to be said for having InitPlan just open all
the rels and build an array of Relation pointers that parallels the
RTE list, rather than doing heap_opens in random places elsewhere.

			regards, tom lane

diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 8026fe2..58c62f3 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -817,6 +817,7 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 	TupleDesc	tupType;
 	ListCell   *l;
 	int			i;
+	bool		gotlock;
 
 	/*
 	 * Do permissions checks
@@ -852,7 +853,9 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 			Relation	resultRelation;
 
 			resultRelationOid = getrelid(resultRelationIndex, rangeTable);
-			resultRelation = heap_open(resultRelationOid, RowExclusiveLock);
+			gotlock = LockRelationOid(resultRelationOid, RowExclusiveLock);
+			Assert(!gotlock || IsParallelWorker());
+			resultRelation = heap_open(resultRelationOid, NoLock);
 
 			InitResultRelInfo(resultRelInfo,
 							  resultRelation,
@@ -892,7 +895,9 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 				Relation	resultRelDesc;
 
 				resultRelOid = getrelid(resultRelIndex, rangeTable);
-				resultRelDesc = heap_open(resultRelOid, RowExclusiveLock);
+				gotlock = LockRelationOid(resultRelOid, RowExclusiveLock);
+				Assert(!gotlock || IsParallelWorker());
+				resultRelDesc = heap_open(resultRelOid, NoLock);
 				InitResultRelInfo(resultRelInfo,
 								  resultRelDesc,
 								  lfirst_int(l),
@@ -912,8 +917,11 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 				/* We locked the roots above. */
 				if (!list_member_int(plannedstmt->rootResultRelations,
 									 resultRelIndex))
-					LockRelationOid(getrelid(resultRelIndex, rangeTable),
-									RowExclusiveLock);
+				{
+					gotlock = LockRelationOid(getrelid(resultRelIndex, rangeTable),
+											  RowExclusiveLock);
+					Assert(!gotlock || IsParallelWorker());
+				}
 			}
 		}
 	}
@@ -963,10 +971,14 @@ InitPlan(QueryDesc *queryDesc, int eflags)
 			case ROW_MARK_NOKEYEXCLUSIVE:
 			case ROW_MARK_SHARE:
 			case ROW_MARK_KEYSHARE:
-				relation = heap_open(relid, RowShareLock);
+				gotlock = LockRelationOid(relid, RowShareLock);
+				Assert(!gotlock || IsParallelWorker());
+				relation = heap_open(relid, NoLock);
 				break;
 			case ROW_MARK_REFERENCE:
-				relation = heap_open(relid, AccessShareLock);
+				gotlock = LockRelationOid(relid, AccessShareLock);
+				Assert(!gotlock || IsParallelWorker());
+				relation = heap_open(relid, NoLock);
 				break;
 			case ROW_MARK_COPY:
 				/* no physical table access is required */
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index b963cae..cf08b50 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -42,6 +42,7 @@
 
 #include "postgres.h"
 
+#include "access/parallel.h"
 #include "access/relscan.h"
 #include "access/transam.h"
 #include "executor/executor.h"
@@ -645,6 +646,7 @@ ExecOpenScanRelation(EState *estate, Index scanrelid, int eflags)
 	Relation	rel;
 	Oid			reloid;
 	LOCKMODE	lockmode;
+	bool		gotlock;
 
 	/*
 	 * Determine the lock type we need.  First, scan to see if target relation
@@ -659,13 +661,19 @@ ExecOpenScanRelation(EState *estate, Index scanrelid, int eflags)
 		/* Keep this check in sync with InitPlan! */
 		ExecRowMark *erm = ExecFindRowMark(estate, scanrelid, true);
 
-		if (erm != NULL && erm->relation != NULL)
+		/* HACK: assume things are OK for ROW_MARK_COPY case */
+		if (erm != NULL)
 			lockmode = NoLock;
 	}
 
 	/* Open the relation and acquire lock as needed */
 	reloid = getrelid(scanrelid, estate->es_range_table);
-	rel = heap_open(reloid, lockmode);
+	if (lockmode != NoLock)
+	{
+		gotlock = LockRelationOid(reloid, lockmode);
+		Assert(!gotlock || IsParallelWorker());
+	}
+	rel = heap_open(reloid, NoLock);
 
 	/*
 	 * Complain if we're attempting a scan of an unscannable relation, except
@@ -874,6 +882,7 @@ ExecLockNonLeafAppendTables(List *partitioned_rels, EState *estate)
 		Index		rti = lfirst_int(lc);
 		bool		is_result_rel = false;
 		Oid			relid = getrelid(rti, estate->es_range_table);
+		bool		gotlock;
 
 		/* If this is a result relation, already locked in InitPlan */
 		foreach(l, stmt->nonleafResultRelations)
@@ -903,9 +912,10 @@
 			}
 
 			if (rc && RowMarkRequiresRowShareLock(rc->markType))
-				LockRelationOid(relid, RowShareLock);
+				gotlock = LockRelationOid(relid, RowShareLock);
 			else
-				LockRelationOid(relid, AccessShareLock);
+				gotlock = LockRelationOid(relid, AccessShareLock);
+			Assert(!gotlock || IsParallelWorker());
 		}
 	}
 }
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 7b2dcb6..a1ebf64 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -100,8 +100,9 @@ SetLocktagRelationOid(LOCKTAG *tag, Oid relid)
  *
  *		Lock a relation given only its OID.  This should generally be used
  *		before attempting to open the relation's relcache entry.
+ *		Return TRUE if we acquired a new lock, FALSE if already held.
  */
-void
+bool
 LockRelationOid(Oid relid, LOCKMODE lockmode)
 {
 	LOCKTAG		tag;
@@ -122,7 +123,11 @@
 	 * CommandCounterIncrement, not here.)
 	 */
 	if (res != LOCKACQUIRE_ALREADY_HELD)
+	{
 		AcceptInvalidationMessages();
+		return true;
+	}
+	return false;
 }
 
 /*
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index a217de9..69e6f7f 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -37,7 +37,7 @@ typedef enum XLTW_Oper
 extern void RelationInitLockInfo(Relation relation);
 
 /* Lock a relation */
-extern void LockRelationOid(Oid relid, LOCKMODE lockmode);
+extern bool LockRelationOid(Oid relid, LOCKMODE lockmode);
 extern bool ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode);
 extern void UnlockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern void UnlockRelationOid(Oid relid, LOCKMODE lockmode);
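[Two of the points above can be made concrete.  First, the RTE-order lock
acquisition that AcquireExecutorLocks already performs for cached plans
looks roughly like this -- a simplified sketch of the plancache.c logic,
not a verbatim excerpt:]

static void
AcquireExecutorLocksSketch(PlannedStmt *plannedstmt)
{
    Index       rt_index = 0;
    ListCell   *lc;

    foreach(lc, plannedstmt->rtable)    /* walk RTEs in range-table order */
    {
        RangeTblEntry *rte = (RangeTblEntry *) lfirst(lc);
        PlanRowMark *rc;
        LOCKMODE    lockmode;

        rt_index++;
        if (rte->rtekind != RTE_RELATION)
            continue;

        /* re-derive the lock strength the parse/plan pipeline used */
        if (list_member_int(plannedstmt->resultRelations, rt_index) ||
            list_member_int(plannedstmt->nonleafResultRelations, rt_index))
            lockmode = RowExclusiveLock;
        else if ((rc = get_plan_rowmark(plannedstmt->rowMarks, rt_index)) != NULL &&
                 RowMarkRequiresRowShareLock(rc->markType))
            lockmode = RowShareLock;
        else
            lockmode = AccessShareLock;

        LockRelationOid(rte->relid, lockmode);
    }
}

[Second, the proposed assertion-build-only test might be sketched like
this inside relation_open() in heapam.c.  CheckRelationOidLockedByMe() is
a hypothetical helper that would consult the backend's local lock table;
no such function exists at the time of writing:]

    /* Get the lock first, if we were asked to take one */
    if (lockmode != NoLock)
        LockRelationOid(relationId, lockmode);

    /*
     * NoLock callers promise that some lock is already held; verify that
     * promise in assert-enabled builds.  Bootstrap mode takes no locks.
     * CheckRelationOidLockedByMe() is an assumed helper.
     */
    Assert(lockmode != NoLock ||
           IsBootstrapProcessingMode() ||
           CheckRelationOidLockedByMe(relationId));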
On Sun, Jul 15, 2018 at 1:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> What we'd be better off doing, if we go this route, is to install an
> assertion-build-only test that verifies during relation_open(NoLock)
> that some kind of lock is already held on the rel.  That would protect
> not only the executor, but a boatload of existing places that open
> rels with NoLock on the currently-unverified assumption that a lock is
> already held.

+1.  In fact, maybe we ought to go a little further and have a
relation_reopen(oid, mode) that verifies that a lock in the specified
mode is held.

And then maybe we ought to go even further and start trying to get rid
of all the places where we reopen already-opened relations.  A
distressing number of new patches add more places that do that, and
while I try to push back on those, I think they are proliferating, and
I think that they are not free.  Granted, a hash table lookup is pretty
cheap, but if you do a sufficient number of them in commonly-taken code
paths, it's got to cost something.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
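[A minimal sketch of that relation_reopen() idea, leaning on the same
hypothetical lock-check helper as above -- here taking the minimum lock
mode to verify.  Names and signature are illustrative only:]

/*
 * Sketch: reopen a relation the caller has already opened and locked,
 * verifying (in assert-enabled builds) that the claimed lock is held,
 * then skipping the lock-manager round trip entirely.
 */
Relation
relation_reopen(Oid relid, LOCKMODE mode)
{
    /* hypothetical helper; see the sketch upthread */
    Assert(CheckRelationOidLockedByMe(relid, mode));
    return relation_open(relid, NoLock);
}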
On 16 July 2018 at 12:12, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Jul 15, 2018 at 1:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> What we'd be better off doing, if we go this route, is to install an
>> assertion-build-only test that verifies during relation_open(NoLock)
>> that some kind of lock is already held on the rel.  That would protect
>> not only the executor, but a boatload of existing places that open
>> rels with NoLock on the currently-unverified assumption that a lock is
>> already held.
>
> +1.  In fact, maybe we ought to go a little further and have a
> relation_reopen(oid, mode) that verifies that a lock in the specified
> mode is held.

Wouldn't it be better to just store the Relation indexed by its relid
somewhere the first time we opened it?  Then just do a direct array
lookup on that rather than looking up by hashtable in syscache?

--
David Rowley                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
David Rowley <david.rowley@2ndquadrant.com> writes:
> On 16 July 2018 at 12:12, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sun, Jul 15, 2018 at 1:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> What we'd be better off doing, if we go this route, is to install an
>>> assertion-build-only test that verifies during relation_open(NoLock)
>>> that some kind of lock is already held on the rel.

> Wouldn't it be better to just store the Relation indexed by its relid
> somewhere the first time we opened it?  Then just do a direct array
> lookup on that rather than looking up by hashtable in syscache?

That would require carrying said array clear through from the parser
to the executor, plus we'd have some fun keeping it in sync with the
RTE changes that happen in the rewriter and planner, plus it's not
entirely clear where we'd close those relations in cases where the
generated plan isn't fed to the executor immediately (or gets copied
in any way).  I don't think it's worth going that far.

I *do* think it might be worth behaving like that within the executor
by itself, and said so upthread.  In that situation, we have a clear
place to do relation closes during ExecutorEnd, so it's not as messy
as a more general approach would be.

			regards, tom lane
On Sun, Jul 15, 2018 at 8:24 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
>> +1.  In fact, maybe we ought to go a little further and have a
>> relation_reopen(oid, mode) that verifies that a lock in the specified
>> mode is held.
>
> Wouldn't it be better to just store the Relation indexed by its relid
> somewhere the first time we opened it?  Then just do a direct array
> lookup on that rather than looking up by hashtable in syscache?

Yes, that would be better, IMHO anyway.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2018/07/16 2:02, Tom Lane wrote:
> Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
>> On 2018/06/19 2:05, Tom Lane wrote:
>>> Or maybe what we should do is drop ExecLockNonLeafAppendTables/
>>> ExecOpenAppendPartitionedTables entirely and teach InitPlan to do it.
>
>> Hmm, for InitPlan to do what ExecOpenAppendPartitionedTables does, we'd
>> have to have all the partitioned tables (contained in partitioned_rels
>> fields of all Append/MergeAppend/ModifyTable nodes of a plan) also listed
>> in a global list like rootResultRelations and nonleafResultRelations of
>> PlannedStmt.
>
>> Attached updated patch.
>
> I've been looking at this patch, and while it's not unreasonable on its
> own terms, I'm growing increasingly distressed at the ad-hoc and rather
> duplicative nature of the data structures that have gotten stuck into
> plan trees thanks to partitioning (the rootResultRelations and
> nonleafResultRelations lists being prime examples).
>
> It struck me this morning that a whole lot of the complication here is
> solely due to needing to identify the right type of relation lock to take
> during executor startup, and that THAT WORK IS TOTALLY USELESS.

I agree.  IIUC, that's also the reason why PlannedStmt had to grow a
resultRelations field in the first place.

> In every
> case, we must already hold a suitable lock before we ever get to the
> executor; either one acquired during the parse/plan pipeline, or one
> re-acquired by AcquireExecutorLocks in the case of a cached plan.
> Otherwise it's entirely possible that the plan has been invalidated by
> concurrent DDL --- and it's not the executor's job to detect that and
> re-plan; that *must* have been done upstream.
>
> Moreover, it's important from a deadlock-avoidance standpoint that these
> locks get acquired in the same order as they were acquired during the
> initial parse/plan pipeline.  I think it's reasonable to assume they were
> acquired in RTE order in that stage, so AcquireExecutorLocks is OK.  But,
> if the current logic in the executor gets them in that order, it's both
> non-obvious that it does so and horribly fragile if it does, seeing that
> the responsibility for this is split across InitPlan,
> ExecOpenScanRelation, and ExecLockNonLeafAppendTables.
>
> So I'm thinking that what we really ought to do here is simplify executor
> startup to just open all rels with NoLock, and get rid of any supporting
> data structures that turn out to have no other use.  (David Rowley's
> nearby patch to create a properly hierarchical executor data structure for
> partitioning info is likely to tie into this too, by removing some other
> vestigial uses of those lists.)

I agree it would be nice to be able to get rid of those lists.

> I think that this idea has been discussed in the past, and we felt at
> the time that having the executor take its own locks was a good safety
> measure, and a relatively cheap one since the lock manager is pretty
> good at short-circuiting duplicative lock requests.  But those are
> certainly not free.  Moreover, I'm not sure that this is really a
> safety measure at all: if the executor were really taking any lock
> not already held, it'd be masking a DDL hazard.
>
> To investigate this further, I made the attached not-meant-for-commit
> hack to verify whether InitPlan and related executor startup functions
> were actually taking any not-previously-held locks.  I could only find
> one such case: the parser always opens tables selected FOR UPDATE with
> RowShareLock, but if we end up implementing the resulting row mark
> with ROW_MARK_COPY, the executor is acquiring just AccessShareLock
> (because ExecOpenScanRelation thinks it needs to acquire some lock).
> The patch as presented hacks ExecOpenScanRelation to avoid that, and
> it passes check-world.
>
> What we'd be better off doing, if we go this route, is to install an
> assertion-build-only test that verifies during relation_open(NoLock)
> that some kind of lock is already held on the rel.  That would protect
> not only the executor, but a boatload of existing places that open
> rels with NoLock on the currently-unverified assumption that a lock is
> already held.

+1

> I'm also rather strongly tempted to add a field to RangeTblEntry
> that records the appropriate lock strength to take, so that we don't
> have to rely on keeping AcquireExecutorLocks' logic to decide on the
> lock type in sync with whatever the parse/plan pipeline does.  (One
> could then imagine adding assertions in the executor that this field
> shows a lock strength of at least X, in place of actually opening
> the rel with X.)
>
> BTW, there'd be a lot to be said for having InitPlan just open all
> the rels and build an array of Relation pointers that parallels the
> RTE list, rather than doing heap_opens in random places elsewhere.

+1 to this.  Actually I had written such a patch in place of the one I
posted on this thread, but wasn't confident enough that I had found and
fixed all the places in the executor that would use the Relation pointer
from there instead of fetching it themselves.

Thanks,
Amit
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
> On 2018/07/16 2:02, Tom Lane wrote:
>> BTW, there'd be a lot to be said for having InitPlan just open all
>> the rels and build an array of Relation pointers that parallels the
>> RTE list, rather than doing heap_opens in random places elsewhere.

> +1 to this.  Actually I had written such a patch in place of the one I
> posted on this thread, but wasn't confident enough that I had found and
> fixed all the places in the executor that would use the Relation pointer
> from there instead of fetching it themselves.

Well, it doesn't have to be a 100% conversion, especially not to start
out with.  As long as we ensure that common code paths through the
executor have only one open/close per rel, I don't think it matters too
much if there are some corner cases with more.

Also, tracking down places like that wouldn't be that hard to do.  One
could instrument the relcache to report situations where a rel's
refcount gets above 1.  (The infrastructure for logging backtraces
that's being discussed in other threads would help here.)  We might
decide that particular cases weren't worth fixing --- eg if the extra
open is happening in code that doesn't have easy access to the EState
--- but I doubt that finding them is impractical.

			regards, tom lane
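[A sketch of that array-of-Relations idea.  The es_relations field, the
function name, and the lookup macro are assumptions for illustration, not
anything committed; ExecutorEnd would walk the same array to close the
entries, per the note upthread about having a clear place for closes:]

/*
 * Sketch: InitPlan opens each RTE_RELATION rel exactly once, with NoLock,
 * into an array indexed by RT index.
 */
static void
ExecInitRangeTableRelations(EState *estate)
{
    Index       rti = 0;
    ListCell   *lc;

    estate->es_relations = (Relation *)
        palloc0(list_length(estate->es_range_table) * sizeof(Relation));

    foreach(lc, estate->es_range_table)
    {
        RangeTblEntry *rte = (RangeTblEntry *) lfirst(lc);

        rti++;
        if (rte->rtekind == RTE_RELATION)
            estate->es_relations[rti - 1] = heap_open(rte->relid, NoLock);
    }
}

/* Subsequent lookups become a direct array fetch rather than a relcache
 * hash probe (macro name illustrative): */
#define ExecRelationByRTI(estate, rti) ((estate)->es_relations[(rti) - 1])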