Thread: why partition pruning doesn't work?

why partition pruning doesn't work?

From
Pavel Stehule
Date:
Hi

CREATE TABLE data(a text, vlozeno date) PARTITION BY RANGE(vlozeno);
CREATE TABLE data_2016 PARTITION OF data FOR VALUES FROM ('2016-01-01') TO ('2016-12-31');
CREATE TABLE data_2017 PARTITION OF data FOR VALUES FROM ('2017-01-01') TO ('2017-12-31');
CREATE TABLE data_other PARTITION OF DATA DEFAULT;

insert into data select 'ahoj', '2016-01-01'::date + (random() * 900)::int from generate_series(1,1000000);
analyze data;

postgres=# explain analyze select * from data where vlozeno > '2018-06-01';
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
╞══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Append (cost=0.00..3519.83 rows=20001 width=9) (actual time=0.042..27.750 rows=19428 loops=1) │
│ -> Seq Scan on data_other (cost=0.00..3419.83 rows=20001 width=9) (actual time=0.040..25.895 rows=19428 loops=1) │
│ Filter: (vlozeno > '2018-06-01'::date) │
│ Rows Removed by Filter: 171518 │
│ Planning Time: 0.766 ms │
│ Execution Time: 28.718 ms │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(6 rows)

postgres=# explain analyze select * from data where vlozeno > current_date;
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
╞═════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Gather (cost=1000.00..17281.36 rows=20080 width=9) (actual time=0.749..95.389 rows=19428 loops=1) │
│ Workers Planned: 2 │
│ Workers Launched: 2 │
│ -> Parallel Append (cost=0.00..14273.36 rows=8367 width=9) (actual time=59.141..89.458 rows=6476 loops=3) │
│ -> Parallel Seq Scan on data_2016 (cost=0.00..5768.69 rows=24 width=9) (actual time=34.847..34.847 rows=0 loops=3) │
│ Filter: (vlozeno > CURRENT_DATE) │
│ Rows Removed by Filter: 135119 │
│ -> Parallel Seq Scan on data_2017 (cost=0.00..5745.02 rows=23 width=9) (actual time=53.269..53.269 rows=0 loops=2) │
│ Filter: (vlozeno > CURRENT_DATE) │
│ Rows Removed by Filter: 201848 │
│ -> Parallel Seq Scan on data_other (cost=0.00..2717.82 rows=11765 width=9) (actual time=0.044..55.502 rows=19428 loops=1) │
│ Filter: (vlozeno > CURRENT_DATE) │
│ Rows Removed by Filter: 171518 │
│ Planning Time: 0.677 ms │
│ Execution Time: 98.349 ms │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(15 rows)

but

postgres=# explain analyze select * from data where vlozeno > (select current_date);
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                      QUERY PLAN                                                      │
╞══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ Append  (cost=0.01..19574.68 rows=333333 width=9) (actual time=0.095..31.945 rows=19428 loops=1)                     │
│   InitPlan 1 (returns $0)                                                                                            │
│     ->  Result  (cost=0.00..0.01 rows=1 width=4) (actual time=0.010..0.010 rows=1 loops=1)                           │
│   ->  Seq Scan on data_2016  (cost=0.00..7258.98 rows=135119 width=9) (never executed)                               │
│         Filter: (vlozeno > $0)                                                                                       │
│   ->  Seq Scan on data_2017  (cost=0.00..7229.20 rows=134565 width=9) (never executed)                               │
│         Filter: (vlozeno > $0)                                                                                       │
│   ->  Seq Scan on data_other  (cost=0.00..3419.83 rows=63649 width=9) (actual time=0.069..29.856 rows=19428 loops=1) │
│         Filter: (vlozeno > $0)                                                                                       │
│         Rows Removed by Filter: 171518                                                                               │
│ Planning Time: 0.418 ms                                                                                              │
│ Execution Time: 33.019 ms                                                                                            │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(12 rows)


Partition pruning is working now.

Is it expected? Tested on fresh master.


Regards

Pavel





Re: why partition pruning doesn't work?

From
Dmitry Dolgov
Date:
> On 1 June 2018 at 07:19, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> Partition pruning is working now.
>
> Is it expected? Tested on fresh master.

That's interesting. So there are two cases:

* vlozeno > (select current_date) (pruning works)

* vlozeno > current_date (pruning doesn't work)

In pull_partkey_params, where we need to extract Params matching the partition
key, everything is fine in the first case: we've got an expr of type Param. In
the second case we've got a SQLValueFunction, which is ignored by the code - so
eventually we think that there is nothing matching a partition key and that we
don't need to apply pruning.
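The failure mode being described can be modeled in a few lines of plain C. This is only a toy sketch (the node tags and the function merely imitate, and are not, PostgreSQL's pull_partkey_params code): a switch that recognizes only Param silently ignores a SQLValueFunction, so the code concludes that nothing matches the partition key and run-time pruning is skipped.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy node tags, loosely imitating PostgreSQL's (hypothetical model). */
typedef enum { T_Const, T_Param, T_SQLValueFunction } NodeTag;

typedef struct Expr { NodeTag tag; } Expr;

/*
 * Imitation of the pull_partkey_params() logic described above: decide
 * whether an expression compared with the partition key can drive
 * run-time pruning.  With handle_svf=false (the pre-patch behavior),
 * "vlozeno > current_date" falls through to the default case and no
 * pruning happens; "vlozeno > (select current_date)" arrives as a
 * Param and is handled.
 */
static bool matches_partkey(const Expr *expr, bool handle_svf)
{
    switch (expr->tag)
    {
        case T_Param:
            return true;
        case T_SQLValueFunction:
            return handle_svf;      /* what the hacky patch adds */
        default:
            return false;           /* unhandled node => no pruning */
    }
}
```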

With the attached hacky patch it would be taken into account (although I assume
in reality SQLValueFunction should be treated somehow differently) and pruning
is happening:

=# explain analyze select * from data where vlozeno > current_date;
                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..17223.38 rows=19512 width=9) (actual time=0.456..32.952 rows=19340 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Append  (cost=0.00..14272.18 rows=8130 width=9) (actual time=0.042..26.616 rows=6447 loops=3)
         ->  Parallel Seq Scan on data_2016  (cost=0.00..5771.19 rows=24 width=9) (never executed)
               Filter: (vlozeno > CURRENT_DATE)
         ->  Parallel Seq Scan on data_2017  (cost=0.00..5747.65 rows=23 width=9) (never executed)
               Filter: (vlozeno > CURRENT_DATE)
         ->  Parallel Seq Scan on data_other  (cost=0.00..2712.69 rows=11431 width=9) (actual time=0.032..26.031 rows=6447 loops=3)
               Filter: (vlozeno > CURRENT_DATE)
               Rows Removed by Filter: 57084
 Planning Time: 1.256 ms
 Execution Time: 35.327 ms
(13 rows)

Time: 40.291 ms

Attachment

Re: why partition pruning doesn't work?

From
Ashutosh Bapat
Date:
On Fri, Jun 1, 2018 at 9:47 AM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>> On 1 June 2018 at 07:19, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>>
>> Partition pruning is working now.
>>
>> Is it expected? Tested on fresh master.
>
> That's interesting. So there are two cases:
>
> * vlozeno > (select current_date) (pruning works)
>
> * vlozeno > current_date (pruning doesn't work)
>
> In pull_partkey_params, where we need to extract Params matching the partition
> key, everything is fine in the first case: we've got an expr of type Param. In
> the second case we've got a SQLValueFunction, which is ignored by the code - so
> eventually we think that there is nothing matching a partition key and that we
> don't need to apply pruning.
>
> With the attached hacky patch it would be taken into account (although I assume
> in reality SQLValueFunction should be treated somehow differently) and pruning
> is happening:

I think the patch is right if we were to handle only SQLValueFunction,
but the bigger picture here is that we aren't evaluating stable
functions before run-time partition pruning happens. I was under the
impression that the stable functions/expressions get evaluated and
folded into a constant just before the execution begins since a stable
function produces the same output for same input during one execution
invocation. But I am not able to find where we do that and probably we
don't do that at all. If we could do that, then it's a matter of using the
same methods as plan-time partition pruning to prune the partitions.

If we go ahead with this patch, we should at least update it to handle
stable functions for the sake of completeness.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: why partition pruning doesn't work?

From
Tom Lane
Date:
Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> writes:
> I think the patch is right if we were to handle only SQLValueFunction,
> but the bigger picture here is that we aren't evaluating stable
> functions before run-time partition pruning happens. I was under the
> impression that the stable functions/expressions get evaluated and
> folded into a constant just before the execution begins since a stable
> function produces the same output for same input during one execution
> invocation. But I am not able to find where we do that and probably we
> don't do that at all.

We don't; there was a patch floating around to make that happen, but
it hasn't been updated lately.

I agree though that it seems strange to special-case SQLValueFunction
rather than any-stable-expression.  As long as the evaluation happens
at executor start (i.e. with the query's run-time snapshot) it should
be reasonable to simplify any stable expression.

It's worth questioning whether this is a bug fix or an improvement.
If the latter, it probably ought to wait for v12.

            regards, tom lane


Re: why partition pruning doesn't work?

From
Pavel Stehule
Date:


2018-06-01 17:53 GMT+02:00 Tom Lane <tgl@sss.pgh.pa.us>:
Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> writes:
> I think the patch is right if we were to handle only SQLValueFunction,
> but the bigger picture here is that we aren't evaluating stable
> functions before run-time partition pruning happens. I was under the
> impression that the stable functions/expressions get evaluated and
> folded into a constant just before the execution begins since a stable
> function produces the same output for same input during one execution
> invocation. But I am not able to find where we do that and probably we
> don't do that at all.

We don't; there was a patch floating around to make that happen, but
it hasn't been updated lately.

I agree though that it seems strange to special-case SQLValueFunction
rather than any-stable-expression.  As long as the evaluation happens
at executor start (i.e. with the query's run-time snapshot) it should
be reasonable to simplify any stable expression.

It's worth questioning whether this is a bug fix or an improvement.
If the latter, it probably ought to wait for v12.

The result is correct, but it was an unpleasant surprise. I tried the simplest possible demo of this feature, and it didn't work. Filtering based on CURRENT_DATE is common.

Regards

Pavel


                        regards, tom lane

Re: why partition pruning doesn't work?

From
Jeff Janes
Date:
On Fri, Jun 1, 2018 at 11:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

I agree though that it seems strange to special-case SQLValueFunction
rather than any-stable-expression.  As long as the evaluation happens
at executor start (i.e. with the query's run-time snapshot) it should
be reasonable to simplify any stable expression.

It's worth questioning whether this is a bug fix or an improvement.
If the latter, it probably ought to wait for v12.

If explaining the change requires reference to tokens from the source code, rather than something an end user could understand, I'd argue it is a bug fix rather than an improvement.

Cheers,

Jeff

Re: why partition pruning doesn't work?

From
Ashutosh Bapat
Date:
On Sat, Jun 2, 2018 at 5:16 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Fri, Jun 1, 2018 at 11:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>>
>> I agree though that it seems strange to special-case SQLValueFunction
>> rather than any-stable-expression.  As long as the evaluation happens
>> at executor start (i.e. with the query's run-time snapshot) it should
>> be reasonable to simplify any stable expression.
>>
>> It's worth questioning whether this is a bug fix or an improvement.
>> If the latter, it probably ought to wait for v12.
>
>
> If explaining the change requires reference to tokens from the source code,
> rather than something an end user could understand, I'd argue it is a bug
> fix rather than an improvement.

If we are going to implement stable expression folding before the actual
execution starts, that's a feature in itself, so it's V12 material.
Partition pruning would then use that feature. I don't think we should make
partition pruning work with stable expressions in some ad-hoc way in
V11 and then have some future release (most likely V12) implement it on
top of the stable expression folding feature. So my vote is for making it
work in V12.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: why partition pruning doesn't work?

From
Tom Lane
Date:
Jeff Janes <jeff.janes@gmail.com> writes:
> On Fri, Jun 1, 2018 at 11:53 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It's worth questioning whether this is a bug fix or an improvement.
>> If the latter, it probably ought to wait for v12.

> If explaining the change requires reference to tokens from the source code,
> rather than something an end user could understand, I'd argue it is a bug
> fix rather than an improvement.

Well, the difference between volatile, stable and immutable functions is
well-documented, so I don't think that's a great argument.  If there's
some aspect of this behavior that's not predictable from understanding
which class the partition key expression falls into, then I could agree
that's a bug.

            regards, tom lane


Re: why partition pruning doesn't work?

From
Dmitry Dolgov
Date:
> On 1 June 2018 at 17:53, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> writes:
>> I think the patch is right if we were to handle only SQLValueFunction,
>> but the bigger picture here is that we aren't evaluating stable
>> functions before run-time partition pruning happens.
>
> I agree though that it seems strange to special-case SQLValueFunction
> rather than any-stable-expression.  As long as the evaluation happens
> at executor start (i.e. with the query's run-time snapshot) it should
> be reasonable to simplify any stable expression.

Just to clarify for myself, for evaluating any stable function here would it be
enough to handle all function-like expressions (FuncExpr / OpExpr /
DistinctExpr / NullIfExpr) and check a corresponding function for provolatile,
like in the attached patch?

Attachment

Re: why partition pruning doesn't work?

From
Tom Lane
Date:
Dmitry Dolgov <9erthalion6@gmail.com> writes:
> Just to clarify for myself, for evaluating any stable function here would it be
> enough to handle all function-like expressions (FuncExpr / OpExpr /
> DistinctExpr / NullIfExpr) and check a corresponding function for provolatile,
> like in the attached patch?

I think the entire approach is wrong here.  Rather than concerning
yourself with Params, or any other specific expression type, you
should be using !contain_volatile_functions() to decide whether
an expression is run-time-constant.  If it is, use the regular
expression evaluation machinery to extract the value.

            regards, tom lane
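Tom's suggestion can be sketched in miniature. The toy C below only imitates contain_volatile_functions() and the executor's expression evaluation on a hand-rolled expression tree (none of these names are real PostgreSQL APIs): first prove the tree contains nothing volatile, and only then evaluate it once to a run-time constant.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy expression tree (hypothetical; stands in for PostgreSQL's Expr). */
typedef enum { E_CONST, E_STABLE_FN, E_VOLATILE_FN, E_PLUS } ExprKind;

typedef struct Expr
{
    ExprKind     kind;
    int          value;           /* E_CONST */
    int        (*fn)(void);       /* E_STABLE_FN / E_VOLATILE_FN */
    struct Expr *left, *right;    /* E_PLUS */
} Expr;

/* Imitates contain_volatile_functions(): anything volatile anywhere
 * in the tree means the result may change from row to row. */
static bool contains_volatile(const Expr *e)
{
    if (e->kind == E_VOLATILE_FN)
        return true;
    if (e->kind == E_PLUS)
        return contains_volatile(e->left) || contains_volatile(e->right);
    return false;
}

/* Imitates "the regular expression evaluation machinery". */
static int eval_expr(const Expr *e)
{
    switch (e->kind)
    {
        case E_CONST:       return e->value;
        case E_STABLE_FN:
        case E_VOLATILE_FN: return e->fn();
        case E_PLUS:        return eval_expr(e->left) + eval_expr(e->right);
    }
    return 0;
}

/* Like CURRENT_DATE: stable, i.e. fixed for one execution. */
static int stable_today(void) { return 20180601; }

/* Fold a run-time-constant expression to a single value; refuse
 * (i.e. no pruning) if anything volatile is present. */
static bool fold_runtime_constant(const Expr *e, int *out)
{
    if (contains_volatile(e))
        return false;
    *out = eval_expr(e);
    return true;
}
```

The same test decides "run-time-constant or not" for any expression shape, which is exactly why it generalizes better than special-casing individual node types such as Param or SQLValueFunction.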


Re: why partition pruning doesn't work?

From
Dmitry Dolgov
Date:
> On 3 June 2018 at 19:11, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Dmitry Dolgov <9erthalion6@gmail.com> writes:
>> Just to clarify for myself, for evaluating any stable function here would it be
>> enough to handle all function-like expressions (FuncExpr / OpExpr /
>> DistinctExpr / NullIfExpr) and check a corresponding function for provolatile,
>> like in the attached patch?
>
> I think the entire approach is wrong here.  Rather than concerning
> yourself with Params, or any other specific expression type, you
> should be using !contain_volatile_functions() to decide whether
> an expression is run-time-constant.  If it is, use the regular
> expression evaluation machinery to extract the value.

Yes, it makes sense. Then, to my understanding, the attached code is close to
what was described above (although I'm not sure about the Const part).

Attachment

Re: why partition pruning doesn't work?

From
Amit Langote
Date:
Hi Dmitry,

Thanks for creating the patch.  I looked at it and have some comments.

On 2018/06/04 22:30, Dmitry Dolgov wrote:
>> On 3 June 2018 at 19:11, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Dmitry Dolgov <9erthalion6@gmail.com> writes:
>>> Just to clarify for myself, for evaluating any stable function here would it be
>>> enough to handle all function-like expressions (FuncExpr / OpExpr /
>>> DistinctExpr / NullIfExpr) and check a corresponding function for provolatile,
>>> like in the attached patch?
>>
>> I think the entire approach is wrong here.  Rather than concerning
>> yourself with Params, or any other specific expression type, you
>> should be using !contain_volatile_functions() to decide whether
>> an expression is run-time-constant.  If it is, use the regular
>> expression evaluation machinery to extract the value.
> 
> Yes, it makes sense. Then, to my understanding, the attached code is close to
> what was described above (although I'm not sure about the Const part).

This:

@@ -2727,6 +2730,13 @@ pull_partkey_params(PartitionPruneInfo *pinfo, List *steps)
                 }
                 gotone = true;
             }
+            else if (!IsA(expr, Const))
+            {
+                Param   *param = (Param *) expr;
+                pinfo->execparams = bms_add_member(pinfo->execparams,
+                                                   param->paramid);
+                gotone = true;
+            }

doesn't look quite right.  What says expr is really a Param?  The patch
appears to work because, by setting pinfo->execparams to *something*, it
triggers execution-time pruning to run; its contents aren't necessarily
used during execution pruning.  In fact, it would've crashed if the
execution-time pruning code had required execparams to contain *valid*
param id, but currently it doesn't.

What I think we'd need to do to make this work is to make execution-time
pruning be invoked even if there aren't any Params involved.  IOW, let's
try to teach make_partition_pruneinfo that it can go ahead also in the
cases where there are expressions being compared with the partition key
that contain (only) stable functions.  Then, go and fix the
execution-pruning code to not *always* expect there to be Params to prune
with.

Maybe, David (added to cc) has something to say about all this, especially
whether he considers this a feature and not a bug fix.

Thanks,
Amit



Re: why partition pruning doesn't work?

From
Dmitry Dolgov
Date:
> On 5 June 2018 at 12:31, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>
> doesn't look quite right.  What says expr is really a Param?  The patch
> appears to work because, by setting pinfo->execparams to *something*, it
> triggers execution-time pruning to run; its contents aren't necessarily
> used during execution pruning.  In fact, it would've crashed if the
> execution-time pruning code had required execparams to contain *valid*
> param id, but currently it doesn't.
>
> What I think we'd need to do to make this work is to make execution-time
> pruning be invoked even if there aren't any Params involved.  IOW, let's
> try to teach make_partition_pruneinfo that it can go ahead also in the
> cases where there are expressions being compared with the partition key
> that contain (only) stable functions.  Then, go and fix the
> execution-pruning code to not *always* expect there to be Params to prune
> with.

Yeah, I agree - I copied this approach mindlessly from the original hacky
patch. So, looks like it's necessary to have something like got_stable_expr
together with gotparam. And after that the only place where I see Params
are in use is partkey_datum_from_expr where all the stuff is actually
evaluated. So apparently this part about "fix the execution-pruning code to not
*always* expect there to be Params to prune with" will be only about this
function - am I correct, or is there something else that I missed?


Re: why partition pruning doesn't work?

From
Ashutosh Bapat
Date:
On Tue, Jun 5, 2018 at 6:24 PM, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>> On 5 June 2018 at 12:31, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>
>> doesn't look quite right.  What says expr is really a Param?  The patch
>> appears to work because, by setting pinfo->execparams to *something*, it
>> triggers execution-time pruning to run; its contents aren't necessarily
>> used during execution pruning.  In fact, it would've crashed if the
>> execution-time pruning code had required execparams to contain *valid*
>> param id, but currently it doesn't.
>>
>> What I think we'd need to do to make this work is to make execution-time
>> pruning be invoked even if there aren't any Params involved.  IOW, let's
>> try to teach make_partition_pruneinfo that it can go ahead also in the
>> cases where there are expressions being compared with the partition key
>> that contain (only) stable functions.  Then, go and fix the
>> execution-pruning code to not *always* expect there to be Params to prune
>> with.
>
> Yeah, I agree - I copied this approach mindlessly from the original hacky
> patch. So, looks like it's necessary to have something like got_stable_expr
> together with gotparam.

I think the current code relies heavily on Params being present for
partition pruning, but that assumption doesn't hold. Runtime partition
pruning is possible when there are comparison conditions with partition key
expressions on one side and "execution time constant" expressions on
the other side. By "execution time constant" expression, I mean any
expression that evaluates to a constant at the time of execution, like
a stable expression (not just a function call) or a Param. I can
think of only these two at this time, but there can be more. So
gotparam should be renamed to something generic like "gotprunable_cond",
and pull_partkey_params() should be renamed to "pull_partkey_conds" or
something similarly generic. That function would return true if there
exists an expression in the steps which can be evaluated to a constant at
runtime, and false otherwise. My guess is there will be false positives
which need to be dealt with later, but there will be no
false negatives.

> And after that the only place where I see Params
> are in use is partkey_datum_from_expr where all the stuff is actually
> evaluated. So apparently this part about "fix the execution-pruning code to not
> *always* expect there to be Params to prune with" will be only about this
> function - am I correct, or is there something else that I missed?

Yes. But I think trying to evaluate parameters in this function is not
good. The approach of folding constant expressions before or
immediately after the execution starts doesn't require the expressions
to be evaluated in partkey_datum_from_expr and might benefit other
places where stable expressions or params can appear.

Another problem with partkey_datum_from_expr() seems to be that it
evaluates only Param nodes, but not expressions involving parameters
which can be folded into constants at runtime. Take for example the
following queries on table t1 with two partitions (0, 100) and
(100, 200), populated using "insert into t1 select i, i from
generate_series(0, 199) i;". There's an index on t1(a).

explain analyze select * from t1 x left join t1 y on x.a = y.b where y.a = 5;
                                                 QUERY PLAN
------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..6.78 rows=1 width=16) (actual time=0.033..0.066 rows=1 loops=1)
   ->  Append  (cost=0.00..2.25 rows=1 width=8) (actual time=0.019..0.035 rows=1 loops=1)
         ->  Seq Scan on t1p1 y  (cost=0.00..2.25 rows=1 width=8) (actual time=0.018..0.035 rows=1 loops=1)
               Filter: (a = 5)
               Rows Removed by Filter: 99
   ->  Append  (cost=0.00..4.51 rows=2 width=8) (actual time=0.011..0.027 rows=1 loops=1)
         ->  Seq Scan on t1p1 x  (cost=0.00..2.25 rows=1 width=8) (actual time=0.006..0.022 rows=1 loops=1)
               Filter: (y.b = a)
               Rows Removed by Filter: 99
         ->  Seq Scan on t1p2 x_1  (cost=0.00..2.25 rows=1 width=8) (never executed)
               Filter: (y.b = a)
 Planning Time: 0.644 ms
 Execution Time: 0.115 ms
(13 rows)

t1p2 x_1 is never scanned, indicating that run-time partition pruning
happened. But then see the following query:

postgres:17889=# explain analyze select * from t1 x left join t1 y on x.a = y.b + 100 where y.a = 5;
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..7.28 rows=1 width=16) (actual time=0.055..0.093 rows=1 loops=1)
   ->  Append  (cost=0.00..2.25 rows=1 width=8) (actual time=0.017..0.034 rows=1 loops=1)
         ->  Seq Scan on t1p1 y  (cost=0.00..2.25 rows=1 width=8) (actual time=0.016..0.033 rows=1 loops=1)
               Filter: (a = 5)
               Rows Removed by Filter: 99
   ->  Append  (cost=0.00..5.01 rows=2 width=8) (actual time=0.034..0.054 rows=1 loops=1)
         ->  Seq Scan on t1p1 x  (cost=0.00..2.50 rows=1 width=8) (actual time=0.026..0.026 rows=0 loops=1)
               Filter: ((y.b + 100) = a)
               Rows Removed by Filter: 100
         ->  Seq Scan on t1p2 x_1  (cost=0.00..2.50 rows=1 width=8) (actual time=0.007..0.027 rows=1 loops=1)
               Filter: ((y.b + 100) = a)
               Rows Removed by Filter: 99
 Planning Time: 0.424 ms
 Execution Time: 0.139 ms
(14 rows)

The scan on t1p1 x returns no rows and should have been pruned since
y.b + 100 is constant for a given y.b.

But for this to work, folding constant expressions doesn't help, since
y.b changes with every rescan of t1 x. So maybe we need some way to
constant-fold expressions during ExecutorRewind() as well.

This is a digression from the original report, but it's still within the
scope of "why partition pruning doesn't work?"
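The rescan case can be modeled the same way. In the toy C sketch below (illustrative only, not backend code; the partition bounds follow the t1 example above), an exec Param stands in for y.b: y.b + 100 is not constant for the whole query, but it is constant for each individual rescan, so folding it once per rescan would let every rescan prune to a single partition.

```c
#include <assert.h>

/* Toy exec Param: its value (here, y.b from the outer row) is fixed
 * for the life of one rescan of the inner side of the nestloop. */
typedef struct { int value; } ExecParam;

/* "y.b + 100": foldable to a constant once per rescan. */
static int fold_on_rescan(const ExecParam *yb)
{
    return yb->value + 100;
}

/* Toy pruning over the example's partitions t1p1 = [0, 100) and
 * t1p2 = [100, 200): index of the only partition where a = key can hold. */
static int prune_to_partition(int key)
{
    return (key < 100) ? 0 : 1;
}
```

For the y.a = 5 row (so y.b = 5), folding yields 105, which prunes the inner scan to t1p2 alone; a later rescan with a different y.b would fold to a different constant and may prune to a different partition, which is why one-time folding at executor start is not enough here.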

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: why partition pruning doesn't work?

From
David Rowley
Date:
On 5 June 2018 at 22:31, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Maybe, David (added to cc) has something to say about all this, especially
> whether he considers this a feature and not a bug fix.

Thanks, Amit. I had missed this thread.

Yeah. I admit if I'd thought about this case when I wrote the code,
then I'd have made any non-volatile Expr work, but I didn't :-(

It was pointed out to me a few months ago in a comment in [1]. I
initially thought that this was v12 material, but it seems there are a
few people here that are pretty unhappy about it.

I was going to describe what such a patch should look like here, but
that seemed like about as much work as writing it, so:

Please see the attached patch. I've only just finished with it and
it's not fully done yet as there's still an XXX comment where I've not
quite thought about Exprs with Vars from higher levels. These might
always be converted to Params, so the code might be okay as is, but
I've not checked this yet, hence the comment remains.

I'm slightly leaning towards this being included in v11. Without this
people are forced into hacks like WHERE partkey = (SELECT
stablefunc()); to get pruning working at all. If that SQL remains
after this patch then pruning can only take place during actual
execution. With the attached patch the pruning can take place during
the initialization of the executor, which in cases with many
partitions can be significantly faster, providing actual execution is
short.  I'd rather people didn't get into bad habits like that if we
can avoid it.

[1] https://blog.2ndquadrant.com/partition-elimination-postgresql-11/

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: why partition pruning doesn't work?

From
Pavel Stehule
Date:


2018-06-05 17:07 GMT+02:00 David Rowley <david.rowley@2ndquadrant.com>:
On 5 June 2018 at 22:31, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Maybe, David (added to cc) has something to say about all this, especially
> whether he considers this a feature and not a bug fix.

Thanks, Amit. I had missed this thread.

Yeah. I admit if I'd thought about this case when I wrote the code,
then I'd have made any non-volatile Expr work, but I didn't :-(

It was pointed out to me a few months ago in a comment in [1]. I
initially thought that this was v12 material, but it seems there are a
few people here that are pretty unhappy about it.

I was going to describe what such a patch should look like here, but
that seemed like about as much work as writing it, so:

Please see the attached patch. I've only just finished with it and
it's not fully done yet as there's still an XXX comment where I've not
quite thought about Exprs with Vars from higher levels. These might
always be converted to Params, so the code might be okay as is, but
I've not checked this yet, hence the comment remains.

I'm slightly leaning towards this being included in v11. Without this
people are forced into hacks like WHERE partkey = (SELECT
stablefunc()); to get pruning working at all. If that SQL remains
after this patch then pruning can only take place during actual
execution. With the attached patch the pruning can take place during
the initialization of the executor, which in cases with many
partitions can be significantly faster, providing actual execution is
short.  I'd rather people didn't get into bad habits like that if we
can avoid it.

This is really great.

Regards

Pavel


[1] https://blog.2ndquadrant.com/partition-elimination-postgresql-11/

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: why partition pruning doesn't work?

From
David Rowley
Date:
On 6 June 2018 at 03:07, David Rowley <david.rowley@2ndquadrant.com> wrote:
> Please see the attached patch. I've only just finished with it and
> it's not fully done yet as there's still an XXX comment where I've not
> quite thought about Exprs with Vars from higher levels. These might
> always be converted to Params, so the code might be okay as is, but
> I've not checked this yet, hence the comment remains.

I looked at this again today and decided that the XXX comment could
just be removed. I also changed contains_only_safeparams into
contains_unsafeparams and reversed the condition. I then decided that
I didn't like the way we need to check which params are in the Expr
each time we call partkey_datum_from_expr. It seems better to prepare
this in advance when building the pruning steps. I started work on
that, but soon realised that I'd need to pass a List of Bitmapsets to
the executor. This is a problem as Bitmapset is not a Node type and
cannot be copied with COPY_NODE_FIELD(). Probably this could be
refactored to instead of passing 3 Lists in the PartitionPruneStepOp
we could invent a new node type that just has 3 fields and store a
single List.

Anyway, I didn't do that because I'm not sure what the fate of this
patch is going to be. I'd offer to change things around to add a new
Node type, but I don't really want to work on that now if this is v12
material.

I've attached a cleaned up version of the patch I posted yesterday.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: why partition pruning doesn't work?

From
Amit Langote
Date:
On 2018/06/06 14:10, David Rowley wrote:
> I then decided that
> I didn't like the way we need to check which params are in the Expr
> each time we call partkey_datum_from_expr. It seems better to prepare
> this in advance when building the pruning steps. I started work on
> that, but soon realised that I'd need to pass a List of Bitmapsets to
> the executor. This is a problem as Bitmapset is not a Node type and
> cannot be copied with COPY_NODE_FIELD(). Probably this could be
> refactored to instead of passing 3 Lists in the PartitionPruneStepOp
> we could invent a new node type that just has 3 fields and store a
> single List.

I wonder why we need to create those Bitmapsets in the planner?  Why not
in ExecSetupPartitionPruneState()?  For example, like how
context->exprstates is initialized.

Thanks,
Amit



Re: why partition pruning doesn't work?

From
David Rowley
Date:
On 6 June 2018 at 18:05, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/06/06 14:10, David Rowley wrote:
>> I then decided that
>> I didn't like the way we need to check which params are in the Expr
>> each time we call partkey_datum_from_expr. It seems better to prepare
>> this in advance when building the pruning steps. I started work on
>> that, but soon realised that I'd need to pass a List of Bitmapsets to
>> the executor. This is a problem as Bitmapset is not a Node type and
>> cannot be copied with COPY_NODE_FIELD(). Probably this could be
>> refactored to instead of passing 3 Lists in the PartitionPruneStepOp
>> we could invent a new node type that just has 3 fields and store a
>> single List.
>
> I wonder why we need to create those Bitmapsets in the planner?  Why not
> in ExecSetupPartitionPruneState()?  For example, like how
> context->exprstates is initialized.

That seems like a good idea.  Certainly much better than working them
out each time we prune.

v3 patch attached.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: why partition pruning doesn't work?

From
Amit Langote
Date:
On 2018/06/06 18:52, David Rowley wrote:
> On 6 June 2018 at 18:05, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> On 2018/06/06 14:10, David Rowley wrote:
>>> I then decided that
>>> I didn't like the way we need to check which params are in the Expr
>>> each time we call partkey_datum_from_expr. It seems better to prepare
>>> this in advance when building the pruning steps. I started work on
>>> that, but soon realised that I'd need to pass a List of Bitmapsets to
>>> the executor. This is a problem as Bitmapset is not a Node type and
>>> cannot be copied with COPY_NODE_FIELD(). Probably this could be
>>> refactored to instead of passing 3 Lists in the PartitionPruneStepOp
>>> we could invent a new node type that just has 3 fields and store a
>>> single List.
>>
>> I wonder why we need to create those Bitmapsets in the planner?  Why not
>> in ExecSetupPartitionPruneState()?  For example, like how
>> context->exprstates is initialized.
> 
> That seems like a good idea.  Certainly much better than working them
> out each time we prune.
> 
> v3 patch attached.

Thanks David.  This one looks good.  I also like it that hasparamlessexprs
is no longer determined and set in the planner.

I checked what happens with the cases that Ashutosh complained about
upthread and seems that the pruning works as expected.

create table t1 (a int, b int) partition by range (a);
create table t1p1 partition of t1 for values from (0) to (100);
create table t1p2 partition of t1 for values from (100) to (200);
create index on t1 (a);
insert into t1 select i, i from generate_series(0, 199) i;

explain (costs off, analyze) select * from t1 x left join t1 y on x.a = y.b + 100 where y.a = 5;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Nested Loop (actual time=0.294..0.371 rows=1 loops=1)
   ->  Append (actual time=0.067..0.092 rows=1 loops=1)
         ->  Bitmap Heap Scan on t1p1 y (actual time=0.049..0.059 rows=1 loops=1)
               Recheck Cond: (a = 5)
               Heap Blocks: exact=1
               ->  Bitmap Index Scan on t1p1_a_idx (actual time=0.022..0.022 rows=1 loops=1)
                     Index Cond: (a = 5)
   ->  Append (actual time=0.192..0.219 rows=1 loops=1)
         ->  Index Scan using t1p1_a_idx on t1p1 x (never executed)
               Index Cond: (a = (y.b + 100))
         ->  Index Scan using t1p2_a_idx on t1p2 x_1 (actual time=0.134..0.145 rows=1 loops=1)
               Index Cond: (a = (y.b + 100))
 Planning Time: 5.314 ms
 Execution Time: 0.938 ms
(14 rows)

Note that the condition x.a = y.b + 100 is able to prune t1p1, whereas on
HEAD it isn't.

Thanks,
Amit



Re: why partition pruning doesn't work?

From
David Rowley
Date:
On 7 June 2018 at 14:51, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> Thanks David.  This one looks good.  I also like it that hasparamlessexprs
> is no longer determined and set in the planner.

Thanks for checking it.

> I checked what happens with the cases that Ashutosh complained about
> upthread and seems that the pruning works as expected.

[...]

> explain (costs off, analyze) select * from t1 x left join t1 y on x.a = y.b + 100 where y.a = 5;

Yeah, I added a test to partition_prune.sql that verifies a similar case.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: why partition pruning doesn't work?

From
Ashutosh Bapat
Date:
On Thu, Jun 7, 2018 at 8:51 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> On 7 June 2018 at 14:51, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> Thanks David.  This one looks good.  I also like it that hasparamlessexprs
>> is no longer determined and set in the planner.
>
> Thanks for checking it.
>
>> I checked what happens with the cases that Ashutosh complained about
>> upthread and seems that the pruning works as expected.
>
> [...]
>
>> explain (costs off, analyze) select * from t1 x left join t1 y on x.a = y.b + 100 where y.a = 5;
>
> Yeah, I added a test to partition_prune.sql that verifies a similar case.

Thanks for taking care of that.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: why partition pruning doesn't work?

From
Tom Lane
Date:
David Rowley <david.rowley@2ndquadrant.com> writes:
> On 6 June 2018 at 18:05, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I wonder why we need to create those Bitmapsets in the planner?  Why not
>> in ExecSetupPartitionPruneState()?  For example, like how
>> context->exprstates is initialized.

> That seems like a good idea.  Certainly much better than working them
> out each time we prune.

> v3 patch attached.

Maybe there's something I'm missing here, but I sort of hoped that this
patch would nuke all the special-case code for Params in this area.
Why is there any need to distinguish them from other stable expressions?

IOW, I was hoping for the code to end up simpler, not more complicated.

            regards, tom lane


Re: why partition pruning doesn't work?

From
David Rowley
Date:
On 8 June 2018 at 03:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Maybe there's something I'm missing here, but I sort of hoped that this
> patch would nuke all the special-case code for Params in this area.
> Why is there any need to distinguish them from other stable expressions?
>
> IOW, I was hoping for the code to end up simpler, not more complicated.

We need to know which Params exist in the Expr as if there are no
Params, or only external Params, then we can run-time prune during
startup of the executor. Otherwise, we must leave the pruning until
during execution.

I really don't want to say goodbye to that optimisation as it's a
significant win to save having to initialise the subnodes for all the
useless partitions for OLTP type queries.
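To illustrate the distinction (a sketch against Pavel's `data` table from the start of the thread; the exact EXPLAIN output depends on version and plan choice):

```sql
-- External Param: its value is known when the executor starts up, so a
-- generic plan can discard useless subplans before initialising them
-- (shown as "Subplans Removed: N" in EXPLAIN on PostgreSQL 11+):
PREPARE q(date) AS SELECT * FROM data WHERE vlozeno > $1;
EXPLAIN (ANALYZE) EXECUTE q('2018-06-01');

-- An exec Param (e.g. the outer value of a parameterized nested loop) is
-- only available during execution, so pruning must wait until then; pruned
-- subplans instead appear as "(never executed)".
```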

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: why partition pruning doesn't work?

From
Tom Lane
Date:
David Rowley <david.rowley@2ndquadrant.com> writes:
> On 8 June 2018 at 03:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Maybe there's something I'm missing here, but I sort of hoped that this
>> patch would nuke all the special-case code for Params in this area.
>> Why is there any need to distinguish them from other stable expressions?

> We need to know which Params exist in the Expr as if there are no
> Params, or only external Params, then we can run-time prune during
> startup of the executor.

This does not refute my question.  Why doesn't the same logic apply
to any stable expression?  That is, ISTM a Param is a special case
of that.

            regards, tom lane


Re: why partition pruning doesn't work?

From
Ashutosh Bapat
Date:
On Fri, Jun 8, 2018 at 8:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> David Rowley <david.rowley@2ndquadrant.com> writes:
>> On 8 June 2018 at 03:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Maybe there's something I'm missing here, but I sort of hoped that this
>>> patch would nuke all the special-case code for Params in this area.
>>> Why is there any need to distinguish them from other stable expressions?
>
>> We need to know which Params exist in the Expr as if there are no
>> Params, or only external Params, then we can run-time prune during
>> startup of the executor.
>
> This does not refute my question.  Why doesn't the same logic apply
> to any stable expression?  That is, ISTM a Param is a special case
> of that.
>

+1.

I don't think we need to perform pruning at the start of execution,
but we could fold all the stable expressions to constants at that
time. The PARAM_EXECs can not be folded into constants at execution
start since those are not assigned any values yet. AFAIU, expressions
within a given node with those parameters can be folded into
constants (if possible) during ExecRewind() on that node. We have to
perform pruning just before the (Merge)Append node scan starts.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: why partition pruning doesn't work?

From
David Rowley
Date:
On 8 June 2018 at 15:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> David Rowley <david.rowley@2ndquadrant.com> writes:
>> On 8 June 2018 at 03:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Maybe there's something I'm missing here, but I sort of hoped that this
>>> patch would nuke all the special-case code for Params in this area.
>>> Why is there any need to distinguish them from other stable expressions?
>
>> We need to know which Params exist in the Expr as if there are no
>> Params, or only external Params, then we can run-time prune during
>> startup of the executor.
>
> This does not refute my question.  Why doesn't the same logic apply
> to any stable expression?  That is, ISTM a Param is a special case
> of that.

Okay, maybe we don't need to know which external params exist, but we
need to know if there are any exec params so that we don't try to
evaluate an expression with any of those during executor startup.

I'll produce a patch which simplifies things in that area.

Thanks for looking at this.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: why partition pruning doesn't work?

From
David Rowley
Date:
On 8 June 2018 at 18:14, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 8 June 2018 at 15:22, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> David Rowley <david.rowley@2ndquadrant.com> writes:
>>> On 8 June 2018 at 03:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>>> Maybe there's something I'm missing here, but I sort of hoped that this
>>>> patch would nuke all the special-case code for Params in this area.
>>>> Why is there any need to distinguish them from other stable expressions?
>>
>>> We need to know which Params exist in the Expr as if there are no
>>> Params, or only external Params, then we can run-time prune during
>>> startup of the executor.
>>
>> This does not refute my question.  Why doesn't the same logic apply
>> to any stable expression?  That is, ISTM a Param is a special case
>> of that.
>
> Okay, maybe we don't need to know which external params exist, but we
> need to know if there are any exec params so that we don't try to
> evaluate an expression with any of those during executor startup.
>
> I'll produce a patch which simplifies things in that area.

Okay, I've gotten rid of the tracking of external params. We now just
track exec params. We still need to know about these so we know if a
re-prune is required during ExecReScanAppend(). Obviously, we don't
want to prune on any random Param change, so I'm fairly sure it's a
good idea to keep track of these.

I've changed the code inside partkey_datum_from_expr so that it's a
simple bool array lookup to decide if we can evaluate the expression
or not. This bool array is populated during planning, which I think is
rather nice so we don't have to go and do it each time the plan is
executed.

I also discovered that I was needlessly running the pruning code again
during executor run in some cases where there was no possibility of
doing any further pruning there.  I've had to add some new code to set
the present_parts inside ExecFindInitialMatchingSubPlans(). It now
properly removes the members for any sub-partitions which have had all
of their partitions pruned. This allows us to calculate the subnodes
from 'present_parts' alone, rather than calling the pruning code
again.

Technically PartitionPruningData does not really need the
do_exec_prune field. A non-empty execparams could indicate this, but I
felt it was better to have the bool so that we have one for each
method of run-time pruning. This also saves a bms_is_empty() call
inside find_subplans_for_params_recurse(). This could be a bit of a
hotspot during parameterized nested loops which cause partition
pruning.

I'm really hoping this is what you meant about the special-case code for Params.

Does this look any better?

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: why partition pruning doesn't work?

From
Tom Lane
Date:
David Rowley <david.rowley@2ndquadrant.com> writes:
> I'm really hoping this is what you meant about the special-case code for Params.
> Does this look any better?

I'm starting to look this over and it seems like generally the right
thing, though I'm finding minor things I don't like much.

One thing I'm wondering about is why in the world are PartitionPruneInfo
and its subsidiary struct types declared in primnodes.h?  They are not
general-purpose expression nodes, or if they are then there are an awful
lot of places that should know about them and do not.  AFAICT they might
belong in plannodes.h.

            regards, tom lane


Re: why partition pruning doesn't work?

From
Tom Lane
Date:
I wrote:
> One thing I'm wondering about is why in the world are PartitionPruneInfo
> and its subsidiary struct types declared in primnodes.h?

Oh, and while I'm bitching: it seems like there is hardly any part of
the partitioning code in which the comments aren't desperately in need
of a copy-editing pass.  They are just chock-full of misspellings,
grammar that is faulty enough to make the meaning unclear, and/or
errors of fact.  An example of the latter is the repeated claims that
the basic partitioning functions belong to the planner.  Maybe that
was true at some stage of development; but AFAICS the logic in question
now lives in src/backend/partitioning/, which I would not think is
part of the planner.

            regards, tom lane


Re: why partition pruning doesn't work?

From
David Rowley
Date:
On 10 June 2018 at 09:00, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> One thing I'm wondering about is why in the world are PartitionPruneInfo
>> and its subsidiary struct types declared in primnodes.h?

That may have been a legacy thing that accidentally wasn't changed
from a previous version of the patch. I've now moved it to
plannodes.h.

> Oh, and while I'm bitching: it seems like there is hardly any part of
> the partitioning code in which the comments aren't desperately in need
> of a copy-editing pass.  They are just chock-full of misspellings,
> grammar that is faulty enough to make the meaning unclear, and/or
> errors of fact.  An example of the latter is the repeated claims that
> the basic partitioning functions belong to the planner.  Maybe that
> was true at some stage of development; but AFAICS the logic in question
> now lives in src/backend/partitioning/, which I would not think is
> part of the planner.

I've made a pass over the execPartition.c and partprune.c code
attempting to resolve these issues.  I have hopefully fixed them all,
but I apologise if I've missed any.

I also couldn't resist making a few other improvements to the code.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: why partition pruning doesn't work?

From
Tom Lane
Date:
David Rowley <david.rowley@2ndquadrant.com> writes:
> I've made a pass over the execPartition.c and partprune.c code
> attempting to resolve these issues.  I have hopefully fixed them all,
> but I apologise if I've missed any.
> I also couldn't resist making a few other improvements to the code.

By the time this arrived, I'd already whacked around your v4 patch
quite a bit, so rather than start over I just kept going with what
I had, and then tried to merge the useful bits of this one after
the fact.  I intentionally left out a couple of changes I couldn't
get excited about (such as having find_subplans_for_params_recurse
return a count), but I think I got everything in v5 otherwise.

I'm still fairly unhappy about the state of the comments, though.
It's very unclear for example what the subplan_map and subpart_map
arrays really are, eg what are they indexed by?  I get the impression
that only one of them can have a non-minus-1 value for a given index,
but that's nowhere explained.  Also, we have

 * partprunedata        Array of PartitionPruningData for the plan's target
 *                      partitioned relation. First element contains the
 *                      details for the target partitioned table.

And?  What are the other elements, what's the index rule, is there a
specific ordering for the other elements?  For that matter, "target
partitioned table" is content-free.  Do you mean topmost partitioned
table?  I suspect we expect the hierarchy to be flattened such that
ancestors appear before children, but that's not stated --- and if it
were, this bit about the first element would be a consequence of that.

Code-wise, there are some loose ends to be looked at.

* Related to the above, doesn't the loop at execPartition.c:1658 need
to work back-to-front?  Seems like it's trying to propagate info upwards
in the hierarchy; looking at a subpartition's present_parts value when
you still might change it later doesn't look right at all.

* partkey_datum_from_expr and its caller seem pretty brain-dead with
respect to nulls.  It's not even considering the possibility that a
Const has constisnull = true.  Now perhaps such a case can't reach
here because of plan-time constant-folding, but I don't think this code
has any business assuming that.  It's also being very stupid about
null values from expressions, just throwing up its hands and supposing
that nothing can be proven.  In reality, doesn't a null guarantee we
can eliminate all partitions, since the comparison functions are
presumed strict?

* I'm fairly suspicious of the fmgr_info and fmgr_info_copy calls in
perform_pruning_base_step.  Those seem likely to leak memory, and
for sure they destroy any opportunity for the called comparison
function to cache info in fn_extra --- something that's critical
for performance for some comparison functions such as array_eq.
Why is it necessary to suppose that those functions could change
from one execution to the next?

* The business with ExecFindInitialMatchingSubPlans remapping the
subplan indexes seems very dubious to me.  Surely, that is adding
way more complexity and fragility than it's worth.

            regards, tom lane


Re: why partition pruning doesn't work?

From
David Rowley
Date:
Thanks for working on and pushing those fixes.

On 11 June 2018 at 10:49, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It's very unclear for example what the subplan_map and subpart_map
> arrays really are, eg what are they indexed by?  I get the impression
> that only one of them can have a non-minus-1 value for a given index,
> but that's nowhere explained.  Also, we have

They're indexed by the partition indexes as returned by the partition
pruning code. I've included a patch to fix this.

>  * partprunedata        Array of PartitionPruningData for the plan's target
>  *                      partitioned relation. First element contains the
>  *                      details for the target partitioned table.
>
> And?  What are the other elements, what's the index rule, is there a
> specific ordering for the other elements?  For that matter, "target
> partitioned table" is content-free.  Do you mean topmost partitioned
> table?  I suspect we expect the hierarchy to be flattened such that
> ancestors appear before children, but that's not stated --- and if it
> were, this bit about the first element would be a consequence of that.

I've hopefully improved this comment in the attached patch, although
we may need to delay this until [1] has been addressed.

Only the index of the first element is important. Looking at
add_paths_to_append_rel(), the parents will always be listed before
their children.  That's not very well documented either, but having
the top-level parent as the first element is relied upon for more than
runtime partition pruning.

> Code-wise, there are some loose ends to be looked at.
>
> * Related to the above, doesn't the loop at execPartition.c:1658 need
> to work back-to-front?  Seems like it's trying to propagate info upwards
> in the hierarchy; looking at a subpartition's present_parts value when
> you still might change it later doesn't look right at all.

It works just fine front-to-back. That's why I added the 2nd loop. It
propagates the work done in the first loop, so the code, as it is now,
works if the array is in any order.

It may be possible to run the first loop back-to-front and get rid of
the 2nd one.  I've done this in the attached patch. I felt less
confident doing this earlier as the order of that array was not
defined well.

> * partkey_datum_from_expr and its caller seem pretty brain-dead with
> respect to nulls.  It's not even considering the possibility that a
> Const has constisnull = true.  Now perhaps such a case can't reach
> here because of plan-time constant-folding, but I don't think this code
> has any business assuming that.  It's also being very stupid about
> null values from expressions, just throwing up its hands and supposing
> that nothing can be proven.  In reality, doesn't a null guarantee we
> can eliminate all partitions, since the comparison functions are
> presumed strict?

You are right. I admit to not having thought more about that and just
taking the safe option. It makes sense to fix it. I imagine it could
be a bit annoying that pruning just gives up in such a case.  I've
added code for this in the attached patch.
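A sketch of the case in question, using Amit's `t1` table from upthread (this shows the intended behaviour after the fix, not verified output):

```sql
-- The partitioning comparison functions are strict, so "a = NULL" can
-- never be true for any row; a NULL pruning value should therefore let
-- us eliminate every partition rather than scanning all of them.
PREPARE p(int) AS SELECT * FROM t1 WHERE a = $1;
EXPLAIN (ANALYZE) EXECUTE p(NULL);  -- ideally all subplans pruned, zero rows
```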

> * I'm fairly suspicious of the fmgr_info and fmgr_info_copy calls in
> perform_pruning_base_step.  Those seem likely to leak memory, and
> for sure they destroy any opportunity for the called comparison
> function to cache info in fn_extra --- something that's critical
> for performance for some comparison functions such as array_eq.
> Why is it necessary to suppose that those functions could change
> from one execution to the next?

IIRC the problem there was that we didn't think of a good place to
cache the FmgrInfo. But rethinking of that now we could probably use
the same method as I used in run-time pruning to cache the exprstate.
Amit, what do you think?

> * The business with ExecFindInitialMatchingSubPlans remapping the
> subplan indexes seems very dubious to me.  Surely, that is adding
> way more complexity and fragility than it's worth.

It's all contained in a single .c file. It does not seem that hard to
get it right. Nothing outside of execPartition.c has any business
changing anything in these arrays. Nothing else even has to look at
them.

I think having the ability to prune during executor initialisation is
enormously important. I think without it, this patch is less than half
as useful.  However, if you didn't mean removing the executor
initialise pruning, and just not remapping the arrays, then I'm not
sure how we'd do that.  I'd previously tried having NULL subnodes in
the Append and I thought it required too much hacking work in
explain.c, so I decided to just collapse the array instead and do what
was required to make it work after having removed the unneeded
subplans. Did you have another idea on how to do this?

[1] https://www.postgresql.org/message-id/CAKJS1f9KG5nnOFhigWm4ND5Ut-yU075iJyA%2BACVyQnZ-ZW1Qtw%40mail.gmail.com

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: why partition pruning doesn't work?

From
Tom Lane
Date:
David Rowley <david.rowley@2ndquadrant.com> writes:
> On 11 June 2018 at 10:49, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> * The business with ExecFindInitialMatchingSubPlans remapping the
>> subplan indexes seems very dubious to me.  Surely, that is adding
>> way more complexity and fragility than it's worth.

> I think having the ability to prune during executor initialisation is
> enormously important. I think without it, this patch is less than half
> as useful.

Sure.

> However, if you didn't mean removing the executor
> initialise pruning, and just not remapping the arrays, then I'm not
> sure how we'd do that.  I'd previously tried having NULL subnodes in
> the Append and I thought it required too much hacking work in
> explain.c, so I decided to just collapse the array instead and do what
> was required to make it work after having removed the unneeded
> subplans. Did you have another idea on how to do this?

No, that was pretty much exactly what I was envisioning.  I have
not looked at the consequences for explain.c but it seemed like
it couldn't be all that bad; and to my mind the remapping business
in the partition code is pretty bad.  "It's all in one file" is not
a suitable justification for unintelligible, overcomplex code.

            regards, tom lane


Re: why partition pruning doesn't work?

From
David Rowley
Date:
On 12 June 2018 at 02:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> However, if you didn't mean removing the executor
>> initialise pruning, and just not remapping the arrays, then I'm not
>> sure how we'd do that.  I'd previously tried having NULL subnodes in
>> the Append and I thought it required too much hacking work in
>> explain.c, so I decided to just collapse the array instead and do what
>> was required to make it work after having removed the unneeded
>> subplans. Did you have another idea on how to do this?
>
> No, that was pretty much exactly what I was envisioning.  I have
> not looked at the consequences for explain.c but it seemed like
> it couldn't be all that bad; and to my mind the remapping business
> in the partition code is pretty bad.  "It's all in one file" is not
> a suitable justification for unintelligible, overcomplex code.

By all means, please have a look then.

I've been down that route. I didn't like it. I don't particularly
think the memory fragmentation is very good for the cache hit ratio
either, not to mention the slowdown of bms_next_member when there are
large gaps in the set. Keep in mind that we may scan the Append many
times over when it's on the inside of a nested loop join.

What you're proposing exchanges logic that fits neatly into one
function for special logic that will be scattered all over explain.c.
I really don't think that's the better of the two evils.

I'd rather just see my last patch applied, which simplifies the re-map
code by removing the 2nd loop.  Actually, even updating
present_parts to remove the unneeded sub-partitioned table indexes is
optional. We'd just need to give up on the elog(ERROR, "partition
missing from subplans") error and assume that case is fine.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: why partition pruning doesn't work?

From
Tom Lane
Date:
David Rowley <david.rowley@2ndquadrant.com> writes:
> On 12 June 2018 at 02:26, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> ... I'd previously tried having NULL subnodes in
>>> the Append and I thought it required too much hacking work in
>>> explain.c,

>> No, that was pretty much exactly what I was envisioning.

> What you're proposing exchanges logic that fits neatly into one
> function for special logic that will be scattered all over explain.c.
> I really don't think that's the better of the two evils.

As far as I can see, it'd involve about three or four lines' worth of
rewrite in one place-you-already-made-quite-ugly in explain.c, and an
added if-test in planstate_walk_members, and that'd be it.  That seems
like a pretty cheap price for being able to vastly simplify the invariants
for the pruning functions.  In fact, I doubt you'd even need two of them
anymore; just one with a bool flag for can-use-exec-params.

> I'd rather just see my last patch applied which simplifies the re-map
> code by removing the 2nd loop.  Actually, even updating the
> present_parts to remove the unneeded subpartitioned table indexes is
> optional. We' just need to give up on the elog(ERROR, "partition
> missing from subplans"); error and assume that case is fine.

The fact that you added that loop and only later decided it was
unnecessary seems to me to support my point pretty strongly: that
code is overcomplicated.

            regards, tom lane


Re: why partition pruning doesn't work?

From
Tom Lane
Date:
I wrote:
> * I'm fairly suspicious of the fmgr_info and fmgr_info_copy calls in
> perform_pruning_base_step.  Those seem likely to leak memory, and
> for sure they destroy any opportunity for the called comparison
> function to cache info in fn_extra --- something that's critical
> for performance for some comparison functions such as array_eq.
> Why is it necessary to suppose that those functions could change
> from one execution to the next?

After looking closer, that code isn't just inefficient, it's flat
out broken.  The reason is that ExecSetupPartitionPruneState thinks
it can store some pointers into the target relation's relcache entry
in the PartitionPruneContext, and then continue to use those pointers
after closing the relcache entry.  Nope, you can't do that.

The only reason this code has appeared to pass buildfarm testing is
that when we do

                if (cmpfn != context->partsupfunc[keyno].fn_oid)
                    fmgr_info(cmpfn, &partsupfunc[keyno]);
                else ...

if the relcache entry that context->partsupfunc is pointing into
has been freed (and overwritten thanks to CLOBBER_FREED_MEMORY), then the
function OID comparison generally fails so that we do a fresh fmgr_info
call.  In the field it's quite likely that we'd accept and attempt to
use a partially-clobbered FmgrInfo; but it would only happen if a relcache
flush had caused the data to get released, so it could be awhile before
anybody saw it happen, let alone reproduced it enough to figure it out.

It's easy to demonstrate that there's a problem if you instrument this
code to log when the OID comparison fails, and then run the regression
tests with -DRELCACHE_FORCE_RELEASE: you get lots of reports like

2018-06-11 18:01:28.686 EDT [16734] LOG:  replace partsupfunc 2139062143 with 351
2018-06-11 18:01:28.686 EDT [16734] LOG:  replace partsupfunc 2139062143 with 351
2018-06-11 18:01:28.686 EDT [16734] LOG:  replace partsupfunc 2139062143 with 351
2018-06-11 18:01:28.686 EDT [16734] LOG:  replace partsupfunc 2139062143 with 351
2018-06-11 18:01:28.686 EDT [16734] LOG:  replace partsupfunc 2139062143 with 351
2018-06-11 18:01:28.707 EDT [16734] LOG:  replace partsupfunc 2139062143 with 351
2018-06-11 18:01:28.707 EDT [16734] LOG:  replace partsupfunc 2139062143 with 351

showing that context->partsupfunc has been overwritten by
CLOBBER_FREED_MEMORY.
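
For reference, that bogus OID is no accident: 2139062143 is 0x7F7F7F7F, the byte pattern CLOBBER_FREED_MEMORY writes over freed chunks.  A minimal stand-alone illustration (plain C, outside the backend):

```c
#include <stdint.h>
#include <string.h>

/* CLOBBER_FREED_MEMORY wipes freed memory with 0x7F bytes; reading a
 * 32-bit fn_oid out of such a chunk therefore yields 0x7F7F7F7F,
 * i.e. exactly the 2139062143 seen in the log lines above. */
static uint32_t read_clobbered_oid(void)
{
    unsigned char freed_chunk[sizeof(uint32_t)];
    uint32_t oid;

    memset(freed_chunk, 0x7F, sizeof(freed_chunk)); /* simulate the wipe */
    memcpy(&oid, freed_chunk, sizeof(oid));
    return oid;
}
```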

If we had any buildfarm critters running valgrind on
RELCACHE_FORCE_RELEASE or CLOBBER_CACHE_ALWAYS builds, they'd have
detected use of uninitialized memory here ... but I don't think we have
any.  (The combination of valgrind and CCA would probably be too slow to
be practical :-(, though maybe somebody with a fast machine could do
the other thing.)

Not sure about a good fix for this.  It seems annoying to copy the
rel's whole partkey data structure into query-local storage, but
I'm not sure we have any choice.  On the bright side, there might
be an opportunity to get rid of repeated runtime fmgr_info lookups
in cross-type comparison situations.

            regards, tom lane


Re: why partition pruning doesn't work?

From
Andrew Dunstan
Date:

On 06/11/2018 06:24 PM, Tom Lane wrote:
>
> If we had any buildfarm critters running valgrind on
> RELCACHE_FORCE_RELEASE or CLOBBER_CACHE_ALWAYS builds, they'd have
> detected use of uninitialized memory here ... but I don't think we have
> any.  (The combination of valgrind and CCA would probably be too slow to
> be practical :-(, though maybe somebody with a fast machine could do
> the other thing.)
>
>


I don't have a fast machine, but I do have a slow machine already 
running valgrind and not doing much else :-) Let's see how lousyjack 
does with -DRELCACHE_FORCE_RELEASE

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: why partition pruning doesn't work?

From
Tom Lane
Date:
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
> On 06/11/2018 06:24 PM, Tom Lane wrote:
>> If we had any buildfarm critters running valgrind on
>> RELCACHE_FORCE_RELEASE or CLOBBER_CACHE_ALWAYS builds, they'd have
>> detected use of uninitialized memory here ... but I don't think we have
>> any.  (The combination of valgrind and CCA would probably be too slow to
>> be practical :-(, though maybe somebody with a fast machine could do
>> the other thing.)

> I don't have a fast machine, but I do have a slow machine already
> running valgrind and not doing much else :-) Let's see how lousyjack
> does with -DRELCACHE_FORCE_RELEASE

I just tried the case here, and it doesn't even get as far as any
of the partitioning tests: it bombs out in inherit.sql :-(

==00:00:06:55.816 26107== Invalid read of size 4
==00:00:06:55.816 26107==    at 0x5F3978: ATExecDropInherit (tablecmds.c:11928)
==00:00:06:55.816 26107==    by 0x60212A: ATExecCmd (tablecmds.c:4241)
==00:00:06:55.816 26107==    by 0x602CC4: ATController (tablecmds.c:3976)
==00:00:06:55.816 26107==    by 0x7910EA: ProcessUtilitySlow (utility.c:1117)
==00:00:06:55.816 26107==    by 0x79180F: standard_ProcessUtility (utility.c:920)
==00:00:06:55.816 26107==    by 0x78D748: PortalRunUtility (pquery.c:1178)
==00:00:06:55.816 26107==    by 0x78E6D0: PortalRunMulti (pquery.c:1331)
==00:00:06:55.816 26107==    by 0x78EF8F: PortalRun (pquery.c:799)
==00:00:06:55.816 26107==    by 0x78B35C: exec_simple_query (postgres.c:1122)
==00:00:06:55.816 26107==    by 0x78C8B3: PostgresMain (postgres.c:4153)
==00:00:06:55.816 26107==    by 0x70FBD5: PostmasterMain (postmaster.c:4361)
==00:00:06:55.816 26107==    by 0x67AE4F: main (main.c:228)
==00:00:06:55.816 26107==  Address 0xe823e90 is 179,504 bytes inside a recently re-allocated block of size 524,288 alloc'd
==00:00:06:55.816 26107==    at 0x4A06A2E: malloc (vg_replace_malloc.c:270)
==00:00:06:55.816 26107==    by 0x8BBB35: AllocSetAlloc (aset.c:923)
==00:00:06:55.816 26107==    by 0x8C4473: palloc (mcxt.c:938)
==00:00:06:55.816 26107==    by 0x488DEF: CreateTemplateTupleDesc (tupdesc.c:66)
==00:00:06:55.816 26107==    by 0x88D2C0: RelationBuildDesc (relcache.c:416)
==00:00:06:55.816 26107==    by 0x8904B5: RelationIdGetRelation (relcache.c:1943)
==00:00:06:55.816 26107==    by 0x4C93BF: relation_open (heapam.c:1135)
==00:00:06:55.816 26107==    by 0x4D8305: index_open (indexam.c:154)
==00:00:06:55.816 26107==    by 0x62D6EB: ExecOpenIndices (execIndexing.c:197)
==00:00:06:55.816 26107==    by 0x53B607: CatalogOpenIndexes (indexing.c:49)
==00:00:06:55.816 26107==    by 0x556467: recordMultipleDependencies (pg_depend.c:112)
==00:00:06:55.816 26107==    by 0x560D44: create_toast_table (toasting.c:385)


That one's pretty obvious when you look at the code:

    /* keep our lock on the parent relation until commit */
    heap_close(parent_rel, NoLock);

    ObjectAddressSet(address, RelationRelationId,
                     RelationGetRelid(parent_rel));

It looks like this might be a fruitful source of creepie-crawlies.
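
The obvious repair is simply to compute the ObjectAddress before closing the relation.  A stand-alone sketch of the corrected ordering (Relation, heap_close, and RelationGetRelid here are stub stand-ins, not the real relcache API):

```c
#include <stdlib.h>

/* Under RELCACHE_FORCE_RELEASE, closing the relation really does free
 * the entry, which these stubs model with an actual free(). */
typedef struct RelationData { unsigned relid; } *Relation;

static void heap_close(Relation rel) { free(rel); }
static unsigned RelationGetRelid(Relation rel) { return rel->relid; }

/* Corrected ordering: read what we need from the entry *before*
 * releasing it -- the reverse of what ATExecDropInherit did. */
static unsigned drop_inherit_fixed(Relation parent_rel)
{
    unsigned address_objid = RelationGetRelid(parent_rel);

    /* keep our lock on the parent relation until commit */
    heap_close(parent_rel);
    return address_objid;
}

static unsigned demo(void)
{
    Relation rel = malloc(sizeof(struct RelationData));
    rel->relid = 1259;          /* arbitrary test OID */
    return drop_inherit_fixed(rel);
}
```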

            regards, tom lane


Re: why partition pruning doesn't work?

From
David Rowley
Date:
On 12 June 2018 at 10:24, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> After looking closer, that code isn't just inefficient, it's flat
> out broken.  The reason is that ExecSetupPartitionPruneState thinks
> it can store some pointers into the target relation's relcache entry
> in the PartitionPruneContext, and then continue to use those pointers
> after closing the relcache entry.  Nope, you can't do that.

I think the best fix is to just have a separate FmgrInfo for each step
and partkey comparison. Some FmgrInfos will end up identical, but
that's probably a small price to pay. Perhaps they should be separate
anyway so that the fn_extra is not shared between different quals
comparing to the same partition key?

I went with an array of FmgrInfos rather than an array of pointers to
FmgrInfos for cache efficiency.  This does require that InvalidOid is
0, since I've palloc0'd that memory and I'm checking whether a cache
entry has been populated yet with:
if (!OidIsValid(context->stepcmpfuncs[stateidx].fn_oid))

Patch attached.
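
The cache is laid out as a flat n_steps x partnatts array, with the patch's PruneCxtStateIdx macro mapping (step_id, keyno) to a slot.  A stand-alone sketch of the lookup-or-populate logic (FmgrInfoSlot and the bare unsigned OIDs are stand-ins for FmgrInfo and fmgr_info_cxt()):

```c
#include <string.h>

/* A flat (n_steps x partnatts) cache, zero-filled like the palloc0'd
 * stepcmpfuncs array, so fn_oid == 0 (InvalidOid) marks an empty slot.
 * FmgrInfoSlot is a stand-in for FmgrInfo. */
typedef struct { unsigned fn_oid; } FmgrInfoSlot;

#define PruneCxtStateIdx(partnatts, step_id, keyno) \
    ((partnatts) * (step_id) + (keyno))

/* Return the cached comparison-function OID for (step_id, keyno),
 * populating the slot on first use (the assignment stands in for the
 * fmgr_info_cxt() lookup). */
static unsigned cached_cmpfn(FmgrInfoSlot *slots, int partnatts,
                             int step_id, int keyno, unsigned cmpfn)
{
    FmgrInfoSlot *slot = &slots[PruneCxtStateIdx(partnatts, step_id, keyno)];

    if (slot->fn_oid != cmpfn)
        slot->fn_oid = cmpfn;
    return slot->fn_oid;
}
```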

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: why partition pruning doesn't work?

From
Andrew Dunstan
Date:

On 06/11/2018 06:41 PM, Andrew Dunstan wrote:
>
>
> On 06/11/2018 06:24 PM, Tom Lane wrote:
>>
>> If we had any buildfarm critters running valgrind on
>> RELCACHE_FORCE_RELEASE or CLOBBER_CACHE_ALWAYS builds, they'd have
>> detected use of uninitialized memory here ... but I don't think we have
>> any.  (The combination of valgrind and CCA would probably be too slow to
>> be practical :-(, though maybe somebody with a fast machine could do
>> the other thing.)
>>
>>
>
>
> I don't have a fast machine, but I do have a slow machine already 
> running valgrind and not doing much else :-) Let's see how lousyjack 
> does with -DRELCACHE_FORCE_RELEASE
>
>


It added about 20% to the run time. That's tolerable, so I'll leave it on.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: why partition pruning doesn't work?

From
Ashutosh Bapat
Date:
On Tue, Jun 12, 2018 at 3:54 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Not sure about a good fix for this.  It seems annoying to copy the
> rel's whole partkey data structure into query-local storage, but
> I'm not sure we have any choice.  On the bright side, there might
> be an opportunity to get rid of repeated runtime fmgr_info lookups
> in cross-type comparison situations.
>

We already do that while building part_scheme. So, if we are in
planner, it's readily available through RelOptInfo. If we need it in
the executor, we need to pass it down from RelOptInfo into one of the
execution states. I haven't looked at the patch to exactly figure out
which of these is true.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company


Re: why partition pruning doesn't work?

From
Robert Haas
Date:
On Mon, Jun 11, 2018 at 6:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Not sure about a good fix for this.  It seems annoying to copy the
> rel's whole partkey data structure into query-local storage, but
> I'm not sure we have any choice.  On the bright side, there might
> be an opportunity to get rid of repeated runtime fmgr_info lookups
> in cross-type comparison situations.

Is this the same issue I raised in
https://www.postgresql.org/message-id/flat/CA%2BTgmoYKToP4-adCFFRNrO21OGuH%3Dphx-fiB1dYoqksNYX6YHQ%40mail.gmail.com
or a similar issue that creeps up at execution time?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: why partition pruning doesn't work?

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Jun 11, 2018 at 6:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Not sure about a good fix for this.  It seems annoying to copy the
>> rel's whole partkey data structure into query-local storage, but
>> I'm not sure we have any choice.  On the bright side, there might
>> be an opportunity to get rid of repeated runtime fmgr_info lookups
>> in cross-type comparison situations.

> Is this the same issue I raised in
> https://www.postgresql.org/message-id/flat/CA%2BTgmoYKToP4-adCFFRNrO21OGuH%3Dphx-fiB1dYoqksNYX6YHQ%40mail.gmail.com
> or a similar issue that creeps up at execution time?

Well, it's related to that: *if* we held the relcache entry open for
the duration of the query, and *if* holding such a pin were sufficient
to guarantee the contents of the entry's partition data couldn't change
or even move, then we could avoid doing so much copying.  But as we
discussed then, neither condition is true, and I don't think either one is
cheap to make true.  Certainly there's no logic in the relcache to detect
changes of partition data like we do for, say, triggers.

            regards, tom lane


Re: why partition pruning doesn't work?

From
Robert Haas
Date:
On Tue, Jun 12, 2018 at 12:25 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Mon, Jun 11, 2018 at 6:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Not sure about a good fix for this.  It seems annoying to copy the
>>> rel's whole partkey data structure into query-local storage, but
>>> I'm not sure we have any choice.  On the bright side, there might
>>> be an opportunity to get rid of repeated runtime fmgr_info lookups
>>> in cross-type comparison situations.
>
>> Is this the same issue I raised in
>> https://www.postgresql.org/message-id/flat/CA%2BTgmoYKToP4-adCFFRNrO21OGuH%3Dphx-fiB1dYoqksNYX6YHQ%40mail.gmail.com
>> or a similar issue that creeps up at execution time?
>
> Well, it's related to that: *if* we held the relcache entry open for
> the duration of the query, and *if* holding such a pin were sufficient
> to guarantee the contents of the entry's partition data couldn't change
> or even move, then we could avoid doing so much copying.  But as we
> discussed then, neither condition is true, and I don't think either one is
> cheap to make true.  Certainly there's no logic in the relcache to detect
> changes of partition data like we do for, say, triggers.

I think we DO hold relations open for the duration of execution
(though not necessarily between planning and execution).  And there is
code in RelationClearRelation to avoid changing rd_partkey and
rd_partdesc if no logical change has occurred.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: why partition pruning doesn't work?

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I think we DO hold relations open for the duration of execution
> (though not necessarily between planning and execution).  And there is
> code in RelationClearRelation to avoid changing rd_partkey and
> rd_partdesc if no logical change has occurred.

Testing with valgrind + RELCACHE_FORCE_RELEASE is sufficient to disprove
that, cf current results from lousyjack (which match my own testing).
The partkey *is* disappearing under us.

While I've not looked into the exact reasons for that, my first guess
is that the partitioned table is not held open because it's not one
of the ones to be scanned.  Are you prepared to change something like
that at this stage of the release cycle?

            regards, tom lane


Re: why partition pruning doesn't work?

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> And there is
> code in RelationClearRelation to avoid changing rd_partkey and
> rd_partdesc if no logical change has occurred.

Oh, and by the way, what's actually in there is

        keep_partkey = (relation->rd_partkey != NULL);

I would be interested to see an explanation of why that isn't utterly
broken.

            regards, tom lane


Re: why partition pruning doesn't work?

From
Robert Haas
Date:
On Tue, Jun 12, 2018 at 12:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Testing with valgrind + RELCACHE_FORCE_RELEASE is sufficient to disprove
> that, cf current results from lousyjack (which match my own testing).
> The partkey *is* disappearing under us.
>
> While I've not looked into the exact reasons for that, my first guess
> is that the partitioned table is not held open because it's not one
> of the ones to be scanned.  Are you prepared to change something like
> that at this stage of the release cycle?

The partition key is immutable, so it should NOT be able to disappear
out from under us.  Once you have defined the partitioning strategy
for a table and the partitioning keys associated with it, you can't
ever change it.  The only reason we need keep_partkey at all, as
opposed to just assuming that the relevant portion of the relcache
entry can't change, is that during relation creation we are
briefly in a state where the pg_class row exists and the
pg_partitioned_table row hasn't been set up yet.  So I think your
guess that the relation is not kept open is likely to be correct.

As for whether to change it at this point in the release cycle, I
guess that, to me, depends on how invasive the fix is.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: why partition pruning doesn't work?

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Jun 12, 2018 at 12:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> While I've not looked into the exact reasons for that, my first guess
>> is that the partitioned table is not held open because it's not one
>> of the ones to be scanned.  Are you prepared to change something like
>> that at this stage of the release cycle?

> The partition key is immutable, so it should NOT be able to disappear
> out from under us.

Hm.  That could be better documented.

> As for whether to change it at this point in the release cycle, I
> guess that, to me, depends on how invasive the fix is.

It seems not to be that bad: we just need a shutdown call for the
PartitionPruneState, and then we can remember the open relation there.
The attached is based on David's patch from yesterday.

I'm still a bit annoyed at the fmgr_info_copy calls in this.  It'd be
better to use the FmgrInfos in the relcache when applicable.  However,
mixing those with the cross-type ones would seem to require that we change
the API for get_matching_hash_bounds et al from taking "FmgrInfo *" to
taking "FmgrInfo **", which looks rather invasive.

            regards, tom lane

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 33513ff..4eeee7c 100644
*** a/src/backend/executor/execPartition.c
--- b/src/backend/executor/execPartition.c
*************** adjust_partition_tlist(List *tlist, Tupl
*** 1357,1367 ****
   *
   * Functions:
   *
!  * ExecSetupPartitionPruneState:
   *        Creates the PartitionPruneState required by each of the two pruning
   *        functions.  Details stored include how to map the partition index
   *        returned by the partition pruning code into subplan indexes.
   *
   * ExecFindInitialMatchingSubPlans:
   *        Returns indexes of matching subplans.  Partition pruning is attempted
   *        without any evaluation of expressions containing PARAM_EXEC Params.
--- 1357,1370 ----
   *
   * Functions:
   *
!  * ExecCreatePartitionPruneState:
   *        Creates the PartitionPruneState required by each of the two pruning
   *        functions.  Details stored include how to map the partition index
   *        returned by the partition pruning code into subplan indexes.
   *
+  * ExecDestroyPartitionPruneState:
+  *        Deletes a PartitionPruneState. Must be called during executor shutdown.
+  *
   * ExecFindInitialMatchingSubPlans:
   *        Returns indexes of matching subplans.  Partition pruning is attempted
   *        without any evaluation of expressions containing PARAM_EXEC Params.
*************** adjust_partition_tlist(List *tlist, Tupl
*** 1382,1389 ****
   */

  /*
!  * ExecSetupPartitionPruneState
!  *        Set up the data structure required for calling
   *        ExecFindInitialMatchingSubPlans and ExecFindMatchingSubPlans.
   *
   * 'planstate' is the parent plan node's execution state.
--- 1385,1392 ----
   */

  /*
!  * ExecCreatePartitionPruneState
!  *        Build the data structure required for calling
   *        ExecFindInitialMatchingSubPlans and ExecFindMatchingSubPlans.
   *
   * 'planstate' is the parent plan node's execution state.
*************** adjust_partition_tlist(List *tlist, Tupl
*** 1395,1401 ****
   * in each PartitionPruneInfo.
   */
  PartitionPruneState *
! ExecSetupPartitionPruneState(PlanState *planstate, List *partitionpruneinfo)
  {
      PartitionPruneState *prunestate;
      PartitionPruningData *prunedata;
--- 1398,1404 ----
   * in each PartitionPruneInfo.
   */
  PartitionPruneState *
! ExecCreatePartitionPruneState(PlanState *planstate, List *partitionpruneinfo)
  {
      PartitionPruneState *prunestate;
      PartitionPruningData *prunedata;
*************** ExecSetupPartitionPruneState(PlanState *
*** 1435,1445 ****
          PartitionPruningData *pprune = &prunedata[i];
          PartitionPruneContext *context = &pprune->context;
          PartitionDesc partdesc;
-         Relation    rel;
          PartitionKey partkey;
-         ListCell   *lc2;
          int            partnatts;
          int            n_steps;

          /*
           * We must copy the subplan_map rather than pointing directly to the
--- 1438,1447 ----
          PartitionPruningData *pprune = &prunedata[i];
          PartitionPruneContext *context = &pprune->context;
          PartitionDesc partdesc;
          PartitionKey partkey;
          int            partnatts;
          int            n_steps;
+         ListCell   *lc2;

          /*
           * We must copy the subplan_map rather than pointing directly to the
*************** ExecSetupPartitionPruneState(PlanState *
*** 1456,1481 ****
          pprune->present_parts = bms_copy(pinfo->present_parts);

          /*
!          * Grab some info from the table's relcache; lock was already obtained
!          * by ExecLockNonLeafAppendTables.
           */
!         rel = relation_open(pinfo->reloid, NoLock);

!         partkey = RelationGetPartitionKey(rel);
!         partdesc = RelationGetPartitionDesc(rel);

          context->strategy = partkey->strategy;
          context->partnatts = partnatts = partkey->partnatts;
!         context->partopfamily = partkey->partopfamily;
!         context->partopcintype = partkey->partopcintype;
          context->partcollation = partkey->partcollation;
          context->partsupfunc = partkey->partsupfunc;
!         context->nparts = pinfo->nparts;
!         context->boundinfo = partition_bounds_copy(partdesc->boundinfo, partkey);
          context->planstate = planstate;

          /* Initialize expression state for each expression we need */
-         n_steps = list_length(pinfo->pruning_steps);
          context->exprstates = (ExprState **)
              palloc0(sizeof(ExprState *) * n_steps * partnatts);
          foreach(lc2, pinfo->pruning_steps)
--- 1458,1490 ----
          pprune->present_parts = bms_copy(pinfo->present_parts);

          /*
!          * We need to hold a pin on the partitioned table's relcache entry so
!          * that we can rely on its copies of the table's partition key and
!          * partition descriptor.  We need not get a lock though; one should
!          * have been acquired already by InitPlan or
!          * ExecLockNonLeafAppendTables.
           */
!         context->partrel = relation_open(pinfo->reloid, NoLock);

!         partkey = RelationGetPartitionKey(context->partrel);
!         partdesc = RelationGetPartitionDesc(context->partrel);
!         n_steps = list_length(pinfo->pruning_steps);

          context->strategy = partkey->strategy;
          context->partnatts = partnatts = partkey->partnatts;
!         context->nparts = pinfo->nparts;
!         context->boundinfo = partdesc->boundinfo;
          context->partcollation = partkey->partcollation;
          context->partsupfunc = partkey->partsupfunc;
!
!         /* We'll look up type-specific support functions as needed */
!         context->stepcmpfuncs = (FmgrInfo *)
!             palloc0(sizeof(FmgrInfo) * n_steps * partnatts);
!
!         context->ppccontext = CurrentMemoryContext;
          context->planstate = planstate;

          /* Initialize expression state for each expression we need */
          context->exprstates = (ExprState **)
              palloc0(sizeof(ExprState *) * n_steps * partnatts);
          foreach(lc2, pinfo->pruning_steps)
*************** ExecSetupPartitionPruneState(PlanState *
*** 1527,1534 ****
          prunestate->execparamids = bms_add_members(prunestate->execparamids,
                                                     pinfo->execparamids);

-         relation_close(rel, NoLock);
-
          i++;
      }

--- 1536,1541 ----
*************** ExecSetupPartitionPruneState(PlanState *
*** 1536,1541 ****
--- 1543,1568 ----
  }

  /*
+  * ExecDestroyPartitionPruneState
+  *        Release resources at plan shutdown.
+  *
+  * We don't bother to free any memory here, since the whole executor context
+  * will be going away shortly.  We do need to release our relcache pins.
+  */
+ void
+ ExecDestroyPartitionPruneState(PartitionPruneState *prunestate)
+ {
+     int            i;
+
+     for (i = 0; i < prunestate->num_partprunedata; i++)
+     {
+         PartitionPruningData *pprune = &prunestate->partprunedata[i];
+
+         relation_close(pprune->context.partrel, NoLock);
+     }
+ }
+
+ /*
   * ExecFindInitialMatchingSubPlans
   *        Identify the set of subplans that cannot be eliminated by initial
   *        pruning (disregarding any pruning constraints involving PARAM_EXEC
diff --git a/src/backend/executor/nodeAppend.c b/src/backend/executor/nodeAppend.c
index 6dd53e9..5ce4fb4 100644
*** a/src/backend/executor/nodeAppend.c
--- b/src/backend/executor/nodeAppend.c
*************** ExecInitAppend(Append *node, EState *est
*** 136,143 ****
          /* We may need an expression context to evaluate partition exprs */
          ExecAssignExprContext(estate, &appendstate->ps);

!         prunestate = ExecSetupPartitionPruneState(&appendstate->ps,
!                                                   node->part_prune_infos);

          /* Perform an initial partition prune, if required. */
          if (prunestate->do_initial_prune)
--- 136,145 ----
          /* We may need an expression context to evaluate partition exprs */
          ExecAssignExprContext(estate, &appendstate->ps);

!         /* Create the working data structure for pruning. */
!         prunestate = ExecCreatePartitionPruneState(&appendstate->ps,
!                                                    node->part_prune_infos);
!         appendstate->as_prune_state = prunestate;

          /* Perform an initial partition prune, if required. */
          if (prunestate->do_initial_prune)
*************** ExecInitAppend(Append *node, EState *est
*** 178,185 ****
           */
          if (!prunestate->do_exec_prune)
              appendstate->as_valid_subplans = bms_add_range(NULL, 0, nplans - 1);
-
-         appendstate->as_prune_state = prunestate;
      }
      else
      {
--- 180,185 ----
*************** ExecEndAppend(AppendState *node)
*** 330,335 ****
--- 330,341 ----
       */
      for (i = 0; i < nplans; i++)
          ExecEndNode(appendplans[i]);
+
+     /*
+      * release any resources associated with run-time pruning
+      */
+     if (node->as_prune_state)
+         ExecDestroyPartitionPruneState(node->as_prune_state);
  }

  void
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index 856bdd3..ee5826e 100644
*** a/src/backend/partitioning/partprune.c
--- b/src/backend/partitioning/partprune.c
*************** prune_append_rel_partitions(RelOptInfo *
*** 436,449 ****
      if (contradictory)
          return NULL;

      context.strategy = rel->part_scheme->strategy;
      context.partnatts = rel->part_scheme->partnatts;
-     context.partopfamily = rel->part_scheme->partopfamily;
-     context.partopcintype = rel->part_scheme->partopcintype;
-     context.partcollation = rel->part_scheme->partcollation;
-     context.partsupfunc = rel->part_scheme->partsupfunc;
      context.nparts = rel->nparts;
      context.boundinfo = rel->boundinfo;

      /* These are not valid when being called from the planner */
      context.planstate = NULL;
--- 436,453 ----
      if (contradictory)
          return NULL;

+     /* Set up PartitionPruneContext */
+     context.partrel = NULL;
      context.strategy = rel->part_scheme->strategy;
      context.partnatts = rel->part_scheme->partnatts;
      context.nparts = rel->nparts;
      context.boundinfo = rel->boundinfo;
+     context.partcollation = rel->part_scheme->partcollation;
+     context.partsupfunc = rel->part_scheme->partsupfunc;
+     context.stepcmpfuncs = (FmgrInfo *) palloc0(sizeof(FmgrInfo) *
+                                                 context.partnatts *
+                                                 list_length(pruning_steps));
+     context.ppccontext = CurrentMemoryContext;

      /* These are not valid when being called from the planner */
      context.planstate = NULL;
*************** perform_pruning_base_step(PartitionPrune
*** 2809,2815 ****
      int            keyno,
                  nvalues;
      Datum        values[PARTITION_MAX_KEYS];
!     FmgrInfo    partsupfunc[PARTITION_MAX_KEYS];

      /*
       * There better be the same number of expressions and compare functions.
--- 2813,2820 ----
      int            keyno,
                  nvalues;
      Datum        values[PARTITION_MAX_KEYS];
!     FmgrInfo   *partsupfunc;
!     int            stateidx;

      /*
       * There better be the same number of expressions and compare functions.
*************** perform_pruning_base_step(PartitionPrune
*** 2844,2850 ****
          if (lc1 != NULL)
          {
              Expr       *expr;
-             int            stateidx;
              Datum        datum;
              bool        isnull;

--- 2849,2854 ----
*************** perform_pruning_base_step(PartitionPrune
*** 2873,2891 ****
                      return result;
                  }

!                 /*
!                  * If we're going to need a different comparison function than
!                  * the one cached in the PartitionKey, we'll need to look up
!                  * the FmgrInfo.
!                  */
                  cmpfn = lfirst_oid(lc2);
                  Assert(OidIsValid(cmpfn));
!                 if (cmpfn != context->partsupfunc[keyno].fn_oid)
!                     fmgr_info(cmpfn, &partsupfunc[keyno]);
!                 else
!                     fmgr_info_copy(&partsupfunc[keyno],
!                                    &context->partsupfunc[keyno],
!                                    CurrentMemoryContext);

                  values[keyno] = datum;
                  nvalues++;
--- 2877,2901 ----
                      return result;
                  }

!                 /* Set up the stepcmpfuncs entry, unless we already did */
                  cmpfn = lfirst_oid(lc2);
                  Assert(OidIsValid(cmpfn));
!                 if (cmpfn != context->stepcmpfuncs[stateidx].fn_oid)
!                 {
!                     /*
!                      * If the needed support function is the same one cached
!                      * in the relation's partition key, copy the cached
!                      * FmgrInfo.  Otherwise (i.e., when we have a cross-type
!                      * comparison), an actual lookup is required.
!                      */
!                     if (cmpfn == context->partsupfunc[keyno].fn_oid)
!                         fmgr_info_copy(&context->stepcmpfuncs[stateidx],
!                                        &context->partsupfunc[keyno],
!                                        context->ppccontext);
!                     else
!                         fmgr_info_cxt(cmpfn, &context->stepcmpfuncs[stateidx],
!                                       context->ppccontext);
!                 }

                  values[keyno] = datum;
                  nvalues++;
*************** perform_pruning_base_step(PartitionPrune
*** 2896,2901 ****
--- 2906,2918 ----
          }
      }

+     /*
+      * Point partsupfunc to the entry for the 0th key of this step; the
+      * additional support functions, if any, follow consecutively.
+      */
+     stateidx = PruneCxtStateIdx(context->partnatts, opstep->step.step_id, 0);
+     partsupfunc = &context->stepcmpfuncs[stateidx];
+
      switch (context->strategy)
      {
          case PARTITION_STRATEGY_HASH:
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index 3dfb1b8..dc69280 100644
*** a/src/backend/utils/cache/relcache.c
--- b/src/backend/utils/cache/relcache.c
*************** RelationClearRelation(Relation relation,
*** 2401,2406 ****
--- 2401,2407 ----
          keep_tupdesc = equalTupleDescs(relation->rd_att, newrel->rd_att);
          keep_rules = equalRuleLocks(relation->rd_rules, newrel->rd_rules);
          keep_policies = equalRSDesc(relation->rd_rsdesc, newrel->rd_rsdesc);
+         /* partkey is immutable once set up, so we can always keep it */
          keep_partkey = (relation->rd_partkey != NULL);
          keep_partdesc = equalPartitionDescs(relation->rd_partkey,
                                              relation->rd_partdesc,
*************** RelationClearRelation(Relation relation,
*** 2445,2451 ****
          SWAPFIELD(Form_pg_class, rd_rel);
          /* ... but actually, we don't have to update newrel->rd_rel */
          memcpy(relation->rd_rel, newrel->rd_rel, CLASS_TUPLE_SIZE);
!         /* preserve old tupledesc and rules if no logical change */
          if (keep_tupdesc)
              SWAPFIELD(TupleDesc, rd_att);
          if (keep_rules)
--- 2446,2452 ----
          SWAPFIELD(Form_pg_class, rd_rel);
          /* ... but actually, we don't have to update newrel->rd_rel */
          memcpy(relation->rd_rel, newrel->rd_rel, CLASS_TUPLE_SIZE);
!         /* preserve old tupledesc, rules, policies if no logical change */
          if (keep_tupdesc)
              SWAPFIELD(TupleDesc, rd_att);
          if (keep_rules)
*************** RelationClearRelation(Relation relation,
*** 2459,2471 ****
          SWAPFIELD(Oid, rd_toastoid);
          /* pgstat_info must be preserved */
          SWAPFIELD(struct PgStat_TableStatus *, pgstat_info);
!         /* partition key must be preserved, if we have one */
          if (keep_partkey)
          {
              SWAPFIELD(PartitionKey, rd_partkey);
              SWAPFIELD(MemoryContext, rd_partkeycxt);
          }
-         /* preserve old partdesc if no logical change */
          if (keep_partdesc)
          {
              SWAPFIELD(PartitionDesc, rd_partdesc);
--- 2460,2471 ----
          SWAPFIELD(Oid, rd_toastoid);
          /* pgstat_info must be preserved */
          SWAPFIELD(struct PgStat_TableStatus *, pgstat_info);
!         /* preserve old partitioning info if no logical change */
          if (keep_partkey)
          {
              SWAPFIELD(PartitionKey, rd_partkey);
              SWAPFIELD(MemoryContext, rd_partkeycxt);
          }
          if (keep_partdesc)
          {
              SWAPFIELD(PartitionDesc, rd_partdesc);
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 71d639f..862bf65 100644
*** a/src/include/executor/execPartition.h
--- b/src/include/executor/execPartition.h
*************** extern HeapTuple ConvertPartitionTupleSl
*** 208,215 ****
                            TupleTableSlot **p_my_slot);
  extern void ExecCleanupTupleRouting(ModifyTableState *mtstate,
                          PartitionTupleRouting *proute);
! extern PartitionPruneState *ExecSetupPartitionPruneState(PlanState *planstate,
!                              List *partitionpruneinfo);
  extern Bitmapset *ExecFindMatchingSubPlans(PartitionPruneState *prunestate);
  extern Bitmapset *ExecFindInitialMatchingSubPlans(PartitionPruneState *prunestate,
                                  int nsubplans);
--- 208,216 ----
                            TupleTableSlot **p_my_slot);
  extern void ExecCleanupTupleRouting(ModifyTableState *mtstate,
                          PartitionTupleRouting *proute);
! extern PartitionPruneState *ExecCreatePartitionPruneState(PlanState *planstate,
!                               List *partitionpruneinfo);
! extern void ExecDestroyPartitionPruneState(PartitionPruneState *prunestate);
  extern Bitmapset *ExecFindMatchingSubPlans(PartitionPruneState *prunestate);
  extern Bitmapset *ExecFindInitialMatchingSubPlans(PartitionPruneState *prunestate,
                                  int nsubplans);
diff --git a/src/include/partitioning/partprune.h b/src/include/partitioning/partprune.h
index e3b3bfb..09147b5 100644
*** a/src/include/partitioning/partprune.h
--- b/src/include/partitioning/partprune.h
***************
*** 20,68 ****

  /*
   * PartitionPruneContext
   *
!  * Information about a partitioned table needed to perform partition pruning.
   */
  typedef struct PartitionPruneContext
  {
!     /* Partition key information */
      char        strategy;
      int            partnatts;
-     Oid           *partopfamily;
-     Oid           *partopcintype;
-     Oid           *partcollation;
-     FmgrInfo   *partsupfunc;
-
-     /* Number of partitions */
      int            nparts;
-
-     /* Partition boundary info */
      PartitionBoundInfo boundinfo;
!
!     /*
!      * This will be set when the context is used from the executor, to allow
!      * Params to be evaluated.
!      */
      PlanState  *planstate;
-
-     /*
-      * Array of ExprStates, indexed as per PruneCtxStateIdx; one for each
-      * partkey in each pruning step.  Allocated if planstate is non-NULL,
-      * otherwise NULL.
-      */
      ExprState **exprstates;
-
-     /*
-      * Similar array of flags, each true if corresponding 'exprstate'
-      * expression contains any PARAM_EXEC Params.  (Can be NULL if planstate
-      * is NULL.)
-      */
      bool       *exprhasexecparam;
-
-     /* true if it's safe to evaluate PARAM_EXEC Params */
      bool        evalexecparams;
  } PartitionPruneContext;

  #define PruneCxtStateIdx(partnatts, step_id, keyno) \
      ((partnatts) * (step_id) + (keyno))

--- 20,76 ----

  /*
   * PartitionPruneContext
+  *        Stores information needed at runtime for pruning computations
+  *        related to a single partitioned table.
   *
!  * partrel            Relcache pointer for the partitioned table,
!  *                    if we have it open (else NULL).
!  * strategy            Partition strategy, e.g. LIST, RANGE, HASH.
!  * partnatts        Number of columns in the partition key.
!  * nparts            Number of partitions in this partitioned table.
!  * boundinfo        Partition boundary info for the partitioned table.
!  * partcollation    Array of partnatts elements, storing the collations of the
!  *                    partition key columns.
!  * partsupfunc        Array of FmgrInfos for the comparison or hashing functions
!  *                    associated with the partition keys (partnatts elements).
!  *                    (This points into the partrel's partition key, typically.)
!  * stepcmpfuncs        Array of FmgrInfos for the comparison or hashing function
!  *                    for each pruning step and partition key.
!  * ppccontext        Memory context holding this PartitionPruneContext's
!  *                    subsidiary data, such as the FmgrInfos.
!  * planstate        Points to the parent plan node's PlanState when called
!  *                    during execution; NULL when called from the planner.
!  * exprstates        Array of ExprStates, indexed as per PruneCtxStateIdx; one
!  *                    for each partition key in each pruning step.  Allocated if
!  *                    planstate is non-NULL, otherwise NULL.
!  * exprhasexecparam    Array of bools, each true if corresponding 'exprstate'
!  *                    expression contains any PARAM_EXEC Params.  (Can be NULL
!  *                    if planstate is NULL.)
!  * evalexecparams    True if it's safe to evaluate PARAM_EXEC Params.
   */
  typedef struct PartitionPruneContext
  {
!     Relation    partrel;
      char        strategy;
      int            partnatts;
      int            nparts;
      PartitionBoundInfo boundinfo;
!     Oid           *partcollation;
!     FmgrInfo   *partsupfunc;
!     FmgrInfo   *stepcmpfuncs;
!     MemoryContext ppccontext;
      PlanState  *planstate;
      ExprState **exprstates;
      bool       *exprhasexecparam;
      bool        evalexecparams;
  } PartitionPruneContext;

+ /*
+  * PruneCxtStateIdx() computes the correct index into the stepcmpfuncs[],
+  * exprstates[] and exprhasexecparam[] arrays for step step_id and
+  * partition key column keyno.  (Note: there is code that assumes the
+  * entries for a given step are sequential, so this is not chosen freely.)
+  */
  #define PruneCxtStateIdx(partnatts, step_id, keyno) \
      ((partnatts) * (step_id) + (keyno))


Re: why partition pruning doesn't work?

From
David Rowley
Date:
On 13 June 2018 at 16:15, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> It seems not to be that bad: we just need a shutdown call for the
> PartitionPruneState, and then we can remember the open relation there.
> The attached is based on David's patch from yesterday.
>
> I'm still a bit annoyed at the fmgr_info_copy calls in this.  It'd be
> better to use the FmgrInfos in the relcache when applicable.  However,
> mixing those with the cross-type ones would seem to require that we change
> the API for get_matching_hash_bounds et al from taking "FmgrInfo *" to
> taking "FmgrInfo **", which looks rather invasive.

I've looked over this and it seems better than mine. Especially so
that you've done things so that the FmgrInfo is copied into a memory
context that's not about to get reset.

One small thing is that I'd move the:

context.partrel = NULL;

to under:

/* These are not valid when being called from the planner */

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: why partition pruning doesn't work?

From
Robert Haas
Date:
On Wed, Jun 13, 2018 at 12:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> As for whether to change it at this point in the release cycle, I
>> guess that, to me, depends on how invasive the fix is.
>
> It seems not to be that bad: we just need a shutdown call for the
> PartitionPruneState, and then we can remember the open relation there.
> The attached is based on David's patch from yesterday.

Seems reasonable.  Really, I think we should look for a way to hang
onto the relation at the point where it's originally opened and locked
instead of reopening it here.  But that's probably more invasive than
we can really justify right at the moment, and I think this is a step
in a good direction.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: why partition pruning doesn't work?

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Seems reasonable.  Really, I think we should look for a way to hang
> onto the relation at the point where it's originally opened and locked
> instead of reopening it here.  But that's probably more invasive than
> we can really justify right at the moment, and I think this is a step
> in a good direction.

The existing coding there makes me itch a bit, because there's only a
rather fragile line of reasoning justifying the assumption that there is a
pre-existing lock at all.  So I'd be in favor of what you suggest just to
get rid of the "open(NoLock)" hazard.  But I agree that it'd be rather
invasive and right now is probably not the time for it.

            regards, tom lane


Re: why partition pruning doesn't work?

From
Tom Lane
Date:
David Rowley <david.rowley@2ndquadrant.com> writes:
> On 13 June 2018 at 16:15, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It seems not to be that bad: we just need a shutdown call for the
>> PartitionPruneState, and then we can remember the open relation there.
>> The attached is based on David's patch from yesterday.

> I've looked over this and it seems better than mine. Especially so
> that you've done things so that the FmgrInfo is copied into a memory
> context that's not about to get reset.

Pushed, thanks for looking it over.

> One small thing is that I'd move the:
> context.partrel = NULL;
> to under:
> /* These are not valid when being called from the planner */

Judgment call I guess.  I tend to initialize struct fields in field order
unless there's a good reason to do differently, just so it's easier to
confirm that none were overlooked.  But I can see the point of your
suggestion too, so done that way.

There's still one thing I'm a bit confused about here.  I noticed that
we weren't actually using the partopfamily and partopcintype fields in
PartitionPruneContext, so I removed them.  But that still leaves both
partsupfunc and partcollation as pointers into the relcache that were
subject to this hazard.  My testing agrees with lousyjack's results
that both of those were, in fact, being improperly accessed.  The OID
comparison effect I mentioned upthread explains why the buildfarm's
cache-clobbering members failed to notice any problem with garbage
partsupfunc data ... but why did we not see any complaints about invalid
collation OIDs?  Tis strange.

            regards, tom lane


Re: why partition pruning doesn't work?

From
Amit Langote
Date:
On 2018/06/13 23:39, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Seems reasonable.  Really, I think we should look for a way to hang
>> onto the relation at the point where it's originally opened and locked
>> instead of reopening it here.  But that's probably more invasive than
>> we can really justify right at the moment, and I think this is a step
>> in a good direction.
> 
> The existing coding there makes me itch a bit, because there's only a
> rather fragile line of reasoning justifying the assumption that there is a
> pre-existing lock at all.  So I'd be in favor of what you suggest just to
> get rid of the "open(NoLock)" hazard.  But I agree that it'd be rather
> invasive and right now is probably not the time for it.

I had sent a patch to try to get rid of the open(NoLock) there a couple of
months ago [1].  The idea was to both lock and open the relation in
ExecNonLeafAppendTables, which is the first time all partitioned tables in
a given Append node are locked for execution.  Also, the patch makes it a
responsibility of ExecEndAppend to release the relcache pins, so the
recently added ExecDestroyPartitionPruneState would not be needed.

Attached is a rebased version of that patch if there is interest in it.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/0b361a22-f995-e15c-a385-6d1b72dd0d13%40lab.ntt.co.jp

Attachment

Re: why partition pruning doesn't work?

From
David Rowley
Date:
On 14 June 2018 at 04:10, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> There's still one thing I'm a bit confused about here.  I noticed that
> we weren't actually using the partopfamily and partopcintype fields in
> PartitionPruneContext, so I removed them.  But that still leaves both
> partsupfunc and partcollation as pointers into the relcache that were
> subject to this hazard.  My testing agrees with lousyjack's results
> that both of those were, in fact, being improperly accessed.  The OID
> comparison effect I mentioned upthread explains why the buildfarm's
> cache-clobbering members failed to notice any problem with garbage
> partsupfunc data ... but why did we not see any complaints about invalid
> collation OIDs?  Tis strange.

FWIW It's not working for me before e23bae82cf3 with
CLOBBER_FREED_MEMORY, CLOBBER_CACHE_ALWAYS and RELCACHE_FORCE_RELEASE,
and:

create table listp (a text) partition by list(a);
create table listp1 partition of listp for values in ('1');
select * from listp where a = (select '1');

I get:

ERROR:  cache lookup failed for collation 2139062143

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: why partition pruning doesn't work?

From
David Rowley
Date:
On 14 June 2018 at 19:17, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> I had sent a patch to try to get rid of the open(NoLock) there a couple of
> months ago [1].  The idea was to both lock and open the relation in
> ExecNonLeafAppendTables, which is the first time all partitioned tables in
> a given Append node are locked for execution.  Also, the patch makes it a
> responsibility of ExecEndAppend to release the relcache pins, so the
> recently added ExecDestroyPartitionPruneState would not be needed.

Robert and I briefly discussed something more complete a while back.
Not much detail was talked about, but the basic idea was to store the
Relation somewhere in the planner and executor that we could look up by
rel index rather than having to relation_open all the time.

I just had a very quick look around and thought maybe RangeTblEntry
might be a good spot to store the Relation and current lock level that
we hold on that relation. This means that we can use
PlannerInfo->simple_rte_array in the planner and
EState->es_range_table in the executor. The latter of those would be
much nicer if we built an array instead of keeping the list (can
probably build that during InitPlan()).  We could then get hold of a
Relation object when needed without having to do relation_open(...,
NoLock).

Code which currently looks like:

reloid = getrelid(scanrelid, estate->es_range_table);
rel = heap_open(reloid, lockmode);

In places like ExecOpenScanRelation, could be replaced with some
wrapper function call like: rte_open_rel(estate, scanrelid, lockmode);

This could also be used to ERROR out if rte_open_rel() was done
initially with NoLock. Right now there are so many places that we do
relation_open(..., NoLock) and write a comment /* Lock already taken
by ... */, which we hope is always true.

If the Relation is already populated in the RangeTblEntry then the
function would just return that, otherwise, we'd just look inside the
RangeTblEntry for the relation Oid and do a heap_open on it, and store
the lockmode that's held. We could then also consider getting rid of a
bunch of fields like boundinfo and nparts from RelOptInfo.

We could also perhaps do a WARNING about lock upgrade hazards in
there, at least maybe in debug builds.

However, I only spent about 10 mins looking into this, there may be
some giant holes in the idea.  It would need much more research.

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: why partition pruning doesn't work?

From
Robert Haas
Date:
On Thu, Jun 14, 2018 at 7:23 AM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> However, I only spent about 10 mins looking into this, there may be
> some giant holes in the idea.  It would need much more research.

It kind of flies in the face of the idea that a RangeTblEntry is just
a node that can be freely copied around, serialized and deserialized,
etc.

I think it would be better to keep the pointer in the RelOptInfo in
the planner and in the EState or PlanState in the executor.  Those are
things we don't think can be copied, serialized, etc.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: why partition pruning doesn't work?

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jun 14, 2018 at 7:23 AM, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> However, I only spent about 10 mins looking into this, there may be
>> some giant holes in the idea.  It would need much more research.

> It kind of flies in the face of the idea that a RangeTblEntry is just
> a node that can be freely copied around, serialized and deserialized,
> etc.

And also the idea that the Plan tree is read-only to the executor,
which is not a good property to give up.

> I think it would be better to keep the pointer in the RelOptInfo in
> the planner and in the EState or PlanState in the executor.  Those are
> things we don't think can be copied, serialized, etc.

Yeah, RelOptInfo seems like the natural place in the planner; we might
need index relcache links in IndexOptInfo, too.

I'm less sure what to do in the executor.  We already do keep open
relation pointers in PlanStates; the problem is just that it's
node-type-specific (ss_currentRelation, iss_RelationDesc, etc).  Perhaps
that's unavoidable and we should just add more such fields as needed.

            regards, tom lane


Re: why partition pruning doesn't work?

From
Amit Langote
Date:
On 2018/06/14 23:40, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Thu, Jun 14, 2018 at 7:23 AM, David Rowley
>> <david.rowley@2ndquadrant.com> wrote:
>>> However, I only spent about 10 mins looking into this, there may be
>>> some giant holes in the idea.  It would need much more research.
> 
>> It kind of flies in the face of the idea that a RangeTblEntry is just
>> a node that can be freely copied around, serialized and deserialized,
>> etc.
> 
> And also the idea that the Plan tree is read-only to the executor,
> which is not a good property to give up.
> 
>> I think it would be better to keep the pointer in the RelOptInfo in
>> the planner and in the EState or PlanState in the executor.  Those are
>> things we don't think can be copied, serialized, etc.
> 
> Yeah, RelOptInfo seems like the natural place in the planner; we might
> need index relcache links in IndexOptInfo, too.
> 
> I'm less sure what to do in the executor.  We already do keep open
> relation pointers in PlanStates; the problem is just that it's
> node-type-specific (ss_currentRelation, iss_RelationDesc, etc).  Perhaps
> that's unavoidable and we should just add more such fields as needed.

The patch I mentioned upthread maintains an array of Relation pointers in
AppendState with as many members as there are in the partitioned_rels list
that appears in the corresponding Append plan.

I revised that patch a bit to rename ExecLockNonLeafAppendTables to
ExecOpenAppendPartitionedTables to sound consistent with
ExecOpenScanRelation et al.

Thanks,
Amit

Attachment

Re: why partition pruning doesn't work?

From
Amit Langote
Date:
On 2018/06/14 20:23, David Rowley wrote:
> On 14 June 2018 at 19:17, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I had sent a patch to try to get rid of the open(NoLock) there a couple of
>> months ago [1].  The idea was to both lock and open the relation in
>> ExecNonLeafAppendTables, which is the first time all partitioned tables in
>> a given Append node are locked for execution.  Also, the patch makes it a
>> responsibility of ExecEndAppend to release the relcache pins, so the
>> recently added ExecDestroyPartitionPruneState would not be needed.
> 
> Robert and I briefly discussed something more complete a while back.
> Not much detail was talked about, but the basic idea was to store the
> Relation somewhere in the planner an executor that we could lookup by
> rel index rather than having to relation_open all the time.
> 
> I just had a very quick look around and thought maybe RangeTblEntry
> might be a good spot to store the Relation and current lock level that
> we hold on that relation. This means that we can use
> PlannerInfo->simple_rte_array in the planner and
> EState->es_range_table in the executor. The latter of those would be
> much nicer if we built an array instead of keeping the list (can
> probably build that during InitPlan()).  We could then get hold of a
> Relation object when needed without having to do relation_open(...,
> NoLock).
> 
> Code which currently looks like:
> 
> reloid = getrelid(scanrelid, estate->es_range_table);
> rel = heap_open(reloid, lockmode);
> 
> In places like ExecOpenScanRelation, could be replaced with some
> wrapper function call like: rte_open_rel(estate, scanrelid, lockmode);
> 
> This could also be used to ERROR out if rte_open_rel() was done
> initially with NoLock. Right now there are so many places that we do
> relation_open(..., NoLock) and write a comment /* Lock already taken
> by ... */, which we hope is always true.
> 
> If the Relation is already populated in the RangeTblEntry then the
> function would just return that, otherwise, we'd just look inside the
> RangeTblEntry for the relation Oid and do a heap_open on it, and store
> the lockmode that's held. We could then also consider getting rid of a
> bunch of fields like boundinfo and nparts from RelOptInfo.
> 
> We could also perhaps do a WARNING about lock upgrade hazards in
> there, at least maybe in debug builds.
> 
> However, I only spent about 10 mins looking into this, there may be
> some giant holes in the idea.  It would need much more research.

Will also need to consider caching of plans, that is of PlannedStmts,
which in turn contain RangeTblEntry nodes.  The idea that we can open a
relation at the beginning or even before planning and keep using the
Relation pointer all the way to the end of execution of a query seems hard
to realize.  As others pointed out, we need to think of planning and
execution as separate phases and will have to open/close relations separately.

Thanks,
Amit



Re: why partition pruning doesn't work?

From
Andrew Dunstan
Date:

On 06/12/2018 07:47 AM, Andrew Dunstan wrote:
>
>
> On 06/11/2018 06:41 PM, Andrew Dunstan wrote:
>>
>>
>> On 06/11/2018 06:24 PM, Tom Lane wrote:
>>>
>>> If we had any buildfarm critters running valgrind on
>>> RELCACHE_FORCE_RELEASE or CLOBBER_CACHE_ALWAYS builds, they'd have
>>> detected use of uninitialized memory here ... but I don't think we have
>>> any.  (The combination of valgrind and CCA would probably be too 
>>> slow to
>>> be practical :-(, though maybe somebody with a fast machine could do
>>> the other thing.)
>>>
>>>
>>
>>
>> I don't have a fast machine, but I do have a slow machine already 
>> running valgrind and not doing much else :-) Let's see how lousyjack 
>> does with -DRELCACHE_FORCE_RELEASE
>>
>>
>
>
> It added about 20% to the run time. That's tolerable, so I'll leave it 
> on.
>
>


OK, lousyjack is back online with this, new and improved. It currently 
takes 7.5 hours for a run.  Should I also add -DCATCACHE_FORCE_RELEASE?

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: why partition pruning doesn't work?

From
Tom Lane
Date:
Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
> OK, lousyjack is back online with this, new and improved. It currently 
> takes 7.5 hours for a run.  Should I also add -DCATCACHE_FORCE_RELEASE?

I did some experimentation yesterday with valgrind plus both
RELCACHE_FORCE_RELEASE and CATCACHE_FORCE_RELEASE.  I didn't find
any new bugs, but adding CATCACHE_FORCE_RELEASE makes things quite
a lot slower :-(.  Not sure it's worth it.

            regards, tom lane


Re: why partition pruning doesn't work?

From
Andrew Dunstan
Date:

On 06/15/2018 10:02 AM, Tom Lane wrote:
> Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
>> OK, lousyjack is back online with this, new and improved. It currently
>> takes 7.5 hours for a run.  Should I also add -DCATCACHE_FORCE_RELEASE?
> I did some experimentation yesterday with valgrind plus both
> RELCACHE_FORCE_RELEASE and CATCACHE_FORCE_RELEASE.  I didn't find
> any new bugs, but adding CATCACHE_FORCE_RELEASE makes things quite
> a lot slower :-(.  Not sure it's worth it.
>
>             


OK.

cheers

andrew

-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: why partition pruning doesn't work?

From
Tom Lane
Date:
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
> [ 0001-Open-partitioned-tables-during-Append-initialization.patch ]

I took a look at this.  While I'm in agreement with the general idea of
holding open the partitioned relations' relcache entries throughout the
query, I do not like anything about this patch:

* It seems to be outright broken for the case that any of the partitioned
relations are listed in nonleafResultRelations.  If we're going to do it
like this, we have to open every one of the partrels regardless of that.
(I wonder whether we couldn't lose PlannedStmt.nonleafResultRelations
altogether, in favor of merging the related code in InitPlan with this.
That existing code is already a mighty ugly wart, and this patch makes
it worse by adding new, related warts elsewhere.)

* You've got *far* too much intimate knowledge of the possible callers
in ExecOpenAppendPartitionedTables.

Personally, what I would have this function do is return a List of
the opened Relation pointers, and add a matching function to run through
such a List and close the entries again, and make the callers responsible
for stashing the List pointer in an appropriate field in their planstate.
Or maybe what we should do is drop ExecLockNonLeafAppendTables/
ExecOpenAppendPartitionedTables entirely and teach InitPlan to do it.

            regards, tom lane


Re: why partition pruning doesn't work?

From
Amit Langote
Date:
On 2018/06/19 2:05, Tom Lane wrote:
> Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
>> [ 0001-Open-partitioned-tables-during-Append-initialization.patch ]
> 
> I took a look at this.  While I'm in agreement with the general idea of
> holding open the partitioned relations' relcache entries throughout the
> query, I do not like anything about this patch:

Thanks for taking a look at it and sorry about the delay in replying.

> * It seems to be outright broken for the case that any of the partitioned
> relations are listed in nonleafResultRelations.  If we're going to do it
> like this, we have to open every one of the partrels regardless of that.

Yes, that was indeed wrong.

> (I wonder whether we couldn't lose PlannedStmt.nonleafResultRelations
> altogether, in favor of merging the related code in InitPlan with this.

Hmm, PlannedStmt.nonleafResultRelations exists for the same reason that
PlannedStmt.resultRelations does, that is,

/*
 * initialize result relation stuff, and open/lock the result rels.
 *
 * We must do this before initializing the plan tree, else we might try to
 * do a lock upgrade if a result rel is also a source rel.
 */

nonleafResultRelations contains members of partitioned_rels lists of all
ModifyTable nodes contained in a plan.

> That existing code is already a mighty ugly wart, and this patch makes
> it worse by adding new, related warts elsewhere.)

I just realized that there is a thinko in the following piece of code in
ExecLockNonLeafAppendTables

        /* If this is a result relation, already locked in InitPlan */
        foreach(l, stmt->nonleafResultRelations)
        {
            if (rti == lfirst_int(l))
            {
                is_result_rel = true;
                break;
            }
        }

It should actually be:

        /* If this is a result relation, already locked in InitPlan */
        foreach(l, stmt->nonleafResultRelations)
        {
            Index   nonleaf_rti = lfirst_int(l);
            Oid     nonleaf_relid = getrelid(nonleaf_rti,
                                             estate->es_range_table);

            if (relid == nonleaf_relid)
            {
                is_result_rel = true;
                break;
            }
        }

RT indexes in, say, Append.partitioned_rels, are distinct from those in
PlannedStmt.nonleafResultRelations, so the existing test never succeeds,
as also evident from the coverage report:

https://coverage.postgresql.org/src/backend/executor/execUtils.c.gcov.html#864


I'm wondering if we couldn't just get rid of this code.  If an input
partitioned table is indeed also a result relation, then we would've
locked it in InitPlan with RowExclusiveLock and heap_opening it with a
weaker lock (RowShare/AccessShare) wouldn't hurt.

> * You've got *far* too much intimate knowledge of the possible callers
> in ExecOpenAppendPartitionedTables.
> 
> Personally, what I would have this function do is return a List of
> the opened Relation pointers, and add a matching function to run through
> such a List and close the entries again, and make the callers responsible
> for stashing the List pointer in an appropriate field in their planstate.

OK, I rewrote it to work that way.

> Or maybe what we should do is drop ExecLockNonLeafAppendTables/
> ExecOpenAppendPartitionedTables entirely and teach InitPlan to do it.

Hmm, for InitPlan to do what ExecOpenAppendPartitionedTables does, we'd
have to have all the partitioned tables (contained in partitioned_rels
fields of all Append/MergeAppend/ModifyTable nodes of a plan) also listed
in a global list like rootResultRelations and nonleafResultRelations of
PlannedStmt.

Attached updated patch.

Thanks,
Amit

Attachment

Re: why partition pruning doesn't work?

From
Tom Lane
Date:
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
> On 2018/06/19 2:05, Tom Lane wrote:
>> Or maybe what we should do is drop ExecLockNonLeafAppendTables/
>> ExecOpenAppendPartitionedTables entirely and teach InitPlan to do it.

> Hmm, for InitPlan to do what ExecOpenAppendPartitionedTables does, we'd
> have to have all the partitioned tables (contained in partitioned_rels
> fields of all Append/MergeAppend/ModifyTable nodes of a plan) also listed
> in a global list like rootResultRelations and nonleafResultRelations of
> PlannedStmt.

> Attached updated patch.

I've been looking at this patch, and while it's not unreasonable on its
own terms, I'm growing increasingly distressed at the ad-hoc and rather
duplicative nature of the data structures that have gotten stuck into
plan trees thanks to partitioning (the rootResultRelations and
nonleafResultRelations lists being prime examples).

It struck me this morning that a whole lot of the complication here is
solely due to needing to identify the right type of relation lock to take
during executor startup, and that THAT WORK IS TOTALLY USELESS.  In every
case, we must already hold a suitable lock before we ever get to the
executor; either one acquired during the parse/plan pipeline, or one
re-acquired by AcquireExecutorLocks in the case of a cached plan.
Otherwise it's entirely possible that the plan has been invalidated by
concurrent DDL --- and it's not the executor's job to detect that and
re-plan; that *must* have been done upstream.

Moreover, it's important from a deadlock-avoidance standpoint that these
locks get acquired in the same order as they were acquired during the
initial parse/plan pipeline.  I think it's reasonable to assume they were
acquired in RTE order in that stage, so AcquireExecutorLocks is OK.  But,
if the current logic in the executor gets them in that order, it's both
non-obvious that it does so and horribly fragile if it does, seeing that
the responsibility for this is split across InitPlan,
ExecOpenScanRelation, and ExecLockNonLeafAppendTables.

So I'm thinking that what we really ought to do here is simplify executor
startup to just open all rels with NoLock, and get rid of any supporting
data structures that turn out to have no other use.  (David Rowley's
nearby patch to create a properly hierarchical executor data structure for
partitioning info is likely to tie into this too, by removing some other
vestigial uses of those lists.)

I think that this idea has been discussed in the past, and we felt at
the time that having the executor take its own locks was a good safety
measure, and a relatively cheap one since the lock manager is pretty
good at short-circuiting duplicative lock requests.  But those are
certainly not free.  Moreover, I'm not sure that this is really a
safety measure at all: if the executor were really taking any lock
not already held, it'd be masking a DDL hazard.
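For concreteness, that short-circuit can be modelled in a few lines of
standalone C (the flat table and names here are illustrative stand-ins, not
PostgreSQL's actual lock manager): a repeat request for a mode already held
just bumps a per-backend count and reports ALREADY_HELD, never touching the
shared lock table -- cheap, but still a lookup per call.  The bool-returning
LockRelationOid in the hack below maps directly onto it:

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int Oid;
typedef int LOCKMODE;
typedef enum { LOCKACQUIRE_OK, LOCKACQUIRE_ALREADY_HELD } LockAcquireResult;

/* Toy per-backend local lock table: (relid, mode) -> hold count. */
#define MAX_LOCAL_LOCKS 64
static struct { Oid relid; LOCKMODE mode; int count; } locallocks[MAX_LOCAL_LOCKS];
static int nlocallocks = 0;

static LockAcquireResult lock_acquire(Oid relid, LOCKMODE mode)
{
    for (int i = 0; i < nlocallocks; i++)
        if (locallocks[i].relid == relid && locallocks[i].mode == mode)
        {
            locallocks[i].count++;      /* fast path: shared table untouched */
            return LOCKACQUIRE_ALREADY_HELD;
        }
    /* slow path: the real code would go to the shared lock table here */
    locallocks[nlocallocks].relid = relid;
    locallocks[nlocallocks].mode = mode;
    locallocks[nlocallocks].count = 1;
    nlocallocks++;
    return LOCKACQUIRE_OK;
}

/* Report whether this call took a genuinely new lock. */
static bool lock_relation_oid(Oid relid, LOCKMODE mode)
{
    return lock_acquire(relid, mode) != LOCKACQUIRE_ALREADY_HELD;
}
```

Under the argument above, every executor-startup call of lock_relation_oid
ought to come back false (except in parallel workers), which is exactly what
the assertions in the hack check.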

To investigate this further, I made the attached not-meant-for-commit
hack to verify whether InitPlan and related executor startup functions
were actually taking any not-previously-held locks.  I could only find
one such case: the parser always opens tables selected FOR UPDATE with
RowShareLock, but if we end up implementing the resulting row mark
with ROW_MARK_COPY, the executor is acquiring just AccessShareLock
(because ExecOpenScanRelation thinks it needs to acquire some lock).
The patch as presented hacks ExecOpenScanRelation to avoid that, and
it passes check-world.

What we'd be better off doing, if we go this route, is to install an
assertion-build-only test that verifies during relation_open(NoLock)
that some kind of lock is already held on the rel.  That would protect
not only the executor, but a boatload of existing places that open
rels with NoLock on the currently-unverified assumption that a lock is
already held.
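A standalone sketch of what such an assertion could look like (the helper
names and the flat held-locks array are hypothetical; the real check would
live inside relation_open and consult the backend's local lock table):

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int Oid;
typedef int LOCKMODE;
#define NoLock           0
#define AccessShareLock  1      /* weakest table lock */
#define RowShareLock     2
#define RowExclusiveLock 3

/* Toy record of the locks this backend currently holds. */
#define MAX_HELD 64
static struct { Oid relid; LOCKMODE mode; } held[MAX_HELD];
static int nheld = 0;

static void lock_relation_oid(Oid relid, LOCKMODE mode)
{
    held[nheld].relid = relid;
    held[nheld].mode = mode;
    nheld++;
}

/* True if we hold a lock on relid of the given mode, or stronger. */
static bool check_relation_locked_by_me(Oid relid, LOCKMODE mode, bool orstronger)
{
    for (int i = 0; i < nheld; i++)
        if (held[i].relid == relid &&
            (held[i].mode == mode || (orstronger && held[i].mode > mode)))
            return true;
    return false;
}

/* relation_open under the proposed rule: NoLock is only legal when some
 * lock is already held; otherwise take the requested lock. */
static void relation_open_sketch(Oid relid, LOCKMODE lockmode)
{
    if (lockmode == NoLock)
        assert(check_relation_locked_by_me(relid, AccessShareLock, true));
    else
        lock_relation_oid(relid, lockmode);
    /* ... fetch/build the relcache entry ... */
}
```

In an assert-enabled build that would catch, at the first NoLock open, any
code path that forgot to lock upstream.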

I'm also rather strongly tempted to add a field to RangeTblEntry
that records the appropriate lock strength to take, so that we don't
have to rely on keeping AcquireExecutorLocks' logic to decide on the
lock type in sync with whatever the parse/plan pipeline does.  (One
could then imagine adding assertions in the executor that this field
shows a lock strength of at least X, in place of actually opening
the rel with X.)
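Roughly like this (a hypothetical field, not anything in the tree today;
the lock-level constants mirror storage/lockdefs.h, which orders table lock
modes by strength so a plain integer comparison expresses "at least X"):

```c
#include <assert.h>

typedef unsigned int Oid;

/* Table lock levels, ordered by strength as in storage/lockdefs.h. */
#define AccessShareLock  1
#define RowShareLock     2
#define RowExclusiveLock 3

/* Hypothetical extra field on RangeTblEntry: the lock strength the
 * parse/plan pipeline decided on and acquired for this rel. */
typedef struct RangeTblEntry
{
    Oid relid;          /* OID of the relation */
    int rellockmode;    /* lock level taken upstream */
    /* ... the many other RTE fields elided ... */
} RangeTblEntry;

/* AcquireExecutorLocks can then simply replay rte->rellockmode, and the
 * executor can assert instead of locking: */
static void exec_assert_locked(const RangeTblEntry *rte, int minmode)
{
    assert(rte->rellockmode >= minmode);
}
```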

BTW, there'd be a lot to be said for having InitPlan just open all
the rels and build an array of Relation pointers that parallels the
RTE list, rather than doing heap_opens in random places elsewhere.
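The shape of that would be something like the following standalone sketch
(names modelled loosely on executor conventions but entirely illustrative):
an array in the EState parallel to the range table, filled lazily so each
rel is opened at most once, and closed in exactly one place at ExecutorEnd:

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned int Oid;
typedef struct RelationData { Oid relid; } RelationData;
typedef RelationData *Relation;

typedef struct EState
{
    int       es_range_table_size;
    Oid      *es_rte_relids;   /* relid per range-table entry (1-based rti) */
    Relation *es_relations;    /* opened rels, NULL until first use */
} EState;

static int nopens = 0;          /* counts real opens, for illustration */

static Relation heap_open_nolock(Oid relid)
{
    Relation rel = malloc(sizeof(RelationData));
    rel->relid = relid;
    nopens++;
    return rel;
}

/* Open-once accessor: everyone asks the EState, nobody heap_opens directly. */
static Relation exec_get_range_table_relation(EState *estate, int rti)
{
    assert(rti >= 1 && rti <= estate->es_range_table_size);
    if (estate->es_relations[rti - 1] == NULL)
        estate->es_relations[rti - 1] =
            heap_open_nolock(estate->es_rte_relids[rti - 1]);
    return estate->es_relations[rti - 1];
}

/* One place to close everything, at ExecutorEnd. */
static void executor_end_close_rels(EState *estate)
{
    for (int i = 0; i < estate->es_range_table_size; i++)
        if (estate->es_relations[i] != NULL)
        {
            free(estate->es_relations[i]);   /* stands in for heap_close */
            estate->es_relations[i] = NULL;
        }
}
```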

            regards, tom lane

diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 8026fe2..58c62f3 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -817,6 +817,7 @@ InitPlan(QueryDesc *queryDesc, int eflags)
     TupleDesc    tupType;
     ListCell   *l;
     int            i;
+    bool        gotlock;

     /*
      * Do permissions checks
@@ -852,7 +853,9 @@ InitPlan(QueryDesc *queryDesc, int eflags)
             Relation    resultRelation;

             resultRelationOid = getrelid(resultRelationIndex, rangeTable);
-            resultRelation = heap_open(resultRelationOid, RowExclusiveLock);
+            gotlock = LockRelationOid(resultRelationOid, RowExclusiveLock);
+            Assert(!gotlock || IsParallelWorker());
+            resultRelation = heap_open(resultRelationOid, NoLock);

             InitResultRelInfo(resultRelInfo,
                               resultRelation,
@@ -892,7 +895,9 @@ InitPlan(QueryDesc *queryDesc, int eflags)
                 Relation    resultRelDesc;

                 resultRelOid = getrelid(resultRelIndex, rangeTable);
-                resultRelDesc = heap_open(resultRelOid, RowExclusiveLock);
+                gotlock = LockRelationOid(resultRelOid, RowExclusiveLock);
+                Assert(!gotlock || IsParallelWorker());
+                resultRelDesc = heap_open(resultRelOid, NoLock);
                 InitResultRelInfo(resultRelInfo,
                                   resultRelDesc,
                                   lfirst_int(l),
@@ -912,8 +917,11 @@ InitPlan(QueryDesc *queryDesc, int eflags)
                 /* We locked the roots above. */
                 if (!list_member_int(plannedstmt->rootResultRelations,
                                      resultRelIndex))
-                    LockRelationOid(getrelid(resultRelIndex, rangeTable),
-                                    RowExclusiveLock);
+                {
+                    gotlock = LockRelationOid(getrelid(resultRelIndex, rangeTable),
+                                              RowExclusiveLock);
+                    Assert(!gotlock || IsParallelWorker());
+                }
             }
         }
     }
@@ -963,10 +971,14 @@ InitPlan(QueryDesc *queryDesc, int eflags)
             case ROW_MARK_NOKEYEXCLUSIVE:
             case ROW_MARK_SHARE:
             case ROW_MARK_KEYSHARE:
-                relation = heap_open(relid, RowShareLock);
+                gotlock = LockRelationOid(relid, RowShareLock);
+                Assert(!gotlock || IsParallelWorker());
+                relation = heap_open(relid, NoLock);
                 break;
             case ROW_MARK_REFERENCE:
-                relation = heap_open(relid, AccessShareLock);
+                gotlock = LockRelationOid(relid, AccessShareLock);
+                Assert(!gotlock || IsParallelWorker());
+                relation = heap_open(relid, NoLock);
                 break;
             case ROW_MARK_COPY:
                 /* no physical table access is required */
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index b963cae..cf08b50 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -42,6 +42,7 @@

 #include "postgres.h"

+#include "access/parallel.h"
 #include "access/relscan.h"
 #include "access/transam.h"
 #include "executor/executor.h"
@@ -645,6 +646,7 @@ ExecOpenScanRelation(EState *estate, Index scanrelid, int eflags)
     Relation    rel;
     Oid            reloid;
     LOCKMODE    lockmode;
+    bool        gotlock;

     /*
      * Determine the lock type we need.  First, scan to see if target relation
@@ -659,13 +661,19 @@ ExecOpenScanRelation(EState *estate, Index scanrelid, int eflags)
         /* Keep this check in sync with InitPlan! */
         ExecRowMark *erm = ExecFindRowMark(estate, scanrelid, true);

-        if (erm != NULL && erm->relation != NULL)
+        /* HACK: assume things are OK for ROW_MARK_COPY case */
+        if (erm != NULL)
             lockmode = NoLock;
     }

     /* Open the relation and acquire lock as needed */
     reloid = getrelid(scanrelid, estate->es_range_table);
-    rel = heap_open(reloid, lockmode);
+    if (lockmode != NoLock)
+    {
+        gotlock = LockRelationOid(reloid, lockmode);
+        Assert(!gotlock || IsParallelWorker());
+    }
+    rel = heap_open(reloid, NoLock);

     /*
      * Complain if we're attempting a scan of an unscannable relation, except
@@ -874,6 +882,7 @@ ExecLockNonLeafAppendTables(List *partitioned_rels, EState *estate)
         Index        rti = lfirst_int(lc);
         bool        is_result_rel = false;
         Oid            relid = getrelid(rti, estate->es_range_table);
+        bool        gotlock;

         /* If this is a result relation, already locked in InitPlan */
         foreach(l, stmt->nonleafResultRelations)
@@ -903,9 +912,10 @@ ExecLockNonLeafAppendTables(List *partitioned_rels, EState *estate)
             }

             if (rc && RowMarkRequiresRowShareLock(rc->markType))
-                LockRelationOid(relid, RowShareLock);
+                gotlock = LockRelationOid(relid, RowShareLock);
             else
-                LockRelationOid(relid, AccessShareLock);
+                gotlock = LockRelationOid(relid, AccessShareLock);
+            Assert(!gotlock || IsParallelWorker());
         }
     }
 }
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index 7b2dcb6..a1ebf64 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -100,8 +100,9 @@ SetLocktagRelationOid(LOCKTAG *tag, Oid relid)
  *
  * Lock a relation given only its OID.  This should generally be used
  * before attempting to open the relation's relcache entry.
+ * Return TRUE if we acquired a new lock, FALSE if already held.
  */
-void
+bool
 LockRelationOid(Oid relid, LOCKMODE lockmode)
 {
     LOCKTAG        tag;
@@ -122,7 +123,11 @@ LockRelationOid(Oid relid, LOCKMODE lockmode)
      * CommandCounterIncrement, not here.)
      */
     if (res != LOCKACQUIRE_ALREADY_HELD)
+    {
         AcceptInvalidationMessages();
+        return true;
+    }
+    return false;
 }

 /*
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index a217de9..69e6f7f 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -37,7 +37,7 @@ typedef enum XLTW_Oper
 extern void RelationInitLockInfo(Relation relation);

 /* Lock a relation */
-extern void LockRelationOid(Oid relid, LOCKMODE lockmode);
+extern bool LockRelationOid(Oid relid, LOCKMODE lockmode);
 extern bool ConditionalLockRelationOid(Oid relid, LOCKMODE lockmode);
 extern void UnlockRelationId(LockRelId *relid, LOCKMODE lockmode);
 extern void UnlockRelationOid(Oid relid, LOCKMODE lockmode);

Re: why partition pruning doesn't work?

From
Robert Haas
Date:
On Sun, Jul 15, 2018 at 1:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> What we'd be better off doing, if we go this route, is to install an
> assertion-build-only test that verifies during relation_open(NoLock)
> that some kind of lock is already held on the rel.  That would protect
> not only the executor, but a boatload of existing places that open
> rels with NoLock on the currently-unverified assumption that a lock is
> already held.

+1.  In fact, maybe we ought to go a little further and have a
relation_reopen(oid, mode) that verifies that a lock in the specified
mode is held.

And then maybe we ought to go even further and start trying to get rid
of all the places where we reopen already-opened relations.  A
distressing number of new patches add more places that do that, and
while I try to push back on those, I think they are proliferating, and
I think that they are not free.  Granted, a hash table lookup is
pretty cheap, but if you do a sufficient number of them in
commonly-taken code paths, it's got to cost something.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: why partition pruning doesn't work?

From
David Rowley
Date:
On 16 July 2018 at 12:12, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Jul 15, 2018 at 1:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> What we'd be better off doing, if we go this route, is to install an
>> assertion-build-only test that verifies during relation_open(NoLock)
>> that some kind of lock is already held on the rel.  That would protect
>> not only the executor, but a boatload of existing places that open
>> rels with NoLock on the currently-unverified assumption that a lock is
>> already held.
>
> +1.  In fact, maybe we ought to go a little further and have a
> relation_reopen(oid, mode) that verifies that a lock in the specified
> mode is held.

Wouldn't it be better to just store the Relation indexed by its relid
somewhere the first time we opened it? Then just do a direct array
lookup on that rather than looking up by hashtable in syscache?

-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: why partition pruning doesn't work?

From
Tom Lane
Date:
David Rowley <david.rowley@2ndquadrant.com> writes:
> On 16 July 2018 at 12:12, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sun, Jul 15, 2018 at 1:02 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> What we'd be better off doing, if we go this route, is to install an
>>> assertion-build-only test that verifies during relation_open(NoLock)
>>> that some kind of lock is already held on the rel.

> Wouldn't it be better to just store the Relation indexed by its relid
> somewhere the first time we opened it? Then just do a direct array
> lookup on that rather than looking up by hashtable in syscache?

That would require carrying said array clear through from the parser to
the executor, plus we'd have some fun keeping it in sync with the RTE
changes that happen in the rewriter and planner, plus it's not entirely
clear where we'd close those relations in cases where the generated
plan isn't fed to the executor immediately (or gets copied in any way).
I don't think it's worth going that far.

I *do* think it might be worth behaving like that within the executor
by itself, and said so upthread.  In that situation, we have a clear
place to do relation closes during ExecutorEnd, so it's not as messy
as a more general approach would be.

            regards, tom lane


Re: why partition pruning doesn't work?

From
Robert Haas
Date:
On Sun, Jul 15, 2018 at 8:24 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
>> +1.  In fact, maybe we ought to go a little further and have a
>> relation_reopen(oid, mode) that verifies that a lock in the specified
>> mode is held.
>
> Wouldn't it be better to just store the Relation indexed by its relid
> somewhere the first time we opened it? Then just do a direct array
> lookup on that rather than looking up by hashtable in syscache?

Yes, that would be better, IMHO anyway.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: why partition pruning doesn't work?

From
Amit Langote
Date:
On 2018/07/16 2:02, Tom Lane wrote:
> Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
>> On 2018/06/19 2:05, Tom Lane wrote:
>>> Or maybe what we should do is drop ExecLockNonLeafAppendTables/
>>> ExecOpenAppendPartitionedTables entirely and teach InitPlan to do it.
> 
>> Hmm, for InitPlan to do what ExecOpenAppendPartitionedTables does, we'd
>> have to have all the partitioned tables (contained in partitioned_rels
>> fields of all Append/MergeAppend/ModifyTable nodes of a plan) also listed
>> in a global list like rootResultRelations and nonleafResultRelations of
>> PlannedStmt.
> 
>> Attached updated patch.
> 
> I've been looking at this patch, and while it's not unreasonable on its
> own terms, I'm growing increasingly distressed at the ad-hoc and rather
> duplicative nature of the data structures that have gotten stuck into
> plan trees thanks to partitioning (the rootResultRelations and
> nonleafResultRelations lists being prime examples).
> 
> It struck me this morning that a whole lot of the complication here is
> solely due to needing to identify the right type of relation lock to take
> during executor startup, and that THAT WORK IS TOTALLY USELESS.

I agree.  IIUC, that's also the reason why PlannedStmt had to grow a
resultRelations field in the first place.

> In every
> case, we must already hold a suitable lock before we ever get to the
> executor; either one acquired during the parse/plan pipeline, or one
> re-acquired by AcquireExecutorLocks in the case of a cached plan.
> Otherwise it's entirely possible that the plan has been invalidated by
> concurrent DDL --- and it's not the executor's job to detect that and
> re-plan; that *must* have been done upstream.
> 
> Moreover, it's important from a deadlock-avoidance standpoint that these
> locks get acquired in the same order as they were acquired during the
> initial parse/plan pipeline.  I think it's reasonable to assume they were
> acquired in RTE order in that stage, so AcquireExecutorLocks is OK.  But,
> if the current logic in the executor gets them in that order, it's both
> non-obvious that it does so and horribly fragile if it does, seeing that
> the responsibility for this is split across InitPlan,
> ExecOpenScanRelation, and ExecLockNonLeafAppendTables.
> 
> So I'm thinking that what we really ought to do here is simplify executor
> startup to just open all rels with NoLock, and get rid of any supporting
> data structures that turn out to have no other use.  (David Rowley's
> nearby patch to create a properly hierarchical executor data structure for
> partitioning info is likely to tie into this too, by removing some other
> vestigial uses of those lists.)

I agree it would be nice to be able to get rid of those lists.

> I think that this idea has been discussed in the past, and we felt at
> the time that having the executor take its own locks was a good safety
> measure, and a relatively cheap one since the lock manager is pretty
> good at short-circuiting duplicative lock requests.  But those are
> certainly not free.  Moreover, I'm not sure that this is really a
> safety measure at all: if the executor were really taking any lock
> not already held, it'd be masking a DDL hazard.
> 
> To investigate this further, I made the attached not-meant-for-commit
> hack to verify whether InitPlan and related executor startup functions
> were actually taking any not-previously-held locks.  I could only find
> one such case: the parser always opens tables selected FOR UPDATE with
> RowShareLock, but if we end up implementing the resulting row mark
> with ROW_MARK_COPY, the executor is acquiring just AccessShareLock
> (because ExecOpenScanRelation thinks it needs to acquire some lock).
> The patch as presented hacks ExecOpenScanRelation to avoid that, and
> it passes check-world.
> 
> What we'd be better off doing, if we go this route, is to install an
> assertion-build-only test that verifies during relation_open(NoLock)
> that some kind of lock is already held on the rel.  That would protect
> not only the executor, but a boatload of existing places that open
> rels with NoLock on the currently-unverified assumption that a lock is
> already held.

+1

> I'm also rather strongly tempted to add a field to RangeTblEntry
> that records the appropriate lock strength to take, so that we don't
> have to rely on keeping AcquireExecutorLocks' logic to decide on the
> lock type in sync with whatever the parse/plan pipeline does.  (One
> could then imagine adding assertions in the executor that this field
> shows a lock strength of at least X, in place of actually opening
> the rel with X.)
> 
> BTW, there'd be a lot to be said for having InitPlan just open all
> the rels and build an array of Relation pointers that parallels the
> RTE list, rather than doing heap_opens in random places elsewhere.

+1 to this.  Actually I had written such a patch in place of the one I
posted on this thread, but wasn't confident enough that I had found and
fixed all the places in the executor that would use the Relation pointer
from there instead of fetching it themselves.

Thanks,
Amit



Re: why partition pruning doesn't work?

From
Tom Lane
Date:
Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> writes:
> On 2018/07/16 2:02, Tom Lane wrote:
>> BTW, there'd be a lot to be said for having InitPlan just open all
>> the rels and build an array of Relation pointers that parallels the
>> RTE list, rather than doing heap_opens in random places elsewhere.

> +1 to this.  Actually I had written such a patch in place of the one I
> posted on this thread, but wasn't confident enough that I had found and
> fixed all the places in the executor that would use the Relation pointer
> from there instead of fetching it themselves.

Well, it doesn't have to be a 100% conversion, especially not to start
out with.  As long as we ensure that common code paths through the
executor have only one open/close per rel, I don't think it matters
too much if there are some corner cases with more.

Also, tracking down places like that wouldn't be that hard to do.  One
could instrument the relcache to report situations where a rel's refcount
gets above 1.  (The infrastructure for logging backtraces that's being
discussed in other threads would help here.)  We might decide that
particular cases weren't worth fixing --- eg if the extra open is
happening in code that doesn't have easy access to the EState --- but
I doubt that finding them is impractical.

            regards, tom lane