Re: Trying to pull up EXPR SubLinks - Mailing list pgsql-hackers

From Dilip Kumar
Subject Re: Trying to pull up EXPR SubLinks
Date
Msg-id CAFiTN-u8Zm5hpt05W89PccaASH68-fN_Aj7pRhg9qgnnGEn54Q@mail.gmail.com
Whole thread Raw
In response to Re: Trying to pull up EXPR SubLinks  (Andy Fan <zhihui.fan1213@gmail.com>)
Responses Re: Trying to pull up EXPR SubLinks
List pgsql-hackers
On Fri, Apr 24, 2020 at 8:56 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
> Actually I have a different opinion to handle this issue,  to execute the
> a > (select avg(a) from tinner where x = touer.x);  The drawback of current
> path is because it may calculates the same touer.x value multi-times. So
> if we cache the values we have calculated before, we can avoid the cost.
> Material path may be the one we can reference but it assumes all the tuples
> in the tuplestore matches the input params, which is not the fact here.
>
> But what if the input params doesn't change?  If so we can use Material path
> to optimize this case.  But since we don't know if the if the input params changed
> or not during plan time,  we just add the path (let's assume we can add it with some
> rules or cost calculation).  If the input params is not changed, we use the cached
> values,  if the input params changed,  we can ReScan the Material node.  To optimize
> the the cache invalidation frequent issue like (1, 2, 1, 2, 1, 2) case, we may consider
> a sort path to change the input values to (1, 1, 1, 2, 2, 2).  But overall it is a big effort.
>
> As a independent small optimization maybe if the input params doesn't change, we
> can use the tuples in the Material node again. Suppose it will not demage our current
> framework if we can add the material path by either rules based or cost based.
>
> Suppose we have the following data:
>
> demo=#  select * from j1 limit 10;
>  i  | im5 | im100 | im1000
> ----+-----+-------+--------
>   1 |   1 |     1 |      1
>   2 |   2 |     2 |      2
>   3 |   3 |     3 |      3
>   4 |   4 |     4 |      4
>   5 |   0 |     5 |      5
>   6 |   1 |     6 |      6
>   7 |   2 |     7 |      7
>   8 |   3 |     8 |      8
>   9 |   4 |     9 |      9
>  10 |   0 |    10 |     10
> (10 rows)
>
> totally we have j1 = 10,000,002 rows, the extra 2 rows because we have 3 rows for i=1
> demo=#  select * from j1 where i = 1;
>  i | im5 | im100 | im1000
> ---+-----+-------+--------
>  1 |   1 |     1 |      1
>  1 |   1 |     1 |      1
>  1 |   1 |     1 |      1
> (3 rows)
>
> Then  select * from j1 j1o where im5 = (select avg(im5) from j1 where im5 = j1o.im5) and i = 1;
> will hit our above optimizations. The plan is
>
>                   QUERY PLAN
> -----------------------------------------------
>  Index Scan using j1_idx1 on j1 j1o
>    Index Cond: (i = 1)
>    Filter: ((im5)::numeric < (SubPlan 1))
>    SubPlan 1
>      ->  Materialize
>            ->  Aggregate
>                  ->  Seq Scan on j1
>                        Filter: (im5 = j1o.im5)
> (8 rows)
>
> and the Aggregate is just executed once (execution time dropped from 8.x s
> to 2.6s).
>
> ----
> The attached is a very PoC patch,  but it can represent my idea for
> current discuss, Some notes about the implementation.
>
> 1.  We need to check if the input params is really not changed.  Currently I just
> comment it out for quick test.
>
> -               planstate->chgParam = bms_add_member(planstate->chgParam, paramid);
> +               // planstate->chgParam = bms_add_member(planstate->chgParam, paramid);
>
> Looks we have a lot of places to add a params
> to chgParam without checking the actual value. The place I found this case is
> during ExecNestLoop.  So we may need a handy and efficient way to do the
> check for all the places. However it is not a must for current case
>
> 2.   I probably misunderstand the the usage of MaterialState->eflags.  since I don't
> know why the eflag need to be checked ExecMaterial.  and I have to remove it to
> let my PoC work.
>
> -       if (tuplestorestate == NULL && node->eflags != 0)
> +       if (tuplestorestate == NULL)
>
>
> 3.  I added the material path in a very hacked way, the if check  just to make
> sure it take effect on my test statement only.  If you want to test this patch locally,
> you need to change the oid for your case.
>
> +       if (linitial_node(RangeTblEntry, root->parse->rtable)->relid == 25634)
> +               best_path = (Path *) create_material_path(final_rel, best_path);

Can we just directly add the material path on top of the best path?  I
mean there are possibilities that we might not get any benefit of the
material because there is no duplicate from the outer node but we are
paying the cost of materialization right?   The correct idea would be
that we should select this based on the cost comparison.  Basically,
we can consider how many duplicates we have from the outer table
variable no?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Amit Langote
Date:
Subject: Re: Parallel Append can break run-time partition pruning
Next
From: Juan José Santamaría Flecha
Date:
Subject: Re: PG compilation error with Visual Studio 2015/2017/2019