Re: Trying to pull up EXPR SubLinks - Mailing list pgsql-hackers

From Andy Fan
Subject Re: Trying to pull up EXPR SubLinks
Date
Msg-id CAKU4AWqLB5FbsvZG1uQ8cSMXF-qcHmO9zQHHFw35J6CaqYAGrQ@mail.gmail.com
Whole thread Raw
In response to Re: Trying to pull up EXPR SubLinks  (Richard Guo <guofenglinux@gmail.com>)
Responses Re: Trying to pull up EXPR SubLinks
Re: Trying to pull up EXPR SubLinks
List pgsql-hackers
Actually I have a different opinion to handle this issue,  to execute the
a > (select avg(a) from tinner where x = touer.x);  The drawback of current 
path is because it may calculates the same touer.x value multi-times. So
if we cache the values we have calculated before, we can avoid the cost. 
Material path may be the one we can reference but it assumes all the tuples
in the tuplestore matches the input params, which is not the fact here. 

But what if the input params doesn't change?  If so we can use Material path
to optimize this case.  But since we don't know if the if the input params changed
or not during plan time,  we just add the path (let's assume we can add it with some
rules or cost calculation).  If the input params is not changed, we use the cached
values,  if the input params changed,  we can ReScan the Material node.  To optimize
the the cache invalidation frequent issue like (1, 2, 1, 2, 1, 2) case, we may consider
a sort path to change the input values to (1, 1, 1, 2, 2, 2).  But overall it is a big effort.

As a independent small optimization maybe if the input params doesn't change, we
can use the tuples in the Material node again. Suppose it will not demage our current
framework if we can add the material path by either rules based or cost based. 

Suppose we have the following data:

demo=#  select * from j1 limit 10;
 i  | im5 | im100 | im1000
----+-----+-------+--------
  1 |   1 |     1 |      1
  2 |   2 |     2 |      2
  3 |   3 |     3 |      3
  4 |   4 |     4 |      4
  5 |   0 |     5 |      5
  6 |   1 |     6 |      6
  7 |   2 |     7 |      7
  8 |   3 |     8 |      8
  9 |   4 |     9 |      9
 10 |   0 |    10 |     10
(10 rows)

totally we have j1 = 10,000,002 rows, the extra 2 rows because we have 3 rows for i=1 
demo=#  select * from j1 where i = 1;
 i | im5 | im100 | im1000
---+-----+-------+--------
 1 |   1 |     1 |      1
 1 |   1 |     1 |      1
 1 |   1 |     1 |      1
(3 rows)

Then  select * from j1 j1o where im5 = (select avg(im5) from j1 where im5 = j1o.im5) and i = 1;
will hit our above optimizations. The plan is

                  QUERY PLAN
-----------------------------------------------
 Index Scan using j1_idx1 on j1 j1o
   Index Cond: (i = 1)
   Filter: ((im5)::numeric < (SubPlan 1))
   SubPlan 1
     ->  Materialize
           ->  Aggregate
                 ->  Seq Scan on j1
                       Filter: (im5 = j1o.im5)
(8 rows)

and the Aggregate is just executed once (execution time dropped from 8.x s 
to 2.6s). 

----
The attached is a very PoC patch,  but it can represent my idea for 
current discuss, Some notes about the implementation. 

1.  We need to check if the input params is really not changed.  Currently I just
comment it out for quick test. 

-               planstate->chgParam = bms_add_member(planstate->chgParam, paramid);
+               // planstate->chgParam = bms_add_member(planstate->chgParam, paramid);

Looks we have a lot of places to add a params 
to chgParam without checking the actual value. The place I found this case is
during ExecNestLoop.  So we may need a handy and efficient way to do the
check for all the places. However it is not a must for current case

2.   I probably misunderstand the the usage of MaterialState->eflags.  since I don't
know why the eflag need to be checked ExecMaterial.  and I have to remove it to 
let my PoC work. 

-       if (tuplestorestate == NULL && node->eflags != 0)
+       if (tuplestorestate == NULL)


3.  I added the material path in a very hacked way, the if check  just to make 
sure it take effect on my test statement only.  If you want to test this patch locally,
you need to change the oid for your case. 

+       if (linitial_node(RangeTblEntry, root->parse->rtable)->relid == 25634)
+               best_path = (Path *) create_material_path(final_rel, best_path);

But when we take this action to production case, how to cost this strategy is 
challenge since it can neither reduce the total_cost nor result in a new PathKey.
I will check other place to see how this kind can be added.


Best Regards
Andy Fan


Attachment

pgsql-hackers by date:

Previous
From: Atsushi Torikoshi
Date:
Subject: Re: [Proposal] Add accumulated statistics for wait event
Next
From: Prabhat Sahu
Date:
Subject: Re: [Proposal] Global temporary tables