Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series) - Mailing list pgsql-hackers

From Andy Fan
Subject Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)
Date
Msg-id CAKU4AWoP6aJ+4GHzXm=aCFwiO2o_4Q0dpxw58ad0qwnskT=9NA@mail.gmail.com
Whole thread Raw
In response to Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)  (Andy Fan <zhihui.fan1213@gmail.com>)
Responses Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)
List pgsql-hackers
Hi:

I have finished the parts for baserel, joinrel, subquery, distinctrel. I think
the hardest ones have been verified.  Here is the design overview.

1. Use EC instead of expr to cover more UniqueKey cases.
2. Redesign the UniqueKey as below:

@@ -246,6 +246,7 @@ struct PlannerInfo

List   *eq_classes; /* list of active EquivalenceClasses */
+ List   *unique_exprs; /* List of unique expr */

  bool ec_merging_done; /* set true once ECs are canonical */

+typedef struct UniqueKey
+{
+ NodeTag type;
+ Bitmapset *unique_expr_indexes;
+ bool multi_nulls;
+} UniqueKey;
+

PlannerInfo.unique_exprs is a List of unique exprs.  Unique Exprs is a set of
EquivalenceClass. for example:

CREATE TABLE T1(A INT NOT NULL, B INT NOT NULL, C INT,  pk INT primary key);
CREATE UNIQUE INDEX ON t1(a, b);

SELECT DISTINCT * FROM T1 WHERE a = c;

Then we would have PlannerInfo.unique_exprs as below
[
[EC(a, c), EC(b)],
[EC(pk)]
]

RelOptInfo(t1) would have 2 UniqueKeys.
UniqueKey1 {unique_expr_indexes=bms{0}, multinull=false]
UniqueKey2 {unique_expr_indexes=bms{1}, multinull=false]

The design will benefit many table joins cases. For instance a 10- tables join,
each table has a primary key (a, b).  Then we would have a UniqueKey like
this.

JoinRel{1,2,3,4} -  {t1.a, t1.b, t2.a, t2.b, t3.a, t3.b t4.a t4.b}
JoinRel{1,2,3,4,5} -  {t1.a, t1.b, t2.a, t2.b, t3.a, t3.b t4.a t4.b t5.a t5.b}

When more tables are joined, the list would be longer and longer, build the list
consumes both CPU cycles and memory.

With the method as above, we can store it as:

root->unique_exprs =  /* All the UniqueKey is stored once */
[
[t1.a, t1.b], -- EC is ignored in document.
[t2.a, t2.b],
[t3.a, t3.b],
[t4.a, t4.b],
[t5.a, t5.b],
[t6.a, t6.b],
[t7.a, t7.b],
[t8.a, t8.b],
[t9.a, t9.b],
[t10.a, t10.b],
]

JoinRel{1,2,3,4} -  Bitmapset{0,1,2,3} -- one bitmapword.
JoinRel{1,2,3,4,5} -  Bitmapset{0,1,2,3,4} -- one bitmapword.

The member in the bitmap is the index of root->unique_exprs rather than
root->eq_classes because we need to store the SingleRow case in
root->unqiue_exprs as well.

3. Define a new SingleRow node and use it in joinrel as well.

+typedef struct SingleRow
+{
+ NodeTag type;
+ Index relid;
+} SingleRow;

SELECT * FROM t1, t2 WHERE t2.pk = 1;

root->unique_exprs
[
[t1.a, t1.b],
SingleRow{relid=2}
]

JoinRel{t1} - Bitmapset{0}
JoinRel{t2} - Bitmapset{1}

JoinRelt{1, 2} Bitmapset{0, 1} -- SingleRow will never be expanded to dedicated
exprs.

4. Interesting UniqueKey to remove the Useless UniqueKey as soon as possible.

The overall idea is similar with PathKey, I distinguish the UniqueKey for
distinct purpose and useful_for_merging purpose.

SELECT DISTINCT pk FROM  t; -- OK, maintain it all the time, just like
root->query_pathkey.

SELECT DISTINCT t2.c FROM t1, t2 WHERE t1.d = t2.pk; -- T2's UniqueKey PK is
use before t1 join t2, but not useful after it.

Currently the known issue I didn't pass the "interesting UniqueKey" info to
subquery well [2], I'd like to talk more about this when we discuss about
UnqiueKey on subquery part.


5. relation_is_distinct_for

Now I design the function as

+ bool
+ relation_is_distinct_for(PlannerInfo *root, RelOptInfo *rel, List
  *distinct_pathkey)

It is "List *distinct_pathkey", rather than "List *exprs". The reason pathkey
has EC in it, and all the UniqueKey has EC as well. if so, the subset-like
checking is very effective.  As for the distinct/group as no-op case, we have
pathkey all the time. The only drawback of it is some clauses are not-sortable,
in this case, the root->distinct_pathkey and root->group_pathkey is not
set. However it should be rare by practice, so I ignore this part for
now. After all, I can have a relation_is_disticnt_for version for Exprs. I just
not implemented it so far.

6. EC overhead in UnqiueKey & UNION case.

Until now I didn't create any new EC for the UniqueKey case, I just used the
existing ones. However I can't handle the case like

SELECT a, b FROM t1
UNION
SELECT x, y FROM t2;

In this case, there is no EC created with existing code. and I don't want to
create them for the UniqueKey case as well.  so my plan is just not to handle
the case for UNION.

Since we need some big effort from the reviewer, I split the patch into many
smaller chunks.

Patch 1 / Patch 2:  I just split some code which UniqueKey uses but not very
interrelated. Splitting them out to reduce the core patch size.
Patch 3:  still the notnull stuff.  This one doesn't play a big role overall,
even if the design is changed at last, we can just modify some small stuff. IMO,
I don't think it is a blocker issue to move on.
Patch 4:  Support the UniqueKey for baserel.
Patch 5: Support the UniqueKey for joinrel.
Patch 6: Support the UniqueKey for some upper relation, like distinctrel,
groupby rel.

I'd suggest moving on like this:
1. Have an overall review to see if any outstanding issues the patch has.
2. If not, we can review and commit patch 1 & patch 2 to reduce the patch size.
3. Decide which method to take for not null stuff. IMO Tom's method
would be a bit
   complex and the benefit is not very clear to me[1].  So the choices
   include: a). move on UniqueKey stuff until Tom's method is ready. b). Move
   on the UniqueKey with my notnull way, and changes to Tom's method when
   necessary. I prefer method b).
4. Review & Commit the UniqueKey for BaseRel part.
5. Review & Commit the UniqueKey for JoinRel part.
6. Review & Commit the UniqueKey for SubQuery part *without* the Interesting
   UniqueKey well handled.
7. Review & Commit the UniqueKey for SubQuery part *with* the Interesting
   UniqueKey well handled.
8. Discuss about the UniqueKey on partitioned rel case and develop/review/commit
   it.
9. Apply UniqueKey stuff on more user cases rather than DISTINCT.

What do you think about this?

[1] https://www.postgresql.org/message-id/CAApHDvo5b2pYX%2BdbPy%2BysGBSarezRSfXthX32TZNFm0wPdfKGQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAKU4AWo6-%3D9mg3UQ5UJhGCMw6wyTPyPGgV5oh6dFvwEN%3D%2Bhb_w%40mail.gmail.com


Thanks

Attachment

pgsql-hackers by date:

Previous
From: Ranier Vilela
Date:
Subject: Re: CI/windows docker vs "am a service" autodetection on windows
Next
From: Justin Pryzby
Date:
Subject: Re: prion failed with ERROR: missing chunk number 0 for toast value 14334 in pg_toast_2619