[PATCH] Keeps tracking the uniqueness with UniqueKey - Mailing list pgsql-hackers
From | Andy Fan |
---|---|
Subject | [PATCH] Keeps tracking the uniqueness with UniqueKey |
Date | |
Msg-id | CAKU4AWrwZMAL=uaFUDMf4WGOVkEL3ONbatqju9nSXTUucpp_pw@mail.gmail.com Whole thread Raw |
Responses |
Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
|
List | pgsql-hackers |
Greetings.
This thread is a follow-up thread for [1], where I submit a patch for erasing the
A new field named uniquekeys was added in RelOptInfo struct, which is a
positions is a list of the sequence no. of the exprs in the current RelOptInfo,
This thread is a follow-up thread for [1], where I submit a patch for erasing the
distinct node if we have known the data is unique for sure. But since the
implementation has changed a lot from the beginning and they are not very
related, so I start this new thread to discuss the new strategy to save the time
of reviewers.
As I said above, my original intention is just used to erase the distinct clause,
As I said above, my original intention is just used to erase the distinct clause,
then Tom Lane suggested function query_is_distinct_for, I found the uniqueness
can be used for costing, remove_useless_join, reduce_unqiue_semijoins.
David suggested to maintain the uniqueness from bottom to top, like join
& subqueries, group-by, distinct, union and so on(we call it as UniqueKey).
Ideally the uniqueness will be be lost in any case. This current implementation
follows the David's suggestion and also thanks Ashutosh who reminded me
the cost should be ok while I had concerns of this at the beginning.
A new field named uniquekeys was added in RelOptInfo struct, which is a
list of UniqueKey struct.
typedef struct UniqueKey
{
NodeTag type;
List *exprs;
List *positions;
bool grantee;
} UniqueKey;
exprs is a list of exprs which is unique if we don't care about the null vaues on
typedef struct UniqueKey
{
NodeTag type;
List *exprs;
List *positions;
bool grantee;
} UniqueKey;
exprs is a list of exprs which is unique if we don't care about the null vaues on
current RelOptInfo.
positions is a list of the sequence no. of the exprs in the current RelOptInfo,
which is used for SubQuery. like
create table t1 (a int primary key, b int);
create table t2 (a int primary key, b int);
select .. from t1, (select b from t2 group by t2) t2 ..;
The UniqueKey for the subquery will be Var(varno=1, varattno=2), but for the
create table t1 (a int primary key, b int);
create table t2 (a int primary key, b int);
select .. from t1, (select b from t2 group by t2) t2 ..;
The UniqueKey for the subquery will be Var(varno=1, varattno=2), but for the
top query, the UniqueKey of t2 should be Var(varno=2, varattrno=1), the 1 here
need to be calculated by UnqiueKey->positions.
Some known issues needs attentions:
1. I didn't check the collation at the whole stage, one reason is the relation_has_unique_index_for
Kindly waiting for your feedback, Thanks you!
grantee field is introduced mainly for remove_useless_join & reduce_unique_semijions.
Take the above case for example:
-- b is nullable. so select b from t2 still can result in duplicated rows.
create unique index t2_uk_b on t2(b);
-- b is nullable. so select b from t2 still can result in duplicated rows.
create unique index t2_uk_b on t2(b);
-- the left join still can be removed since t2.b is a unique index and the nullable
doesn't matter here.
select t1.* from t1 left join t2 on (t1.b = t2.b);
so t2.b will still be an UniqueKey for t2, just that the grantee = false.
A branch of functions like populate_xxx_unqiuekeys for manipulating uniquekeys
so t2.b will still be an UniqueKey for t2, just that the grantee = false.
A branch of functions like populate_xxx_unqiuekeys for manipulating uniquekeys
for a lot of cases, xxx maybe baserel, joinrel, paritioned table, unionrel, groupbyrel,
distincrel and so on. partitioned table has some not obviously troubles due to
users can create index on the childrel directly and differently. You can check
the comments of the code for details.
When maintaining the uniquekeys of joinrel, we have a rule that if both rels have
When maintaining the uniquekeys of joinrel, we have a rule that if both rels have
UniqueKeys, then any combination from the 2 sides is a unqiquekey of the joinrel.
I used two algorithms to keep the length of the UniqueKeys short. One is we only
add useful UniqueKey to the RelOptInfo.uniquekeys. If the expr isn't shown in
rel->reltargets->exprs, it will not be used for others, so we can ignore it safely.
The another one is if column sets A is unqiuekey already, any superset of A
will no need to be added as an UnqiueKey.
The overall cost of the maintaining unqiuekeys should be ok. If you check the code,
The overall cost of the maintaining unqiuekeys should be ok. If you check the code,
you may find there are many 2 or 3 levels foreach, but most of them are started with
unique index, and I used UnqiueKeyContext and SubqueryUnqiueKeyContext in joinrel
and subquery case to avoid too many loops.
Now I have used the UnqiueKey to erase the unnecessary distinct/group by, and also changed
Now I have used the UnqiueKey to erase the unnecessary distinct/group by, and also changed
the rel_is_distinct_for to use UnqiueKeys. so we can handle more cases.
create table m1 (a int primary key, b int, c int);
create table m2 (a int primary key, b int, c int);
create table m3 (a int primary key, b int, c int);
Wit the current patch, we can get:
task3=# explain select t1.a from m3 t1 left join (select m1.a from m1, m2 where m1.b = m2.a limit 1) t2 on (t1.a = t2.a);
QUERY PLAN
---------------------------------------------------------
Seq Scan on m3 t1 (cost=0.00..32.60 rows=2260 width=4)
Before the patch, we will get:
postgres=# explain select t1.a from m3 t1 left join (select m1.a from m1, m2 where m1.b = m2.a limit 1) t2 on (t1.a = t2.a)
postgres-# ;
QUERY PLAN
-----------------------------------------------------------------------------------------------
Hash Left Join (cost=0.39..41.47 rows=2260 width=4)
Hash Cond: (t1.a = m1.a)
-> Seq Scan on m3 t1 (cost=0.00..32.60 rows=2260 width=4)
-> Hash (cost=0.37..0.37 rows=1 width=4)
-> Limit (cost=0.15..0.36 rows=1 width=4)
-> Nested Loop (cost=0.15..470.41 rows=2260 width=4)
-> Seq Scan on m1 (cost=0.00..32.60 rows=2260 width=8)
-> Index Only Scan using m2_pkey on m2 (cost=0.15..0.19 rows=1 width=4)
Index Cond: (a = m1.b)
The "limit 1" here is just want to avoid the pull_up_subquery to pull up the subquery,
create table m1 (a int primary key, b int, c int);
create table m2 (a int primary key, b int, c int);
create table m3 (a int primary key, b int, c int);
Wit the current patch, we can get:
task3=# explain select t1.a from m3 t1 left join (select m1.a from m1, m2 where m1.b = m2.a limit 1) t2 on (t1.a = t2.a);
QUERY PLAN
---------------------------------------------------------
Seq Scan on m3 t1 (cost=0.00..32.60 rows=2260 width=4)
Before the patch, we will get:
postgres=# explain select t1.a from m3 t1 left join (select m1.a from m1, m2 where m1.b = m2.a limit 1) t2 on (t1.a = t2.a)
postgres-# ;
QUERY PLAN
-----------------------------------------------------------------------------------------------
Hash Left Join (cost=0.39..41.47 rows=2260 width=4)
Hash Cond: (t1.a = m1.a)
-> Seq Scan on m3 t1 (cost=0.00..32.60 rows=2260 width=4)
-> Hash (cost=0.37..0.37 rows=1 width=4)
-> Limit (cost=0.15..0.36 rows=1 width=4)
-> Nested Loop (cost=0.15..470.41 rows=2260 width=4)
-> Seq Scan on m1 (cost=0.00..32.60 rows=2260 width=8)
-> Index Only Scan using m2_pkey on m2 (cost=0.15..0.19 rows=1 width=4)
Index Cond: (a = m1.b)
The "limit 1" here is just want to avoid the pull_up_subquery to pull up the subquery,
I think we may still have opportunities to improve this further if we check if we can
remove a join *just before we join 2 relations*. we may have the similar situation
for reduce_unique_semijions joins. After the changes has been done, we can remove
the "limit 1" here to show the diffidence. I didn't include this change in current patch
since I think the effort may be not small and I want to keep this patch simple.
Some known issues needs attentions:
1. I didn't check the collation at the whole stage, one reason is the relation_has_unique_index_for
doesn't check it as well. The other reason if a in collation A is unique, and then A in collation B is
unique as well, we can ignore it. [2]
2. Current test case contrib/postgres_fdw/sql/postgres_fdw.sql is still failed. I am not sure if
the bug is in my patch or not.
Kindly waiting for your feedback, Thanks you!
Best regards
Andy Fan
Attachment
pgsql-hackers by date: