Home > mailing lists

Thread: [PATCH] Keeps tracking the uniqueness with UniqueKey

[PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

23 March 2020, 10:21:38

Greetings.

This thread is a follow-up thread for [1], where I submit a patch for erasing the

distinct node if we have known the data is unique for sure. But since the

implementation has changed a lot from the beginning and they are not very

related, so I start this new thread to discuss the new strategy to save the time

of reviewers.

As I said above, my original intention is just used to erase the distinct clause,

then Tom Lane suggested function query_is_distinct_for, I found the uniqueness

can be used for costing, remove_useless_join, reduce_unqiue_semijoins.

David suggested to maintain the uniqueness from bottom to top, like join

& subqueries, group-by, distinct, union and so on(we call it as UniqueKey).

Ideally the uniqueness will be be lost in any case. This current implementation

follows the David's suggestion and also thanks Ashutosh who reminded me

the cost should be ok while I had concerns of this at the beginning.

A new field named uniquekeys was added in RelOptInfo struct, which is a

list of UniqueKey struct.

typedef struct UniqueKey
{
NodeTag type;
List *exprs;
List *positions;
bool grantee;
} UniqueKey;

exprs is a list of exprs which is unique if we don't care about the null vaues on

current RelOptInfo.

positions is a list of the sequence no. of the exprs in the current RelOptInfo,

which is used for SubQuery. like

create table t1 (a int primary key, b int);
create table t2 (a int primary key, b int);
select .. from t1, (select b from t2 group by t2) t2 ..;

The UniqueKey for the subquery will be Var(varno=1, varattno=2), but for the

top query, the UniqueKey of t2 should be Var(varno=2, varattrno=1), the 1 here

need to be calculated by UnqiueKey->positions.

grantee field is introduced mainly for remove_useless_join & reduce_unique_semijions.

Take the above case for example:

-- b is nullable. so select b from t2 still can result in duplicated rows.
create unique index t2_uk_b on t2(b);

-- the left join still can be removed since t2.b is a unique index and the nullable

doesn't matter here.

select t1.* from t1 left join t2 on (t1.b = t2.b);

so t2.b will still be an UniqueKey for t2, just that the grantee = false.

A branch of functions like populate_xxx_unqiuekeys for manipulating uniquekeys

for a lot of cases, xxx maybe baserel, joinrel, paritioned table, unionrel, groupbyrel,

distincrel and so on. partitioned table has some not obviously troubles due to

users can create index on the childrel directly and differently. You can check

the comments of the code for details.

When maintaining the uniquekeys of joinrel, we have a rule that if both rels have

UniqueKeys, then any combination from the 2 sides is a unqiquekey of the joinrel.

I used two algorithms to keep the length of the UniqueKeys short. One is we only

add useful UniqueKey to the RelOptInfo.uniquekeys. If the expr isn't shown in

rel->reltargets->exprs, it will not be used for others, so we can ignore it safely.

The another one is if column sets A is unqiuekey already, any superset of A

will no need to be added as an UnqiueKey.

The overall cost of the maintaining unqiuekeys should be ok. If you check the code,

you may find there are many 2 or 3 levels foreach, but most of them are started with

unique index, and I used UnqiueKeyContext and SubqueryUnqiueKeyContext in joinrel

and subquery case to avoid too many loops.

Now I have used the UnqiueKey to erase the unnecessary distinct/group by, and also changed

the rel_is_distinct_for to use UnqiueKeys. so we can handle more cases.

create table m1 (a int primary key, b int, c int);
create table m2 (a int primary key, b int, c int);
create table m3 (a int primary key, b int, c int);

Wit the current patch, we can get:
task3=# explain select t1.a from m3 t1 left join (select m1.a from m1, m2 where m1.b = m2.a limit 1) t2 on (t1.a = t2.a);
QUERY PLAN
---------------------------------------------------------
Seq Scan on m3 t1 (cost=0.00..32.60 rows=2260 width=4)

Before the patch, we will get:
postgres=# explain select t1.a from m3 t1 left join (select m1.a from m1, m2 where m1.b = m2.a limit 1) t2 on (t1.a = t2.a)
postgres-# ;
QUERY PLAN
-----------------------------------------------------------------------------------------------
Hash Left Join (cost=0.39..41.47 rows=2260 width=4)
Hash Cond: (t1.a = m1.a)
-> Seq Scan on m3 t1 (cost=0.00..32.60 rows=2260 width=4)
-> Hash (cost=0.37..0.37 rows=1 width=4)
-> Limit (cost=0.15..0.36 rows=1 width=4)
-> Nested Loop (cost=0.15..470.41 rows=2260 width=4)
-> Seq Scan on m1 (cost=0.00..32.60 rows=2260 width=8)
-> Index Only Scan using m2_pkey on m2 (cost=0.15..0.19 rows=1 width=4)
Index Cond: (a = m1.b)

The "limit 1" here is just want to avoid the pull_up_subquery to pull up the subquery,

I think we may still have opportunities to improve this further if we check if we can

remove a join *just before we join 2 relations*. we may have the similar situation

for reduce_unique_semijions joins. After the changes has been done, we can remove

the "limit 1" here to show the diffidence. I didn't include this change in current patch

since I think the effort may be not small and I want to keep this patch simple.

Some known issues needs attentions:
1. I didn't check the collation at the whole stage, one reason is the relation_has_unique_index_for

doesn't check it as well. The other reason if a in collation A is unique, and then A in collation B is

unique as well, we can ignore it. [2]

2. Current test case contrib/postgres_fdw/sql/postgres_fdw.sql is still failed. I am not sure if

the bug is in my patch or not.

Kindly waiting for your feedback, Thanks you!

[1] https://www.postgresql.org/message-id/flat/CAKU4AWqOORqW900O-%2BL4L2%2B0xknsEqpfcs9FF7SeiO9TmpeZOg%40mail.gmail.com#f5d97cc66b9cd330add2fbb004a4d107

[2] https://www.postgresql.org/message-id/CAKU4AWqOORqW900O-%2BL4L2%2B0xknsEqpfcs9FF7SeiO9TmpeZOg%40mail.gmail.com

Best regards

Andy Fan

Attachment

v1-0001-Maintain-the-uniqueness-of-a-Query-from-bottom-to.patch

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

25 March 2020, 14:24:42

On Mon, Mar 23, 2020 at 6:21 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:

Greetings.

This thread is a follow-up thread for [1], where I submit a patch for erasing the
distinct node if we have known the data is unique for sure. But since the
implementation has changed a lot from the beginning and they are not very
related, so I start this new thread to discuss the new strategy to save the time
of reviewers.

As I said above, my original intention is just used to erase the distinct clause,
then Tom Lane suggested function query_is_distinct_for, I found the uniqueness
can be used for costing, remove_useless_join, reduce_unqiue_semijoins.
David suggested to maintain the uniqueness from bottom to top, like join
& subqueries, group-by, distinct, union and so on(we call it as UniqueKey).
Ideally the uniqueness will be be lost in any case. This current implementation
follows the David's suggestion and also thanks Ashutosh who reminded me
the cost should be ok while I had concerns of this at the beginning.

A new field named uniquekeys was added in RelOptInfo struct, which is a
list of UniqueKey struct.

typedef struct UniqueKey
{
NodeTag type;
List *exprs;
List *positions;
bool grantee;
} UniqueKey;

exprs is a list of exprs which is unique if we don't care about the null vaues on
current RelOptInfo.

positions is a list of the sequence no. of the exprs in the current RelOptInfo,
which is used for SubQuery. like

create table t1 (a int primary key, b int);
create table t2 (a int primary key, b int);
select .. from t1, (select b from t2 group by t2) t2 ..;

The UniqueKey for the subquery will be Var(varno=1, varattno=2), but for the
top query, the UniqueKey of t2 should be Var(varno=2, varattrno=1), the 1 here
need to be calculated by UnqiueKey->positions.

grantee field is introduced mainly for remove_useless_join & reduce_unique_semijions.
Take the above case for example:

-- b is nullable. so select b from t2 still can result in duplicated rows.
create unique index t2_uk_b on t2(b);

-- the left join still can be removed since t2.b is a unique index and the nullable
doesn't matter here.
select t1.* from t1 left join t2 on (t1.b = t2.b);

so t2.b will still be an UniqueKey for t2, just that the grantee = false.

A branch of functions like populate_xxx_unqiuekeys for manipulating uniquekeys
for a lot of cases, xxx maybe baserel, joinrel, paritioned table, unionrel, groupbyrel,
distincrel and so on. partitioned table has some not obviously troubles due to
users can create index on the childrel directly and differently. You can check
the comments of the code for details.

When maintaining the uniquekeys of joinrel, we have a rule that if both rels have
UniqueKeys, then any combination from the 2 sides is a unqiquekey of the joinrel.
I used two algorithms to keep the length of the UniqueKeys short. One is we only
add useful UniqueKey to the RelOptInfo.uniquekeys. If the expr isn't shown in
rel->reltargets->exprs, it will not be used for others, so we can ignore it safely.
The another one is if column sets A is unqiuekey already, any superset of A
will no need to be added as an UnqiueKey.

The overall cost of the maintaining unqiuekeys should be ok. If you check the code,
you may find there are many 2 or 3 levels foreach, but most of them are started with
unique index, and I used UnqiueKeyContext and SubqueryUnqiueKeyContext in joinrel
and subquery case to avoid too many loops.

Now I have used the UnqiueKey to erase the unnecessary distinct/group by, and also changed
the rel_is_distinct_for to use UnqiueKeys. so we can handle more cases.

create table m1 (a int primary key, b int, c int);
create table m2 (a int primary key, b int, c int);
create table m3 (a int primary key, b int, c int);

Wit the current patch, we can get:
task3=# explain select t1.a from m3 t1 left join (select m1.a from m1, m2 where m1.b = m2.a limit 1) t2 on (t1.a = t2.a);
QUERY PLAN
---------------------------------------------------------
Seq Scan on m3 t1 (cost=0.00..32.60 rows=2260 width=4)

Before the patch, we will get:
postgres=# explain select t1.a from m3 t1 left join (select m1.a from m1, m2 where m1.b = m2.a limit 1) t2 on (t1.a = t2.a)
postgres-# ;
QUERY PLAN
-----------------------------------------------------------------------------------------------
Hash Left Join (cost=0.39..41.47 rows=2260 width=4)
Hash Cond: (t1.a = m1.a)
-> Seq Scan on m3 t1 (cost=0.00..32.60 rows=2260 width=4)
-> Hash (cost=0.37..0.37 rows=1 width=4)
-> Limit (cost=0.15..0.36 rows=1 width=4)
-> Nested Loop (cost=0.15..470.41 rows=2260 width=4)
-> Seq Scan on m1 (cost=0.00..32.60 rows=2260 width=8)
-> Index Only Scan using m2_pkey on m2 (cost=0.15..0.19 rows=1 width=4)
Index Cond: (a = m1.b)

The "limit 1" here is just want to avoid the pull_up_subquery to pull up the subquery,
I think we may still have opportunities to improve this further if we check if we can
remove a join *just before we join 2 relations*. we may have the similar situation
for reduce_unique_semijions joins. After the changes has been done, we can remove
the "limit 1" here to show the diffidence. I didn't include this change in current patch
since I think the effort may be not small and I want to keep this patch simple.

Some known issues needs attentions:
1. I didn't check the collation at the whole stage, one reason is the relation_has_unique_index_for
doesn't check it as well. The other reason if a in collation A is unique, and then A in collation B is
unique as well, we can ignore it. [2]
2. Current test case contrib/postgres_fdw/sql/postgres_fdw.sql is still failed. I am not sure if
the bug is in my patch or not.

Kindly waiting for your feedback, Thanks you!

[1] https://www.postgresql.org/message-id/flat/CAKU4AWqOORqW900O-%2BL4L2%2B0xknsEqpfcs9FF7SeiO9TmpeZOg%40mail.gmail.com#f5d97cc66b9cd330add2fbb004a4d107
[2] https://www.postgresql.org/message-id/CAKU4AWqOORqW900O-%2BL4L2%2B0xknsEqpfcs9FF7SeiO9TmpeZOg%40mail.gmail.com

Just update the patch which do some test case changes.

1. add "ANALYZE" command before running the explain.

2. order by with an explicit collate settings.

3. As for the postgres_fdw.sql, I just copied the results.out to expected.out,

that's should be correct based on the result. However I added my comment

around that.

Now suppose the cbfot should pass this time.

Best Regards.

Andy Fan

Attachment

v2-0001-Maintain-the-uniqueness-of-a-Query-from-bottom-to.patch

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

26 March 2020, 01:55:35

Because I replied the old thread, cfbot run a test based on the old patch

on that thread. I have detached the old thread from commitfest. Reply this

email again to wake up Mr. cfbot with the right information.

Attachment

v2-0001-Maintain-the-uniqueness-of-a-Query-from-bottom-to.patch

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

29 March 2020, 07:48:42

Just update the patch which do some test case changes.
1. add "ANALYZE" command before running the explain.
2. order by with an explicit collate settings.

Thanks Rushabh for pointing this out, or else I'd spend much more time to figure

out why I get a different order on Windows.

3. As for the postgres_fdw.sql, I just copied the results.out to expected.out,
that's should be correct based on the result. However I added my comment
around that.

The issue doesn't exist at all. The confusion was introduced by a misunderstanding

of the test case (I treated count (xx) filter (xxx) as a window function rather than an aggration

function). so just fixed the it cleanly.

Some other changes made in the new patch:

1. Fixed bug for UniqueKey calculation for OUTER join.

2. Fixed some typo error in comments.

3. Renamed the field "grantee" as "guarantee".

Best Regards

Andy Fan

Attachment

v3-0001-Maintain-UniqueKey-at-each-RelOptInfo-this-inform.patch

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

David Rowley

Date:

31 March 2020, 01:43:52

On Sun, 29 Mar 2020 at 20:50, Andy Fan <zhihui.fan1213@gmail.com> wrote:
> Some other changes made in the new patch:
> 1.   Fixed bug for UniqueKey calculation for OUTER join.
> 2.   Fixed some typo error in comments.
> 3.   Renamed the field "grantee" as "guarantee".

I've had a look over this patch. Thank for you doing further work on it.

I've noted down the following during my read of the code:

1. There seem to be some cases where joins are no longer being
detected as unique. This is evident in postgres_fdw.out. We shouldn't
be regressing any of these cases.

2. The following change does not seem like it should be part of this
patch.  I understand you perhaps have done as you think it will
improve the performance of checking if an expression is in a list of
expressions.

- COMPARE_SCALAR_FIELD(varno);
+ /* Compare varattno first since it has higher selectivity than varno */
  COMPARE_SCALAR_FIELD(varattno);
+ COMPARE_SCALAR_FIELD(varno);

If you think that is true, then please do it as a separate effort and
provide benchmarks with your findings.

3. list_all_members_in. I think this would be better named as
list_is_subset. Please follow the lead of bms_is_subset().
Additionally, you should Assert that IsPointerList is true as there's
nothing else to indicate that it can't be used for an int or Oid list.

4. guarantee is not a very good name for the field in UniqueKey.
Maybe something like is_not_null?

5. I think you should be performing a bms_del_member during join
removal rather than removing this Assert()

- Assert(bms_equal(rel->relids, root->all_baserels));

FWIW, it's far from perfect that you've needed to delay the left join
removal, but I do understand why you've done it. It's also far from
perfect that you're including removed relations in the
total_table_pages calculation. c6e4133fae1 took some measures to
improve this calculation and this is making it worse again.

6. Can you explain why you moved the populate_baserel_uniquekeys()
call out of set_plain_rel_size()?

7. I don't think the removal of rel_supports_distinctness() is
warranted.  Is it not ok to check if the relation has any uniquekeys?
It's possible, particularly in join_is_removable that this can save
quite a large amount of effort.

8. Your spelling of unique is incorrect in many places:

src/backend/nodes/makefuncs.c: * makeUnqiueKey
src/backend/optimizer/path/uniquekeys.c:static List
*initililze_unqiuecontext_for_joinrel(RelOptInfo *joinrel,
src/backend/optimizer/path/uniquekeys.c: * check if combination of
unqiuekeys from both side is still useful for us,
src/backend/optimizer/path/uniquekeys.c:        outerrel_uniquekey_ctx
= initililze_unqiuecontext_for_joinrel(joinrel, outerrel);
src/backend/optimizer/path/uniquekeys.c:        innerrel_uniquekey_ctx
= initililze_unqiuecontext_for_joinrel(joinrel, innerrel);
src/backend/optimizer/path/uniquekeys.c: * we need to convert the
UnqiueKey from sub_final_rel to currel via the positions info in
src/backend/optimizer/path/uniquekeys.c:                ctx->pos =
pos; /* the position in current targetlist,  will be used to set
UnqiueKey */
src/backend/optimizer/path/uniquekeys.c: * Check if Unqiue key of the
innerrel is valid after join. innerrel's UniqueKey
src/backend/optimizer/path/uniquekeys.c: * initililze_unqiuecontext_for_joinrel
src/backend/optimizer/path/uniquekeys.c: * all the unqiuekeys which
are not possible to use later
src/backend/optimizer/path/uniquekeys.c:initililze_unqiuecontext_for_joinrel(RelOptInfo
*joinrel,  RelOptInfo *inputrel)
src/backend/optimizer/plan/analyzejoins.c:                      /*
This UnqiueKey is what we want */
src/backend/optimizer/plan/planner.c:   /* If we the result if unqiue
already, we just return the input_rel directly */
src/include/nodes/pathnodes.h: * exprs is a list of exprs which is
unqiue on current RelOptInfo.
src/test/regress/expected/join.out:-- XXXX: since b.id is unqiue now
so the group by cluase is erased, so
src/test/regress/expected/select_distinct.out:-- create unqiue index on dist_p
src/test/regress/expected/select_distinct.out:-- we also support
create unqiue index on each child tables
src/test/regress/sql/join.sql:-- XXXX: since b.id is unqiue now so the
group by cluase is erased, so
src/test/regress/sql/select_distinct.sql:-- create unqiue index on dist_p
src/test/regress/sql/select_distinct.sql:-- we also support create
unqiue index on each child tables

9. A few things wrong with the following fragment:

/* set the not null info now */
ListCell *lc;
foreach(lc, find_nonnullable_vars(qual))
{
Var *var = lfirst_node(Var, lc);
RelOptInfo *rel = root->simple_rel_array[var->varno];
if (var->varattno > InvalidAttrNumber)
rel->not_null_cols = bms_add_member(rel->not_null_cols, var->varattno);
}

a. including a function call in the foreach macro is not a practise
that we really follow. It's true that the macro now assigns the 2nd
param to a variable. Previous to 1cff1b95ab6 this was not the case and
it's likely best not to leave any bad examples around that code which
might get backported might follow.
b. We generally subtract InvalidAttrNumber from varattno when
including in a Bitmapset.
c. not_null_cols is not well named. I think notnullattrs
d. not_null_cols should not be a Relids type, it should be Bitmapset.

10. add_uniquekey_for_onerow() seems pretty wasteful.  Is there really
a need to add each item in the rel's targetlist to the uniquekey list?
What if we just add an empty list to the unique keys, that way if we
need to test if some expr is a superset of any uniquekey, then we'll
see it is as any set is a superset of an empty set.  Likely the empty
set of uniquekeys should be the only one in the rel's uniquekey list.

11. In create_distinct_paths() the code is now calling
get_sortgrouplist_exprs() multiple times with the same input. I think
it would be better to just call it once and set the result in a local
variable.

12. The comment in the code below is not true. The List contains
Lists, of which contain UniqueKeys

List    *uniquekeys; /* List of UniqueKey */

13. I'm having trouble parsing the final sentence in:

+ * can only guarantee the uniqueness without considering the null values. This
+ * field is necessary for remove_useless_join & reduce_unique_semijions since
+ * these cases don't care about the null values.

Why is the field which stores the nullability of the key required for
code that does not care about the nullability of the key?

Also please check your spelling of the word "join"

14. In the following fragment, instead of using i+1, please assign the
FormData_pg_attribute to a variable named attr and use attr->attnum.
Also, please see what I mentioned above about subtracting
InvalidAttrNumber

+ rel->not_null_cols = bms_add_member(rel->not_null_cols, i+1);

15. The tests you've changed the expected outcome of in join.out
should be updated so that the GROUP BY and DISTINCT clause is not
removed. This will allow the test to continue testing what it was
intended to test. You can do this by changing the columns in the GROUP
BY clause so that the new code does not find uniquekeys for those
columns.

16. The tests in aggregates.out are in a similar situation. There are
various tests trying to ensure that remove_useless_groupby_columns()
does what it's meant to do. You can modify these tests to add a join
which is non-unique to effectively duplicate the PK column.

17. In your select_distinct tests, can you move away from naming the
tables starting with select_distinct?  It makes reading queries pretty
hard.

e.g. explain (costs off) select distinct uk1, uk2 from
select_distinct_a where uk2 is not null;

When I first glanced that, I failed to see the underscores and the
query looked invalid.

18. Check the spelling if "erased". You have it spelt as "ereased" in
a couple of locations.

19. Please pay attention to the capitalisation of SQL keywords in the
test files you've modified. I understand we're very inconsistent in
this department in general, but we do at least try not to mix
capitalisation within the same file.  Basically, please upper case the
keywords in select_distinct.sql

20. In addition to the above, please try to wrap long SQL lines so
they're below 80 chars.

I'll review the patch in more detail once the above points have been addressed.

David

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Tom Lane

Date:

31 March 2020, 03:11:49

[ not a review, just some drive-by comments on David's comments ]

David Rowley <dgrowleyml@gmail.com> writes:
> 2. The following change does not seem like it should be part of this
> patch.  I understand you perhaps have done as you think it will
> improve the performance of checking if an expression is in a list of
> expressions.

> - COMPARE_SCALAR_FIELD(varno);
> + /* Compare varattno first since it has higher selectivity than varno */
>   COMPARE_SCALAR_FIELD(varattno);
> + COMPARE_SCALAR_FIELD(varno);

> If you think that is true, then please do it as a separate effort and
> provide benchmarks with your findings.

By and large, I'd reject such micro-optimizations on their face.
The rule in the nodes/ support files is to list fields in the same
order they're declared in.  There is no chance that it's worth
deviating from that for this.

I can believe that there'd be value in, say, comparing all
scalar fields before all non-scalar ones.  But piecemeal hacks
wouldn't be the way to handle that either.  In any case, I'd
prefer to implement such a plan within the infrastructure to
auto-generate these files that Andres keeps muttering about.

> a. including a function call in the foreach macro is not a practise
> that we really follow. It's true that the macro now assigns the 2nd
> param to a variable. Previous to 1cff1b95ab6 this was not the case and
> it's likely best not to leave any bad examples around that code which
> might get backported might follow.

No, I think you're misremembering.  foreach's second arg is
single-evaluation in all branches.  There were some preliminary
versions of 1cff1b95ab6 in which it would not have been, but that
was sufficiently dangerous that I found a way to get rid of it.

> b. We generally subtract InvalidAttrNumber from varattno when
> including in a Bitmapset.

ITYM FirstLowInvalidHeapAttributeNumber, but yeah.  Otherwise
the code fails on system columns, and there's seldom a good
reason to risk that.

            regards, tom lane

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

01 April 2020, 00:43:40

Thanks David for your time, I will acknowledge every item you mentioned

with the updated patch. Now I just talk about part of them.

1. There seem to be some cases where joins are no longer being
detected as unique. This is evident in postgres_fdw.out. We shouldn't
be regressing any of these cases.

You are correct, the issue here is I didn't distinguish the one_row case

in UniqueKey struct. Actually when a outer relation is join with a relation

which has only one row, there must be at most one row match the outer join.

The only-one-row case in postgres_fdw.out come from aggregation

call. I will added one more field "bool onerow" in UniqueKey struct. and

also try your optimization suggestion for the onerow UniqueKey.

2. The following change does not seem like it should be part of this
patch. I understand you perhaps have done as you think it will
improve the performance of checking if an expression is in a list of
expressions.

- COMPARE_SCALAR_FIELD(varno);
+ /* Compare varattno first since it has higher selectivity than varno */
COMPARE_SCALAR_FIELD(varattno);
+ COMPARE_SCALAR_FIELD(varno);

I did have a hesitation when I make this changes. so I'd rollback this change

at the following patch.

If you think that is true, then please do it as a separate effort and
provide benchmarks with your findings.

3. list_all_members_in. I think this would be better named as
list_is_subset. Please follow the lead of bms_is_subset().
Additionally, you should Assert that IsPointerList is true as there's
nothing else to indicate that it can't be used for an int or Oid list.

4. guarantee is not a very good name for the field in UniqueKey.
Maybe something like is_not_null?

5. I think you should be performing a bms_del_member during join
removal rather than removing this Assert()

- Assert(bms_equal(rel->relids, root->all_baserels));

FWIW, it's far from perfect that you've needed to delay the left join
removal, but I do understand why you've done it. It's also far from
perfect that you're including removed relations in the
total_table_pages calculation. c6e4133fae1 took some measures to
improve this calculation and this is making it worse again.

Since the removed relation depends on the UniqueKey which has to be

calculated after total_table_pages calculation in current code, so that's

something I must do. But if the relation is not removable, there is no waste

at all. If it is removable, such gain will much higher than the loss. I'm

not sure this should be a concern.

Actually looks the current remove_useless_join has some limits which can't

remove a joinrel, I still didn't figure out why. In the past we have some limited

ability to detect the unqiueness after join, so that's would be ok. Since we have

such ability now, this may be another opportunity to improve the join_is_removable

function, but I'd not like put such thing in this patch.

Since you said "far from perfect" twice for this point and I only get one reason (we

may plan a node which we removed later), do I miss the other one?

6. Can you explain why you moved the populate_baserel_uniquekeys()
call out of set_plain_rel_size()?

This is to be consistent with populate_partitionedrel_uniquekeys, which

is set at set_append_rel_pathlist.

7. I don't think the removal of rel_supports_distinctness() is
warranted. Is it not ok to check if the relation has any uniquekeys?

I think this is a good suggestion. I will follow that.

8. Your spelling of unique is incorrect in many places:

src/backend/nodes/makefuncs.c: * makeUnqiueKey
src/backend/optimizer/path/uniquekeys.c:static List
*initililze_unqiuecontext_for_joinrel(RelOptInfo *joinrel,
src/backend/optimizer/path/uniquekeys.c: * check if combination of
unqiuekeys from both side is still useful for us,
src/backend/optimizer/path/uniquekeys.c: outerrel_uniquekey_ctx
= initililze_unqiuecontext_for_joinrel(joinrel, outerrel);
src/backend/optimizer/path/uniquekeys.c: innerrel_uniquekey_ctx
= initililze_unqiuecontext_for_joinrel(joinrel, innerrel);
src/backend/optimizer/path/uniquekeys.c: * we need to convert the
UnqiueKey from sub_final_rel to currel via the positions info in
src/backend/optimizer/path/uniquekeys.c: ctx->pos =
pos; /* the position in current targetlist, will be used to set
UnqiueKey */
src/backend/optimizer/path/uniquekeys.c: * Check if Unqiue key of the
innerrel is valid after join. innerrel's UniqueKey
src/backend/optimizer/path/uniquekeys.c: * initililze_unqiuecontext_for_joinrel
src/backend/optimizer/path/uniquekeys.c: * all the unqiuekeys which
are not possible to use later
src/backend/optimizer/path/uniquekeys.c:initililze_unqiuecontext_for_joinrel(RelOptInfo
*joinrel, RelOptInfo *inputrel)
src/backend/optimizer/plan/analyzejoins.c: /*
This UnqiueKey is what we want */
src/backend/optimizer/plan/planner.c: /* If we the result if unqiue
already, we just return the input_rel directly */
src/include/nodes/pathnodes.h: * exprs is a list of exprs which is
unqiue on current RelOptInfo.
src/test/regress/expected/join.out:-- XXXX: since b.id is unqiue now
so the group by cluase is erased, so
src/test/regress/expected/select_distinct.out:-- create unqiue index on dist_p
src/test/regress/expected/select_distinct.out:-- we also support
create unqiue index on each child tables
src/test/regress/sql/join.sql:-- XXXX: since b.id is unqiue now so the
group by cluase is erased, so
src/test/regress/sql/select_distinct.sql:-- create unqiue index on dist_p
src/test/regress/sql/select_distinct.sql:-- we also support create
unqiue index on each child tables

9. A few things wrong with the following fragment:

/* set the not null info now */
ListCell *lc;
foreach(lc, find_nonnullable_vars(qual))
{
Var *var = lfirst_node(Var, lc);
RelOptInfo *rel = root->simple_rel_array[var->varno];
if (var->varattno > InvalidAttrNumber)
rel->not_null_cols = bms_add_member(rel->not_null_cols, var->varattno);
}

a. including a function call in the foreach macro is not a practise
that we really follow. It's true that the macro now assigns the 2nd
param to a variable. Previous to 1cff1b95ab6 this was not the case and
it's likely best not to leave any bad examples around that code which
might get backported might follow.
b. We generally subtract InvalidAttrNumber from varattno when
including in a Bitmapset.
c. not_null_cols is not well named. I think notnullattrs
d. not_null_cols should not be a Relids type, it should be Bitmapset.

If it is a Bitmapset, we have to pass it with "&" usually. is it our practice?

10. add_uniquekey_for_onerow() seems pretty wasteful. Is there really
a need to add each item in the rel's targetlist to the uniquekey list?
What if we just add an empty list to the unique keys, that way if we
need to test if some expr is a superset of any uniquekey, then we'll
see it is as any set is a superset of an empty set. Likely the empty
set of uniquekeys should be the only one in the rel's uniquekey list.

11. In create_distinct_paths() the code is now calling
get_sortgrouplist_exprs() multiple times with the same input. I think
it would be better to just call it once and set the result in a local
variable.

12. The comment in the code below is not true. The List contains
Lists, of which contain UniqueKeys

List *uniquekeys; /* List of UniqueKey */

It is a list of UniqueKey, the UniqueKey can have a list of exprs.

13. I'm having trouble parsing the final sentence in:

+ * can only guarantee the uniqueness without considering the null values. This
+ * field is necessary for remove_useless_join & reduce_unique_semijions since
+ * these cases don't care about the null values.

Why is the field which stores the nullability of the key required for
code that does not care about the nullability of the key?

The guarantee is introduced to for the following cases:

create table t1 (a int primary key, b int);
create table t2 (a int primary key, b int);
select .. from t1, (select b from t2 group by t2) t2 ..;

-- b is nullable. so t2(b) can't be a normal UniqueKey (which means b may have some

duplicated rows)

create unique index t2_uk_b on t2(b);

-- the left join still can be removed since t2.b is a unique index and the nullable

doesn't matter here.

select t1.* from t1 left join t2 on (t1.b = t2.b);

do you think we have can do some optimization in this case? I don't understand

your question well.

15. The tests you've changed the expected outcome of in join.out
should be updated so that the GROUP BY and DISTINCT clause is not
removed. This will allow the test to continue testing what it was
intended to test. You can do this by changing the columns in the GROUP
BY clause so that the new code does not find uniquekeys for those
columns.

Thanks for your explanation, very impressive!

16. The tests in aggregates.out are in a similar situation. There are
various tests trying to ensure that remove_useless_groupby_columns()
does what it's meant to do. You can modify these tests to add a join
which is non-unique to effectively duplicate the PK column.

17. In your select_distinct tests, can you move away from naming the
tables starting with select_distinct? It makes reading queries pretty
hard.

e.g. explain (costs off) select distinct uk1, uk2 from
select_distinct_a where uk2 is not null;

When I first glanced that, I failed to see the underscores and the
query looked invalid.

18. Check the spelling if "erased". You have it spelt as "ereased" in
a couple of locations.

OK, I just installed a spell check plugin for my editor, hope it will catch such

errors next time.

19. Please pay attention to the capitalisation of SQL keywords in the
test files you've modified. I understand we're very inconsistent in
this department in general, but we do at least try not to mix
capitalisation within the same file. Basically, please upper case the
keywords in select_distinct.sql

20. In addition to the above, please try to wrap long SQL lines so
they're below 80 chars.

I'll review the patch in more detail once the above points have been addressed.

David

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

David Rowley

Date:

01 April 2020, 03:38:42

On Wed, 1 Apr 2020 at 13:45, Andy Fan <zhihui.fan1213@gmail.com> wrote:
>> 5. I think you should be performing a bms_del_member during join
>> removal rather than removing this Assert()
>>
>> - Assert(bms_equal(rel->relids, root->all_baserels));
>>
>> FWIW, it's far from perfect that you've needed to delay the left join
>> removal, but I do understand why you've done it. It's also far from
>> perfect that you're including removed relations in the
>> total_table_pages calculation. c6e4133fae1 took some measures to
>> improve this calculation and this is making it worse again.
>>
> Since the removed relation depends on the UniqueKey which has to be
> calculated after  total_table_pages calculation in current code, so that's
> something I must do.  But if the relation is not removable,  there is no waste
> at all.  If it is removable,  such gain will much higher than the loss.  I'm
> not sure this should be a concern.

The reason join removals was done so early in planning before was to
save the planner from having to do additional work for relations which
were going to be removed later.  For example, building path lists.

> Actually looks the current remove_useless_join has some limits which can't
> remove a joinrel,  I still didn't figure out why.  In the past we have some limited
> ability to detect the unqiueness after join, so that's would be ok.  Since  we have
> such ability now,  this may be another opportunity to improve the join_is_removable
> function, but I'd not like put such thing in this patch.

Yeah, there's certainly more left join shapes that we could remove.
e.g when the left join relation is not a singleton rel.  We shouldn't
do anything to purposefully block additional join removals as a result
of adding UniqueKeys, but likely shouldn't go to any trouble to make
additional ones work. That can be done later.

> Since you said "far from perfect" twice for this point and I only get one reason (we
> may plan a node which we removed later),  do I miss the other one?

a) additional planning work by not removing the join sooner. b) wrong
total page calculation.

In theory b) could be fixed by subtracting the removed join rels pages
after we remove it, but unfortunately, there's no point since we've
built the paths by that time already and we really only use the value
to determine how much IO is going to be random vs sequential, which is
determined during set_base_rel_pathlists()

>> d. not_null_cols should not be a Relids type, it should be Bitmapset.
>>
> If it is a Bitmapset,  we have to pass it with "&" usually.  is it our practice?

Well, a Bitmapset pointer.   Relids is saved for range table indexes.
Storing anything else in there is likely to lead to confusion.

>> 12. The comment in the code below is not true. The List contains
>> Lists, of which contain UniqueKeys
>>
>> List    *uniquekeys; /* List of UniqueKey */
>>
> It is a list of UniqueKey,  the UniqueKey can have a list of exprs.

Hmm, so this is what I called a UniqueKeySet in the original patch.
I'm a bit divided by that change. With PathKeys, technically you can
make use of a Path with a given set of PathKeys if you only require
some leading subset of those keys.  That's not the case for
UniqueKeys, it's all or nothing, so perhaps having the singular name
is better than the plural name I gave it. However, I'm not certain.

(Really PathKey does not seem like a great name in the first place
since it has nothing to do with keys)

>> 13. I'm having trouble parsing the final sentence in:
>>
>> + * can only guarantee the uniqueness without considering the null values. This
>> + * field is necessary for remove_useless_join & reduce_unique_semijions since
>> + * these cases don't care about the null values.
>>
>> Why is the field which stores the nullability of the key required for
>> code that does not care about the nullability of the key?
>>
> The guarantee is introduced to for the following cases:
>
> create table t1 (a int primary key, b int);
> create table t2 (a int primary key, b int);
> select .. from t1,  (select b from t2 group by t2) t2 ..;
>
> -- b is nullable.  so t2(b) can't be a normal UniqueKey (which means b may have some
> duplicated rows)
> create unique index t2_uk_b on t2(b);
>
> -- the left join still can be removed since t2.b is a unique index and the nullable
> doesn't matter here.
> select t1.* from t1 left join t2 on (t1.b = t2.b);
>
> do you think we have can do some optimization in this case? I don't understand
> your question well.

OK, so by "don't care", you mean, don't duplicate NULL values.  I
assumed you had meant that it does not matter either way, as in: don't
mind if there are NULL values or not. It might be best to have a go at
changing the wording to be more explicit to what you mean there.

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

03 April 2020, 02:17:11

The updated patch should fixed all the issues. See the comments below for more

information.

On Tue, Mar 31, 2020 at 9:44 AM David Rowley <dgrowleyml@gmail.com> wrote:

On Sun, 29 Mar 2020 at 20:50, Andy Fan <zhihui.fan1213@gmail.com> wrote:
> Some other changes made in the new patch:
> 1. Fixed bug for UniqueKey calculation for OUTER join.
> 2. Fixed some typo error in comments.
> 3. Renamed the field "grantee" as "guarantee".

I've had a look over this patch. Thank for you doing further work on it.

I've noted down the following during my read of the code:

1. There seem to be some cases where joins are no longer being
detected as unique. This is evident in postgres_fdw.out. We shouldn't
be regressing any of these cases.

The issue here is I didn't distinguish the one_row case in UniqueKey struct.

Actually when a outer relation is join with a relation which has only one row,

there must be at most one row match the outer join.The only-one-row case in

postgres_fdw.out come from aggregation call.

Added a new field "onerow" in UniqueKey struct. and optimize the onerow

UniqueKey to not record every exprs. See add_uniquekey_for_onerow

and relation_is_onerow.

2. The following change does not seem like it should be part of this
patch. I understand you perhaps have done as you think it will
improve the performance of checking if an expression is in a list of
expressions.

- COMPARE_SCALAR_FIELD(varno);
+ /* Compare varattno first since it has higher selectivity than varno */
COMPARE_SCALAR_FIELD(varattno);
+ COMPARE_SCALAR_FIELD(varno);

If you think that is true, then please do it as a separate effort and
provide benchmarks with your findings.

Rollbacked.

3. list_all_members_in. I think this would be better named as
list_is_subset. Please follow the lead of bms_is_subset().
Additionally, you should Assert that IsPointerList is true as there's
nothing else to indicate that it can't be used for an int or Oid list.

Done

4. guarantee is not a very good name for the field in UniqueKey.
Maybe something like is_not_null?

I tried is_not_null, but when it is is_not_null equals false, it is a double

negation, and not feel good for me. At last, I used multi_nullvals to show

the UniqueKey may yield multi null values so the uniqueness is not guaranteed.

5. I think you should be performing a bms_del_member during join
removal rather than removing this Assert()

- Assert(bms_equal(rel->relids, root->all_baserels));

Done

FWIW, it's far from perfect that you've needed to delay the left join
removal, but I do understand why you've done it. It's also far from
perfect that you're including removed relations in the
total_table_pages calculation. c6e4133fae1 took some measures to
improve this calculation and this is making it worse again.

6. Can you explain why you moved the populate_baserel_uniquekeys()
call out of set_plain_rel_size()?

7. I don't think the removal of rel_supports_distinctness() is
warranted. Is it not ok to check if the relation has any uniquekeys?
It's possible, particularly in join_is_removable that this can save
quite a large amount of effort.

Done

8. Your spelling of unique is incorrect in many places:

src/backend/nodes/makefuncs.c: * makeUnqiueKey
src/backend/optimizer/path/uniquekeys.c:static List
*initililze_unqiuecontext_for_joinrel(RelOptInfo *joinrel,
src/backend/optimizer/path/uniquekeys.c: * check if combination of
unqiuekeys from both side is still useful for us,
src/backend/optimizer/path/uniquekeys.c: outerrel_uniquekey_ctx
= initililze_unqiuecontext_for_joinrel(joinrel, outerrel);
src/backend/optimizer/path/uniquekeys.c: innerrel_uniquekey_ctx
= initililze_unqiuecontext_for_joinrel(joinrel, innerrel);
src/backend/optimizer/path/uniquekeys.c: * we need to convert the
UnqiueKey from sub_final_rel to currel via the positions info in
src/backend/optimizer/path/uniquekeys.c: ctx->pos =
pos; /* the position in current targetlist, will be used to set
UnqiueKey */
src/backend/optimizer/path/uniquekeys.c: * Check if Unqiue key of the
innerrel is valid after join. innerrel's UniqueKey
src/backend/optimizer/path/uniquekeys.c: * initililze_unqiuecontext_for_joinrel
src/backend/optimizer/path/uniquekeys.c: * all the unqiuekeys which
are not possible to use later
src/backend/optimizer/path/uniquekeys.c:initililze_unqiuecontext_for_joinrel(RelOptInfo
*joinrel, RelOptInfo *inputrel)
src/backend/optimizer/plan/analyzejoins.c: /*
This UnqiueKey is what we want */
src/backend/optimizer/plan/planner.c: /* If we the result if unqiue
already, we just return the input_rel directly */
src/include/nodes/pathnodes.h: * exprs is a list of exprs which is
unqiue on current RelOptInfo.
src/test/regress/expected/join.out:-- XXXX: since b.id is unqiue now
so the group by cluase is erased, so
src/test/regress/expected/select_distinct.out:-- create unqiue index on dist_p
src/test/regress/expected/select_distinct.out:-- we also support
create unqiue index on each child tables
src/test/regress/sql/join.sql:-- XXXX: since b.id is unqiue now so the
group by cluase is erased, so
src/test/regress/sql/select_distinct.sql:-- create unqiue index on dist_p
src/test/regress/sql/select_distinct.sql:-- we also support create
unqiue index on each child tables

9. A few things wrong with the following fragment:

/* set the not null info now */
ListCell *lc;
foreach(lc, find_nonnullable_vars(qual))
{
Var *var = lfirst_node(Var, lc);
RelOptInfo *rel = root->simple_rel_array[var->varno];
if (var->varattno > InvalidAttrNumber)
rel->not_null_cols = bms_add_member(rel->not_null_cols, var->varattno);
}

a. including a function call in the foreach macro is not a practise
that we really follow. It's true that the macro now assigns the 2nd
param to a variable. Previous to 1cff1b95ab6 this was not the case and
it's likely best not to leave any bad examples around that code which
might get backported might follow.
b. We generally subtract InvalidAttrNumber from varattno when
including in a Bitmapset.
c. not_null_cols is not well named. I think notnullattrs
d. not_null_cols should not be a Relids type, it should be Bitmapset.

Above 2 Done

10. add_uniquekey_for_onerow() seems pretty wasteful. Is there really
a need to add each item in the rel's targetlist to the uniquekey list?
What if we just add an empty list to the unique keys, that way if we
need to test if some expr is a superset of any uniquekey, then we'll
see it is as any set is a superset of an empty set. Likely the empty
set of uniquekeys should be the only one in the rel's uniquekey list.

Now I use a single UniqueKey to show this situation. See

add_uniquekey_for_onerow and relation_is_onerow.

11. In create_distinct_paths() the code is now calling
get_sortgrouplist_exprs() multiple times with the same input. I think
it would be better to just call it once and set the result in a local
variable.

Done

12. The comment in the code below is not true. The List contains
Lists, of which contain UniqueKeys

List *uniquekeys; /* List of UniqueKey */

13. I'm having trouble parsing the final sentence in:

+ * can only guarantee the uniqueness without considering the null values. This
+ * field is necessary for remove_useless_join & reduce_unique_semijions since
+ * these cases don't care about the null values.

Why is the field which stores the nullability of the key required for
code that does not care about the nullability of the key?

Also please check your spelling of the word "join"

Actually I didn't find the spell error for "join"..

14. In the following fragment, instead of using i+1, please assign the
FormData_pg_attribute to a variable named attr and use attr->attnum.
Also, please see what I mentioned above about subtracting
InvalidAttrNumber

+ rel->not_null_cols = bms_add_member(rel->not_null_cols, i+1);

Done

15. The tests you've changed the expected outcome of in join.out
should be updated so that the GROUP BY and DISTINCT clause is not
removed. This will allow the test to continue testing what it was
intended to test. You can do this by changing the columns in the GROUP
BY clause so that the new code does not find uniquekeys for those
columns.

Done

16. The tests in aggregates.out are in a similar situation. There are
various tests trying to ensure that remove_useless_groupby_columns()
does what it's meant to do. You can modify these tests to add a join
which is non-unique to effectively duplicate the PK column.

There are some exceptions at this part.

1. The test for remove_useless_groupby_columns has some overlap

with our current erasing group node logic, like the test for a single relation.

so I just modified 2 tests case for this purpose.

2. When I read the code in remove_useless_groupby_columns, I found a

new case for our UniqueKey.

select * from m1 where a > (select avg(b) from m2 group by M1.A);

where the m1.a will have var->varlevelsup > 0, how should we set the UniqueKey

for this grouprel . I add some in-completed check at add_uniquekey_from_sortgroups

function. but I'm not sure if we need that.

3. remove_useless_groupby_columns maintains the parse->constraintDeps

when it depends on primary key, but UniqueKey doesn't maintain such data.

since we have translation layer which should protect us from the concurrency issue

and isolation issue. Do we need to do that as well in UniqueKey?

17. In your select_distinct tests, can you move away from naming the
tables starting with select_distinct? It makes reading queries pretty
hard.

e.g. explain (costs off) select distinct uk1, uk2 from
select_distinct_a where uk2 is not null;

When I first glanced that, I failed to see the underscores and the
query looked invalid.

18. Check the spelling if "erased". You have it spelt as "ereased" in
a couple of locations.

19. Please pay attention to the capitalisation of SQL keywords in the
test files you've modified. I understand we're very inconsistent in
this department in general, but we do at least try not to mix
capitalisation within the same file. Basically, please upper case the
keywords in select_distinct.sql

20. In addition to the above, please try to wrap long SQL lines so
they're below 80 chars.

All above 4 item Done.

I'll review the patch in more detail once the above points have been addressed.

David

Attachment

v4-0001-Maintain-UniqueKey-at-each-RelOptInfo-this-inform.patch

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

David Rowley

Date:

03 April 2020, 03:08:43

On Fri, 3 Apr 2020 at 15:18, Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
> The updated patch should fixed all the issues.  See the comments below for more
> information.
>
> On Tue, Mar 31, 2020 at 9:44 AM David Rowley <dgrowleyml@gmail.com> wrote:
>>
>> + * can only guarantee the uniqueness without considering the null values. This
>> + * field is necessary for remove_useless_join & reduce_unique_semijions since
>> + * these cases don't care about the null values.
>>
>> Why is the field which stores the nullability of the key required for
>> code that does not care about the nullability of the key?
>>
>> Also please check your spelling of the word "join"
>>
>
> Actually I didn't find the spell error for "join"..

It was in reduce_unique_semijions. That should be
reduce_unique_semijoins. I see you fixed it in the patch though.

> 3.  remove_useless_groupby_columns maintains the parse->constraintDeps
> when it depends on primary key,  but UniqueKey doesn't maintain such data.
> since we have translation layer which should protect us from the concurrency issue
> and isolation issue.  Do we need to do that as well in UniqueKey?

I'm pretty sure that code is pretty bogus in
remove_useless_groupby_columns(). It perhaps was just copied from
check_functional_grouping(), where it is required. Looks like the
(ahem) author of d4c3a156c got that wrong... :-(

The reason check_functional_grouping() needs it is for things like
creating a view with a GROUP BY clause that has a column in the SELECT
list that is functionally dependant on the GROUP BY columns. e.g:

create table z (a int primary key, b int);
create view view_z as select a,b from z group by a;
alter table z drop constraint z_pkey;
ERROR:  cannot drop constraint z_pkey on table z because other objects
depend on it
DETAIL:  view view_z depends on constraint z_pkey on table z
HINT:  Use DROP ... CASCADE to drop the dependent objects too.

Here that view would become invalid if the PK was dropped, so we must
record the dependency in that case.  Doing so is what causes that
error message.

For just planner smarts such as LEFT JOIN removal, Unique Joins, and
all this Unique Key stuff, we really don't need to record the
dependency as if the index or constraint is dropped, then that'll
cause a relcache invalidation and we'll see the invalidation when we
attempt to execute the cached plan. That will cause the statement to
be re-planned and we'll not see the unique index when we do that.

We should probably just get rid of that code in
remove_useless_groupby_columns() to stop people getting confused about
that.

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Tom Lane

Date:

03 April 2020, 03:40:51

David Rowley <dgrowleyml@gmail.com> writes:
> For just planner smarts such as LEFT JOIN removal, Unique Joins, and
> all this Unique Key stuff, we really don't need to record the
> dependency as if the index or constraint is dropped, then that'll
> cause a relcache invalidation and we'll see the invalidation when we
> attempt to execute the cached plan. That will cause the statement to
> be re-planned and we'll not see the unique index when we do that.

You need to make sure that the thing you're concerned about will actually
cause a relcache invalidation of a table in the query.  But yeah, if it
will then there's not a need to have any other invalidation mechanism.

(It occurs to me BTW that we've been overly conservative about using
NOT NULL constraints in planning, because of failing to consider that.
Addition or drop of NOT NULL has to cause a change in
pg_attribute.attnotnull, which will definitely cause a relcache inval
on its table, cf rules in CacheInvalidateHeapTuple().  So we *don't*
need to have a pg_constraint entry corresponding to the NOT NULL, as
we've mistakenly supposed in some past discussions.)

            regards, tom lane

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

David Rowley

Date:

03 April 2020, 04:07:54

On Fri, 3 Apr 2020 at 16:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> (It occurs to me BTW that we've been overly conservative about using
> NOT NULL constraints in planning, because of failing to consider that.
> Addition or drop of NOT NULL has to cause a change in
> pg_attribute.attnotnull, which will definitely cause a relcache inval
> on its table, cf rules in CacheInvalidateHeapTuple().  So we *don't*
> need to have a pg_constraint entry corresponding to the NOT NULL, as
> we've mistakenly supposed in some past discussions.)

Agreed for remove_useless_groupby_columns(), but we'd need it if we
wanted to detect functional dependencies in
check_functional_grouping() using unique indexes.

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

03 April 2020, 08:54:33

On Fri, Apr 3, 2020 at 12:08 PM David Rowley <dgrowleyml@gmail.com> wrote:

On Fri, 3 Apr 2020 at 16:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> (It occurs to me BTW that we've been overly conservative about using
> NOT NULL constraints in planning, because of failing to consider that.
> Addition or drop of NOT NULL has to cause a change in
> pg_attribute.attnotnull, which will definitely cause a relcache inval
> on its table, cf rules in CacheInvalidateHeapTuple(). So we *don't*
> need to have a pg_constraint entry corresponding to the NOT NULL, as
> we've mistakenly supposed in some past discussions.)

Agreed for remove_useless_groupby_columns(), but we'd need it if we
wanted to detect functional dependencies in
check_functional_grouping() using unique indexes.

Thanks for the explanation. I will add the removal in the next version of this

patch.

Best Regards

Andy Fan

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

David Rowley

Date:

04 April 2020, 01:31:32

On Fri, 3 Apr 2020 at 21:56, Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
>
>
> On Fri, Apr 3, 2020 at 12:08 PM David Rowley <dgrowleyml@gmail.com> wrote:
>>
>> On Fri, 3 Apr 2020 at 16:40, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> > (It occurs to me BTW that we've been overly conservative about using
>> > NOT NULL constraints in planning, because of failing to consider that.
>> > Addition or drop of NOT NULL has to cause a change in
>> > pg_attribute.attnotnull, which will definitely cause a relcache inval
>> > on its table, cf rules in CacheInvalidateHeapTuple().  So we *don't*
>> > need to have a pg_constraint entry corresponding to the NOT NULL, as
>> > we've mistakenly supposed in some past discussions.)
>>
>> Agreed for remove_useless_groupby_columns(), but we'd need it if we
>> wanted to detect functional dependencies in
>> check_functional_grouping() using unique indexes.
>
>
> Thanks for the explanation.  I will add the removal in the next version of this
> patch.

There's no need for this patch to touch
remove_useless_groupby_columns().  Fixes for that should be considered
independently and *possibly* even backpatched.

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

David Rowley

Date:

14 April 2020, 09:09:31

On Fri, 3 Apr 2020 at 15:18, Andy Fan <zhihui.fan1213@gmail.com> wrote:
> All above 4 item Done.

Just to explain my view on this going forward for PG14.  I do plan to
do a more thorough review of this soon.  I wasn't so keen on pursuing
this for PG13 as the skip scans patch [1] needs to use the same
infrastructure this patch has added and it does not, yet.

The infrastructure (knowing the unique properties of a RelOptInfo), as
provided by the patch Andy has been working on, which is based on my
rough prototype version, I believe should be used for the skip scans
patch as well.  I understand that as skip scans currently stands,
Jasper has done quite a bit of work to add the UniqueKeys, however,
this was unfortunately based on some early description of UniqueKeys
where I had thought that we could just store EquivalenceClasses. I no
longer think that's the case, and I believe the implementation that we
require is something more along the lines of Andy's latest version of
the patch.  However, I've not quite stared at it long enough to be
highly confident in that.

I'd like to strike up a bit of a plan to move both Andy's work and the
Skip scans work forward for PG14.

Here are my thoughts so far:

1. Revise v4 of remove DISTINCT patch to split the patch into two pieces.

0001 should add the UniqueKey code but not any additional planner
smarts to use them (i.e remove GROUP BY / DISTINCT) elimination parts.
Join removals and Unique joins should use UniqueKeys in this patch.
0002 should add back the GROUP BY  / DISTINCT smarts and add whatever
tests should be added for that and include updating existing expected
results and modifying any tests which no longer properly test what
they're meant to be testing.

I've done this with the attached patch.

2. David / Jesper to look at 0001 and build or align the existing skip
scans 0001 patch to make use of Andy's 0001 patch. This will require
tagging UniqueKeys onto Paths, not just RelOptInfos, plus a bunch of
other work.


Clearly UniqueKeys must suit both needs and since we have two
different implementations each providing some subset of the features,
then clearly we're not yet ready to move both skip scans and this
patch forward together. We need to align that and move both patches
forward together. Hopefully, the attached 0001 patch helps move that
along.



While I'm here, a quick review of Andy's v4 patch. I didn't address
any of this in the attached v5. These are only based on what I saw
when shuffling some code around. It's not an in-depth review.

1. Out of date comment in join.sql

-- join removal is not possible when the GROUP BY contains a column that is
-- not in the join condition.  (Note: as of 9.6, we notice that b.id is a
-- primary key and so drop b.c_id from the GROUP BY of the resulting plan;
-- but this happens too late for join removal in the outer plan level.)
explain (costs off)
select d.* from d left join (select d, c_id from b group by b.d, b.c_id) s
  on d.a = s.d;

You've changed the GROUP BY clause so it does not include b.id, so the
Note in the comment is now misleading.

2. I think 0002 is overly restrictive in its demands that
parse->hasAggs must be false. We should be able to just use a Group
Aggregate with unsorted input when the input_rel is unique on the
GROUP BY clause.  This will save on hashing and sorting.  Basically
similar to what we do for when a query contains aggregates without any
GROUP BY.

3. I don't quite understand why you changed this to a right join:

 -- Test case where t1 can be optimized but not t2
 explain (costs off) select t1.*,t2.x,t2.z
-from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y
+from t1 right join t2 on t1.a = t2.x and t1.b = t2.y

Perhaps this change is left over from some previous version of the patch?

[1] https://commitfest.postgresql.org/27/1741/

Attachment

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

15 April 2020, 00:18:48

Hi David:

Thanks for your time.

1. Out of date comment in join.sql

-- join removal is not possible when the GROUP BY contains a column that is
-- not in the join condition. (Note: as of 9.6, we notice that b.id is a
-- primary key and so drop b.c_id from the GROUP BY of the resulting plan;
-- but this happens too late for join removal in the outer plan level.)
explain (costs off)
select d.* from d left join (select d, c_id from b group by b.d, b.c_id) s
on d.a = s.d;

You've changed the GROUP BY clause so it does not include b.id, so the
Note in the comment is now misleading.

Thanks, I will fix this one in the following patch.

2. I think 0002 is overly restrictive in its demands that
parse->hasAggs must be false. We should be able to just use a Group
Aggregate with unsorted input when the input_rel is unique on the
GROUP BY clause. This will save on hashing and sorting. Basically
similar to what we do for when a query contains aggregates without any
GROUP BY.

Yes, This will be a perfect result, the difficult is the current aggregation function

execution is highly coupled with Agg node(ExecInitAgg) which is removed in the

unique case. I ever make the sum (w/o finalfn) and avg(with finalfn)

works in a hack way, but still many stuffs is not handled. Let me prepare the code

for this purpose in 1~2 days to see if I'm going with the right direction.

Ashutosh also has an idea[1] that if the relation underlying an Agg node is

known to be unique for given groupByClause, we could safely use

AGG_SORTED strategy. Though the input is not ordered, it's sorted thus for every row Agg

node will combine/finalize the aggregate result.

I will target the perfect result first and see how many effort do we need, if not,

I will try Ashutosh's suggestion.

3. I don't quite understand why you changed this to a right join:

-- Test case where t1 can be optimized but not t2
explain (costs off) select t1.*,t2.x,t2.z
-from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y
+from t1 right join t2 on t1.a = t2.x and t1.b = t2.y

Perhaps this change is left over from some previous version of the patch?

This is on purpose. the original test case is used to test we can short

the group key for t1 but not t2 for aggregation, but if I keep the inner join, the

aggnode will be removed totally, so I have to change it to right join in order

to keep the aggnode. The full test case is:

-- Test case where t1 can be optimized but not t2

explain (costs off) select t1.*,t2.x,t2.z

from t1 inner join t2 on t1.a = t2.x and t1.b = t2.y

group by t1.a,t1.b,t1.c,t1.d,t2.x,t2.z;

where (a, b) is the primary key of t1.

[1] https://www.postgresql.org/message-id/CAExHW5sY%2BL6iZ%3DrwnL7n3jET7aNLCNQimvfcS7C%2B5wmdjmdPiw%40mail.gmail.com

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

David Rowley

Date:

15 April 2020, 03:00:40

On Wed, 15 Apr 2020 at 12:19, Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
>> 2. I think 0002 is overly restrictive in its demands that
>> parse->hasAggs must be false. We should be able to just use a Group
>> Aggregate with unsorted input when the input_rel is unique on the
>> GROUP BY clause.  This will save on hashing and sorting.  Basically
>> similar to what we do for when a query contains aggregates without any
>> GROUP BY.
>>
>
> Yes,  This will be a perfect result,  the difficult is the current aggregation function
> execution is highly coupled with Agg node(ExecInitAgg) which is removed in the
> unique case.

This case here would be slightly different. It would be handled by
still creating a Group Aggregate path, but just not consider Hash
Aggregate and not Sort the input to the Group Aggregate path. Perhaps
that's best done by creating a new flag bit and using it in
create_grouping_paths() in the location where we set the flags
variable.  If you determine that the input_rel is unique for the GROUP
BY clause, and that there are aggregate functions, then set a flag,
e.g GROUPING_INPUT_UNIQUE. Likely there will be a few other flags that
you can skip setting in that function, for example, there's no need to
check if the input can sort, so no need for GROUPING_CAN_USE_SORT,
since you won't need to sort, likewise for GROUPING_CAN_USE_HASH. I'd
say there also is no need for checking if we can set
GROUPING_CAN_PARTIAL_AGG (What would be the point in doing partial
aggregation when there's 1 row per group?)   Then down in
add_paths_to_grouping_rel(), just add a special case before doing any
other code, such as:

if ((extra->flags & GROUPING_INPUT_UNIQUE) != 0 && parse->groupClause != NIL)
{
Path    *path = input_rel->cheapest_total_path;

add_path(grouped_rel, (Path *)
create_agg_path(root,
grouped_rel,
path,
grouped_rel->reltarget,
AGG_SORTED,
AGGSPLIT_SIMPLE,
parse->groupClause,
havingQual,
agg_costs,
dNumGroups));
return;
}

You may also want to consider the cheapest startup path there too so
that the LIMIT processing can do something smarter later in planning
(assuming cheapest_total_path != cheapest_startup_path (which you'd
need to check for)).

Perhaps it would be better to only set the GROUPING_INPUT_UNIQUE if
there is a groupClause, then just Assert(parse->groupClause != NIL)
inside that if.

David

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

16 April 2020, 02:17:00

On Wed, Apr 15, 2020 at 11:00 AM David Rowley <dgrowleyml@gmail.com> wrote:

On Wed, 15 Apr 2020 at 12:19, Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
>> 2. I think 0002 is overly restrictive in its demands that
>> parse->hasAggs must be false. We should be able to just use a Group
>> Aggregate with unsorted input when the input_rel is unique on the
>> GROUP BY clause. This will save on hashing and sorting. Basically
>> similar to what we do for when a query contains aggregates without any
>> GROUP BY.
>>
>
> Yes, This will be a perfect result, the difficult is the current aggregation function
> execution is highly coupled with Agg node(ExecInitAgg) which is removed in the
> unique case.

This case here would be slightly different. It would be handled by
still creating a Group Aggregate path, but just not consider Hash
Aggregate and not Sort the input to the Group Aggregate path. Perhaps
that's best done by creating a new flag bit and using it in
create_grouping_paths() in the location where we set the flags
variable. If you determine that the input_rel is unique for the GROUP
BY clause, and that there are aggregate functions, then set a flag,
e.g GROUPING_INPUT_UNIQUE. Likely there will be a few other flags that
you can skip setting in that function, for example, there's no need to
check if the input can sort, so no need for GROUPING_CAN_USE_SORT,
since you won't need to sort, likewise for GROUPING_CAN_USE_HASH. I'd
say there also is no need for checking if we can set
GROUPING_CAN_PARTIAL_AGG (What would be the point in doing partial
aggregation when there's 1 row per group?) Then down in
add_paths_to_grouping_rel(), just add a special case before doing any
other code, such as:

if ((extra->flags & GROUPING_INPUT_UNIQUE) != 0 && parse->groupClause != NIL)
{
Path *path = input_rel->cheapest_total_path;

add_path(grouped_rel, (Path *)
create_agg_path(root,
grouped_rel,
path,
grouped_rel->reltarget,
AGG_SORTED,
AGGSPLIT_SIMPLE,
parse->groupClause,
havingQual,
agg_costs,
dNumGroups));
return;
}

You may also want to consider the cheapest startup path there too so
that the LIMIT processing can do something smarter later in planning
(assuming cheapest_total_path != cheapest_startup_path (which you'd
need to check for)).

Perhaps it would be better to only set the GROUPING_INPUT_UNIQUE if
there is a groupClause, then just Assert(parse->groupClause != NIL)
inside that if.

Thank you for your detailed explanation. The attached v6 has included this feature.

Here is the the data to show the improvement.

Test cases:

create table grp2 (a int primary key, b char(200), c int);
insert into grp2 select i, 'x', i from generate_series(1, 10000000)i;
analyze grp2;
explain analyze select a, sum(c) from grp2 group by a;

w/o this feature:

postgres=# explain analyze select a, sum(c) from grp2 group by a;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=0.43..712718.44 rows=10000000 width=12) (actual time=0.088..15491.027 rows=10000000 loops=1)
Group Key: a
-> Index Scan using grp2_pkey on grp2 (cost=0.43..562718.44 rows=10000000 width=8) (actual time=0.068..6503.459 rows=10000000 loops=1)
Planning Time: 0.916 ms
Execution Time: 16252.397 ms
(5 rows)

Since the order of my data in heap and index is exactly same, which makes

the index scan much faster. The following is to test the cost of the *hash* aggregation,

postgres=# set enable_indexscan to off;
SET
postgres=# explain analyze select a, sum(c) from grp2 group by a;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
HashAggregate (cost=765531.00..943656.00 rows=10000000 width=12) (actual time=14424.379..30133.171 rows=10000000 loops=1)
Group Key: a
Planned Partitions: 128
Peak Memory Usage: 4153 kB
Disk Usage: 2265608 kB
HashAgg Batches: 640
-> Seq Scan on grp2 (cost=0.00..403031.00 rows=10000000 width=8) (actual time=0.042..2808.281 rows=10000000 loops=1)
Planning Time: 0.159 ms
Execution Time: 31098.804 ms
(9 rows)

With this feature:

explain analyze select a, sum(c) from grp2 group by a;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=0.00..553031.57 rows=10000023 width=12) (actual time=0.044..13209.485 rows=10000000 loops=1)
Group Key: a
-> Seq Scan on grp2 (cost=0.00..403031.23 rows=10000023 width=8) (actual time=0.023..4938.171 rows=10000000 loops=1)
Planning Time: 0.400 ms
Execution Time: 13749.121 ms
(5 rows)

During the implementation, I also added AGG_UNIQUE AggStrategy to

record this information in Agg Plan node, this is a simple way to do it and

should be semantic correct.

V6 also includes:

1. Fix the comment misleading you mentioned above.

2. Fixed a concern case for `relation_has_uniquekeys_for` function.

+ /* For UniqueKey->onerow case, the uniquekey->exprs is empty as well
+ * so we can't rely on list_is_subset to handle this special cases
+ */
+ if (exprs == NIL)
+ return false;

Best Regards

Andy Fan

Attachment

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Ashutosh Bapat

Date:

16 April 2020, 12:35:57

On Thu, Apr 16, 2020 at 7:47 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:

> (9 rows)
>
> With this feature:
> explain analyze select a, sum(c) from grp2 group by a;
>                                                         QUERY PLAN
>
--------------------------------------------------------------------------------------------------------------------------
>  GroupAggregate  (cost=0.00..553031.57 rows=10000023 width=12) (actual time=0.044..13209.485 rows=10000000 loops=1)
>    Group Key: a
>    ->  Seq Scan on grp2  (cost=0.00..403031.23 rows=10000023 width=8) (actual time=0.023..4938.171 rows=10000000
loops=1)
>  Planning Time: 0.400 ms
>  Execution Time: 13749.121 ms
> (5 rows)
>

Applying the patch gives a white space warning
git am /tmp/v6-000*
Applying: Introduce UniqueKeys to determine RelOptInfo unique properties
.git/rebase-apply/patch:545: indent with spaces.
    /* Fast path */
warning: 1 line adds whitespace errors.
Applying: Skip DISTINCT / GROUP BY if input is already unique

Compiling the patch causes one warning
nodeAgg.c:2134:3: warning: enumeration value ‘AGG_UNIQUE’ not handled
in switch [-Wswitch]

I have not looked at the patch. The numbers above look good. The time
spent in summing up a column in each row (we are summing only one
number per group) is twice the time it took to read those rows from
the table. That looks odd. But it may not be something unrelated to
your patch. I also observed that for explain analyze select a from
grp2 group by a; we just produce a plan containing seq scan node,
which is a good thing.

--
Best Wishes,
Ashutosh Bapat

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

17 April 2020, 01:16:59

On Thu, Apr 16, 2020 at 8:36 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:

On Thu, Apr 16, 2020 at 7:47 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:

> (9 rows)
>
> With this feature:
> explain analyze select a, sum(c) from grp2 group by a;
> QUERY PLAN
> --------------------------------------------------------------------------------------------------------------------------
> GroupAggregate (cost=0.00..553031.57 rows=10000023 width=12) (actual time=0.044..13209.485 rows=10000000 loops=1)
> Group Key: a
> -> Seq Scan on grp2 (cost=0.00..403031.23 rows=10000023 width=8) (actual time=0.023..4938.171 rows=10000000 loops=1)
> Planning Time: 0.400 ms
> Execution Time: 13749.121 ms
> (5 rows)
>

Applying the patch gives a white space warning
git am /tmp/v6-000*
Applying: Introduce UniqueKeys to determine RelOptInfo unique properties
.git/rebase-apply/patch:545: indent with spaces.
/* Fast path */
warning: 1 line adds whitespace errors.
Applying: Skip DISTINCT / GROUP BY if input is already unique

Compiling the patch causes one warning
nodeAgg.c:2134:3: warning: enumeration value ‘AGG_UNIQUE’ not handled
in switch [-Wswitch]

Thanks, I will fix them together with some detailed review suggestion.

(I know the review need lots of time, so appreciated for it).

I have not looked at the patch. The numbers above look good. The time
spent in summing up a column in each row (we are summing only one
number per group) is twice the time it took to read those rows from
the table. That looks odd. But it may not be something unrelated to
your patch. I also observed that for explain analyze select a from
grp2 group by a; we just produce a plan containing seq scan node,
which is a good thing.

Great and welcome back Ashutosh:)

--
Best Wishes,
Ashutosh Bapat

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

David Rowley

Date:

29 April 2020, 00:29:20

On Thu, 16 Apr 2020 at 14:17, Andy Fan <zhihui.fan1213@gmail.com> wrote:
> V6 also includes:
> 1.  Fix the comment misleading you mentioned above.
> 2.  Fixed a concern case for `relation_has_uniquekeys_for` function.

Over on [1], Richard highlights a problem in the current join removals
lack of ability to remove left joins unless the min_righthand side of
the join is a singleton rel. It's my understanding that the reason the
code checks for this is down to the fact that join removals used
unique indexed to prove the uniqueness of the relation and obviously,
those can only exist on base relations.  I wondered if you might want
to look into a 0003 patch which removes that restriction? I think this
can be done now since we no longer look at unique indexes to provide
the proves that the join to be removed won't duplicate outer side
rows.

David

[1] https://www.postgresql.org/message-id/CAMbWs4-THacv3DdMpiTrvg5ZY7sNViFF1pTU=kOKmtPBrE9-0Q@mail.gmail.com

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

29 April 2020, 00:34:59

On Wed, Apr 29, 2020 at 8:29 AM David Rowley <dgrowleyml@gmail.com> wrote:

On Thu, 16 Apr 2020 at 14:17, Andy Fan <zhihui.fan1213@gmail.com> wrote:
> V6 also includes:
> 1. Fix the comment misleading you mentioned above.
> 2. Fixed a concern case for `relation_has_uniquekeys_for` function.

Over on [1], Richard highlights a problem in the current join removals
lack of ability to remove left joins unless the min_righthand side of
the join is a singleton rel. It's my understanding that the reason the
code checks for this is down to the fact that join removals used
unique indexed to prove the uniqueness of the relation and obviously,
those can only exist on base relations. I wondered if you might want
to look into a 0003 patch which removes that restriction? I think this
can be done now since we no longer look at unique indexes to provide
the proves that the join to be removed won't duplicate outer side
rows.

Yes, I think that would be another benefit of UniqueKey, but it doesn't happen

until now. I will take a look of it today and fix it in a separated commit.

Best Regards

Andy Fan

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

29 April 2020, 13:27:54

On Wed, Apr 29, 2020 at 8:34 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:

On Wed, Apr 29, 2020 at 8:29 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 16 Apr 2020 at 14:17, Andy Fan <zhihui.fan1213@gmail.com> wrote:
> V6 also includes:
> 1. Fix the comment misleading you mentioned above.
> 2. Fixed a concern case for `relation_has_uniquekeys_for` function.

Over on [1], Richard highlights a problem in the current join removals
lack of ability to remove left joins unless the min_righthand side of
the join is a singleton rel. It's my understanding that the reason the
code checks for this is down to the fact that join removals used
unique indexed to prove the uniqueness of the relation and obviously,
those can only exist on base relations. I wondered if you might want
to look into a 0003 patch which removes that restriction? I think this
can be done now since we no longer look at unique indexes to provide
the proves that the join to be removed won't duplicate outer side
rows.

Yes, I think that would be another benefit of UniqueKey, but it doesn't happen
until now. I will take a look of it today and fix it in a separated commit.

I have make it work locally, the basic idea is to postpone the join removal at

build_join_rel stage where the uniquekey info is well maintained. I will test

more to send a product-ready-target patch tomorrow.

# explain (costs off) select a.i from a left join b on a.i = b.i and
b.j in (select j from c);
QUERY PLAN
---------------
Seq Scan on a
(1 row)

Best Regard

Andy Fan

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

07 May 2020, 01:31:51

I just uploaded the v7 version and split it into smaller commits for easier

review/merge. I also maintain a up-to-date README.uniquekey
document since something may changed during discussion or later code.

Here is the simple introduction of each commit.

====
1. v7-0001-Introduce-RelOptInfo-notnullattrs-attribute.patch

This commit adds the notnullattrs to RelOptInfo, which grabs the information
from both catalog and user's query.

2. v7-0002-Introuduce-RelOptInfo.uniquekeys-attribute.patch

This commit just add the uniquekeys to RelOptInfo and maintain it at every
stage. However the upper level code is not changed due to this.

Some changes of this part in v7:
1). Removed the UniqueKey.positions attribute. In the past it is used in
convert_subquery_uniquekeys, however we don't need it actually (And I
maintained it wrong in the past). Now I build the relationship between the
outer var to subuqery's TargetList with outrel.subquery.processed_tlist.
2). onerow UniqueKey(exprs = NIL) need to be converted to normal uniquekey(exprs
!= NIL) if it is not one-row any more. This may happen on some outer join.

3. v7-0003-Refactor-existing-uniqueness-related-code-to-use-.patch

Refactor the existing functions like innerrel_is_unique/res_is_distinct_for to
use UniqueKey, and postpone the call of remove_useless_join and

reduce_unique_semijoins to use the new implementation.

4. v7-0004-Remove-distinct-node-AggNode-if-the-input-is-uniq.patch

Remove the distinct node if the result is distinct already. Remove the aggnode
if the group by clause is unique already AND there is no aggregation function in
query.

5. v7-0005-If-the-group-by-clause-is-unique-and-we-have-aggr.patch

If the group by clause is unique and query has aggregation function, we use
the AGG_SORT strategy but without really sort since it has only one row in each

group.

6. v7-0006-Join-removal-at-run-time-with-UniqueKey.patch

This commit run join removal at build_join_rel. At that time, it can fully uses
unique key. It can handle some more cases, I added some new test cases to
join.sql. However it can be a replacement of the current one. There are some
cases the new strategy can work run well but the current one can. Like

SELECT a.* FROM a LEFT JOIN (b left join c on b.c_id = c.id) ON (a.b_id = b.id);

during the join a & b, the join can't be removed since b.id is still useful in
future. However in the future, we know the b.id can be removed as well, but
it is too late to remove the previous join.

At the implementation part, the main idea is if the join_canbe_removed. we
will copy the pathlist from outerrel to joinrel. There are several items need to
handle.

1. To make sure the overall join_search_one_level, we have to keep the joinrel
even the innerrel is removed (rather than discard the joinrel).
2. If the innerrel can be removed, we don't need to build pathlist for joinrel,
we just reuse the pathlist from outerrel. However there are many places where
use assert rel->pathlist[*]->parent == rel. so I copied the pathlist, we
have to change the parent to joinrel.
3. During create plan for some path on RTE_RELATION, it needs to know the
relation Oid with path->parent->relid. so we have to use the outerrel->relid
to overwrite the joinrel->relid which is 0 before.
4. Almost same paths as item 3, it usually assert best_path->parent->rtekind ==
RTE_RELATION; now the path may appeared in joinrel, so I used
outerrel->rtekind to overwrite joinrel->rtekind.
5. I guess there are some dependencies between path->pathtarget and
rel->reltarget. since we reuse the pathlist of outerrel, so I used the
outer->reltarget as well. If the join can be removed, I guess the length of
list_length(outrel->reltarget->exprs) >= (joinrel->reltarget->exprs). we can
rely on the ProjectionPath to reduce the tlist.

My patches is based on the current latest commit fb544735f1.

Best Regards

Andy Fan

Attachment

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Ashutosh Bapat

Date:

07 May 2020, 11:26:27

Hi Andy,
Sorry for delay in review. Your earlier patches are very large and it
requires some time to review those. I didn't finish reviewing those
but here are whatever comments I have till now on the previous set of
patches. Please see if any of those are useful to the new set.


+/*
+ * Return true iff there is an equal member in target for every
+ * member in members

Suggest reword: return true iff every entry in "members" list is also present
in the "target" list. This function doesn't care about multi-sets, so please
mention that in the prologue clearly.

+
+    if (root->parse->hasTargetSRFs)
+        return;

Why? A relation's uniqueness may be useful well before we work on SRFs.

+
+    if (baserel->reloptkind == RELOPT_OTHER_MEMBER_REL)
+        /*
+         * Set UniqueKey on member rel is useless, we have to recompute it at
+         * upper level, see populate_partitionedrel_uniquekeys for reference
+         */
+        return;

Handling these here might help in bottom up approach. We annotate each
partition here and then annotate partitioned table based on the individual
partitions. Same approach can be used for partitioned join produced by
partitionwise join.

+        /*
+         * If the unique index doesn't contain partkey, then it is unique
+         * on this partition only, so it is useless for us.
+         */

Not really. It could help partitionwise join.

+
+    /* Now we have the unique index list which as exactly same on all
childrels,
+     * Set the UniqueIndex just like it is non-partition table
+     */

I think it's better to annotate each partition with whatever unique index it
has whether or not global. That will help partitionwise join, partitionwise
aggregate/group etc.

+    /* A Normal group by without grouping set */
+    if (parse->groupClause)
+        add_uniquekey_from_sortgroups(root,
+                                      grouprel,
+                                      root->parse->groupClause);

Those keys which are part of groupClause and also form unique keys in the input
relation, should be recorded as unique keys in group rel. Knowing the minimal
set of keys allows more optimizations.

+
+    foreach(lc,  unionrel->reltarget->exprs)
+    {
+        exprs = lappend(exprs, lfirst(lc));
+        colnos = lappend_int(colnos, i);
+        i++;
+    }

This should be only possible when it's not UNION ALL. We should add some assert
or protection for that.

+
+    /* Fast path */
+    if (innerrel->uniquekeys == NIL || outerrel->uniquekeys == NIL)
+        return;
+
+    outer_is_onerow = relation_is_onerow(outerrel);
+    inner_is_onerow = relation_is_onerow(innerrel);
+
+    outerrel_ukey_ctx = initililze_uniquecontext_for_joinrel(joinrel,
outerrel);
+    innerrel_ukey_ctx = initililze_uniquecontext_for_joinrel(joinrel,
innerrel);
+
+    clause_list = gather_mergeable_joinclauses(joinrel, outerrel, innerrel,
+                                               restrictlist, jointype);

Something similar happens in select_mergejoin_clauses(). At least from the
first reading of this patch, I get an impression that all these functions which
are going through clause lists and index lists should be merged into other
functions which go through these lists hunting for some information useful to
the optimizer.

+
+
+    if (innerrel_keeps_unique(root, outerrel, innerrel, clause_list, false))
+    {
+        foreach(lc, innerrel_ukey_ctx)
+        {
+            UniqueKeyContext ctx = (UniqueKeyContext)lfirst(lc);
+            if (!list_is_subset(ctx->uniquekey->exprs,
joinrel->reltarget->exprs))
+            {
+                /* The UniqueKey on baserel is not useful on the joinrel */

A joining relation need not be a base rel always, it could be a join rel as
well.

+                ctx->useful = false;
+                continue;
+            }
+            if ((jointype == JOIN_LEFT || jointype == JOIN_FULL) &&
!ctx->uniquekey->multi_nullvals)
+            {
+                /* Change the multi_nullvals to true at this case */

Need a comment explaining this. Generally, I feel, this and other functions in
this file need good comments explaining the logic esp. "why" instead of "what".

+            else if (inner_is_onerow)
+            {
+                /* Since rows in innerrel can't be duplicated AND if
innerrel is onerow,
+                 * the join result will be onerow also as well. Note:
onerow implies
+                 * multi_nullvals = false.

I don't understand this comment. Why is there only one row in the other
relation which can join to this row?

+    }
+    /*
+     * Calculate max_colno in subquery. In fact we can check this with
+     * list_length(sub_final_rel->reltarget->exprs), However, reltarget
+     * is not set on UPPERREL_FINAL relation, so do it this way
+     */


Should/can we use the same logic to convert an expression in the subquery into
a Var of the outer query as in convert_subquery_pathkeys(). If the subquery
doesn't have a reltarget set, we should be able to use reltarget of any of its
paths since all of those should match the positions across subquery and the
outer query.

Will continue reviewing your new set of patches as time permits.

On Thu, May 7, 2020 at 7:02 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
> I just uploaded the v7 version and split it into smaller commits for easier
> review/merge. I also maintain a  up-to-date README.uniquekey
> document since something may changed during discussion or later code.
>
> Here is the simple introduction of each commit.
>
> ====
> 1. v7-0001-Introduce-RelOptInfo-notnullattrs-attribute.patch
>
> This commit adds the notnullattrs to RelOptInfo,  which grabs the information
> from both catalog and user's query.
>
>
> 2. v7-0002-Introuduce-RelOptInfo.uniquekeys-attribute.patch
>
> This commit just add the uniquekeys to RelOptInfo and maintain it at every
> stage. However the upper level code is not changed due to this.
>
> Some changes of this part in v7:
> 1). Removed the UniqueKey.positions attribute. In the past it is used in
>     convert_subquery_uniquekeys, however we don't need it actually (And I
>     maintained it wrong in the past). Now I build the relationship between the
>     outer var to subuqery's TargetList with outrel.subquery.processed_tlist.
> 2). onerow UniqueKey(exprs = NIL) need to be converted to normal uniquekey(exprs
>    != NIL) if it is not one-row any more. This may happen on some outer join.
>
>
> 3. v7-0003-Refactor-existing-uniqueness-related-code-to-use-.patch
>
> Refactor the existing functions like innerrel_is_unique/res_is_distinct_for to
> use UniqueKey, and postpone the call of remove_useless_join and
> reduce_unique_semijoins to use the new implementation.
>
> 4. v7-0004-Remove-distinct-node-AggNode-if-the-input-is-uniq.patch
>
> Remove the distinct node if the result is distinct already.  Remove the aggnode
> if the group by clause is unique already AND there is no aggregation function in
> query.
>
> 5. v7-0005-If-the-group-by-clause-is-unique-and-we-have-aggr.patch
>
> If the group by clause is unique and query has aggregation function, we use
> the AGG_SORT strategy but without really sort since it has only one row in each
> group.
>
>
> 6. v7-0006-Join-removal-at-run-time-with-UniqueKey.patch
>
> This commit run join removal at build_join_rel.  At that time, it can fully uses
> unique key. It can handle some more cases, I added some new test cases to
> join.sql. However it can be a replacement of the current one. There are some
> cases the new strategy can work run well but the current one can.  Like
>
> SELECT a.* FROM a LEFT JOIN (b left join c on b.c_id = c.id) ON (a.b_id = b.id);
>
> during the join a & b, the join can't be removed since b.id is still useful in
> future. However in the future, we know the b.id can be removed as well, but
> it is too late to remove the previous join.
>
> At the implementation part,  the main idea is if the join_canbe_removed. we
> will copy the pathlist from outerrel to joinrel. There are several items need to
> handle.
>
> 1. To make sure the overall join_search_one_level, we have to keep the joinrel
>    even the innerrel is removed (rather than discard the joinrel).
> 2. If the innerrel can be removed, we don't need to build pathlist for joinrel,
>    we just reuse the pathlist from outerrel. However there are many places where
>    use assert rel->pathlist[*]->parent == rel. so I copied the pathlist, we
>    have to change the parent to joinrel.
> 3. During create plan for some path on RTE_RELATION, it needs to know the
>    relation Oid with path->parent->relid. so we have to use the outerrel->relid
>    to overwrite the joinrel->relid which is 0 before.
> 4. Almost same paths as item 3, it usually assert best_path->parent->rtekind ==
>    RTE_RELATION; now the path may appeared in joinrel, so I used
>    outerrel->rtekind to overwrite joinrel->rtekind.
> 5. I guess there are some dependencies between path->pathtarget and
>    rel->reltarget. since we reuse the pathlist of outerrel, so I used the
>    outer->reltarget as well. If the join can be removed, I guess the length of
>    list_length(outrel->reltarget->exprs) >= (joinrel->reltarget->exprs). we can
>    rely on the ProjectionPath to reduce the tlist.
>
> My patches is based on the current latest commit fb544735f1.
>
> Best Regards
> Andy Fan



-- 
Best Wishes,
Ashutosh Bapat

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

08 May 2020, 01:57:24

Hi Ashutosh:

Appreciate for your comments!

On Thu, May 7, 2020 at 7:26 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:

Hi Andy,
Sorry for delay in review.

I understand no one has obligation to do that, and it must take reviewer's time

and more, so really appreciated for it! Hope I can provide more kindly help like

this in future as well.

Your earlier patches are very large and it
requires some time to review those. I didn't finish reviewing those
but here are whatever comments I have till now on the previous set of
patches. Please see if any of those are useful to the new set.

Yes, I just realized the size as well, so I split them to smaller commit and

each commit and be build and run make check successfully.

All of your comments still valid except the last one (covert_subquery_uniquekeys)

which has been fixed v7 version.

+/*
+ * Return true iff there is an equal member in target for every
+ * member in members

Suggest reword: return true iff every entry in "members" list is also present
in the "target" list.

Will do, thanks!

This function doesn't care about multi-sets, so please
mention that in the prologue clearly.

+
+ if (root->parse->hasTargetSRFs)
+ return;

Why? A relation's uniqueness may be useful well before we work on SRFs.

Looks I misunderstand when the SRF function is executed. I will fix this in v8.

+
+ if (baserel->reloptkind == RELOPT_OTHER_MEMBER_REL)
+ /*
+ * Set UniqueKey on member rel is useless, we have to recompute it at
+ * upper level, see populate_partitionedrel_uniquekeys for reference
+ */
+ return;

Handling these here might help in bottom up approach. We annotate each
partition here and then annotate partitioned table based on the individual
partitions. Same approach can be used for partitioned join produced by
partitionwise join.

+ /*
+ * If the unique index doesn't contain partkey, then it is unique
+ * on this partition only, so it is useless for us.
+ */

Not really. It could help partitionwise join.

+
+ /* Now we have the unique index list which as exactly same on all
childrels,
+ * Set the UniqueIndex just like it is non-partition table
+ */

I think it's better to annotate each partition with whatever unique index it
has whether or not global. That will help partitionwise join, partitionwise
aggregate/group etc.

Excellent catch! All the 3 items above is partitionwise join related, I need some time

to check how to handle that.

+ /* A Normal group by without grouping set */
+ if (parse->groupClause)
+ add_uniquekey_from_sortgroups(root,
+ grouprel,
+ root->parse->groupClause);

Those keys which are part of groupClause and also form unique keys in the input
relation, should be recorded as unique keys in group rel. Knowing the minimal
set of keys allows more optimizations.

This is a very valid point now. I ignored this because I wanted to remove the AggNode

totally if the part of groupClause is unique, However it doesn't happen later if there is

aggregation call in this query.

+
+ foreach(lc, unionrel->reltarget->exprs)
+ {
+ exprs = lappend(exprs, lfirst(lc));
+ colnos = lappend_int(colnos, i);
+ i++;
+ }

This should be only possible when it's not UNION ALL. We should add some assert
or protection for that.

OK, actually I called this function in generate_union_paths. which handle

UNION case only. I will add the Assert anyway.

+
+ /* Fast path */
+ if (innerrel->uniquekeys == NIL || outerrel->uniquekeys == NIL)
+ return;
+
+ outer_is_onerow = relation_is_onerow(outerrel);
+ inner_is_onerow = relation_is_onerow(innerrel);
+
+ outerrel_ukey_ctx = initililze_uniquecontext_for_joinrel(joinrel,
outerrel);
+ innerrel_ukey_ctx = initililze_uniquecontext_for_joinrel(joinrel,
innerrel);
+
+ clause_list = gather_mergeable_joinclauses(joinrel, outerrel, innerrel,
+ restrictlist, jointype);

Something similar happens in select_mergejoin_clauses().

I didn't realized this before. I will refactor this code accordingly if necessary

after reading that.

At least from the
first reading of this patch, I get an impression that all these functions which
are going through clause lists and index lists should be merged into other
functions which go through these lists hunting for some information useful to
the optimizer.

+
+
+ if (innerrel_keeps_unique(root, outerrel, innerrel, clause_list, false))
+ {
+ foreach(lc, innerrel_ukey_ctx)
+ {
+ UniqueKeyContext ctx = (UniqueKeyContext)lfirst(lc);
+ if (!list_is_subset(ctx->uniquekey->exprs,
joinrel->reltarget->exprs))
+ {
+ /* The UniqueKey on baserel is not useful on the joinrel */

A joining relation need not be a base rel always, it could be a join rel as
well.

good catch.

+ ctx->useful = false;
+ continue;
+ }
+ if ((jointype == JOIN_LEFT || jointype == JOIN_FULL) &&
!ctx->uniquekey->multi_nullvals)
+ {
+ /* Change the multi_nullvals to true at this case */

Need a comment explaining this. Generally, I feel, this and other functions in
this file need good comments explaining the logic esp. "why" instead of "what".

Exactly.

+ else if (inner_is_onerow)
+ {
+ /* Since rows in innerrel can't be duplicated AND if
innerrel is onerow,
+ * the join result will be onerow also as well. Note:
onerow implies
+ * multi_nullvals = false.

I don't understand this comment. Why is there only one row in the other
relation which can join to this row?

I guess you may miss the onerow special case if I understand correctly.

inner_is_onerow means something like "SELECT xx FROM t1 where uk = 1".

innerrel can't be duplicated means: t1.y = t2.pk; so the finally result is onerow

as well. One of the overall query is SELECT .. FROM t1, t2 where t2.y = t2.pk;

I explained more about onerow in the v7 README.unqiuekey document, just copy

it here.

===

1. What is UniqueKey?

....

onerow is also a kind of UniqueKey which means the RelOptInfo will have 1 row at
most. it has a stronger semantic than others. like SELECT uk FROM t; uk is
normal unique key and may have different values.
SELECT colx FROM t WHERE uk = const. colx is unique AND we have only 1 value. This
field can used for innerrel_is_unique. and also be used as an optimization for
this case. We don't need to maintain multi UniqueKey, we just maintain one with
onerow = true and exprs = NIL.

onerow is set to true only for 2 cases right now. 1) SELECT .. FROM t WHERE uk =
1; 2). SELECT aggref(xx) from t; // Without group by.

===

2. How is it maintained?

....

More considerations about onerow:
1. If relation with one row and it can't be duplicated, it is still possible
contains mulit_nullvas after outer join.
2. If the either UniqueKey can be duplicated after join, the can get one row
only when both side is one row AND there is no outer join.
3. Whenever the onerow UniqueKey is not a valid any more, we need to convert one
row UniqueKey to normal unique key since we don't store exprs for one-row
relation. get_exprs_from_uniquekeys will be used here.

===

and 3. in the v7 implementation, the code struct is more clearer:)

+ }
+ /*
+ * Calculate max_colno in subquery. In fact we can check this with
+ * list_length(sub_final_rel->reltarget->exprs), However, reltarget
+ * is not set on UPPERREL_FINAL relation, so do it this way
+ */

Should/can we use the same logic to convert an expression in the subquery into
a Var of the outer query as in convert_subquery_pathkeys().

Yes, my previous implementation is actually wrong. and should be fixed it in v7.

If the subquery doesn't have a reltarget set, we should be able to use reltarget

of any of its paths since all of those should match the positions across subquery

and the outer query.

but I think it should be rel->subroot->processed_tlist rather than reltarget? Actually I still

a bit of uneasy about rel->subroot->processed_tlist for some DML case, which the

processed_tlist is different and I still not figure out its impact.

Will continue reviewing your new set of patches as time permits.

Thank you! Actually there is no big difference between v6 and v7 regarding the

UniqueKey part except 2 bug fix. However v7 has some more documents,

comments improvement and code refactor/split, which may be helpful

for review. You may try v7 next time if v8 has not come yet:)

Best Regards

Andy Fan

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

13 May 2020, 11:59:08

The attached is the v8-patches. The main improvements are based on Ashutosh's
review (reduce the SRF impact and partition level UniqueKey). I also update the
README.uniquekey based on the discussion. So anyone who don't want to go
through the long email can read the README.uniquekey first.

===
Just copy some content from the README for easy discussion.

As for inherit table, we maintain the UnqiueKey on childrel as usual. But for
partitioned table we need to maintain 2 different kinds of UnqiueKey.

1). UniqueKey on the parent relation 2). UniqueKey on child relation for

partition wise query.

Example:
CREATE TABLE p (a int not null, b int not null) partition by list (a);
CREATE TABLE p0 partition of p for values in (1);
CREATE TABLE p1 partition of p for values in (2);

create unique index p0_b on p0(b);
create unique index p1_b on p1(b);

Now b is unique on partition level only, so the distinct can't be removed on
the following cases. SELECT DISTINCT b FROM p; However for query

SELECT DISTINCT b FROM p WHERE a = 1; where only one
partition is chosen, the UniqueKey on child relation is same as the UniqueKey

on parent relation. The distinct can be removed.

Another usage of UniqueKey on partition level is it be helpful for
partition-wise join.

As for the UniqueKey on parent table level, it comes with 2 different ways.

1). the UniqueKey is also derived in Unique index, but the index must be same
in all the related children relations and the unique index must contains
Partition Key in it. Example:

CREATE UNIQUE INDEX p_ab ON p(a, b); -- where a is the partition key.

-- Query
SELECT a, b FROM p; -- the (a, b) is a UniqueKey of p.

2). If the parent relation has only one childrel, the UniqueKey on childrel is
the UniqueKey on parent as well.

The patch structure is not changed, you can see [1] for reference. The patches is

based on latest commit ac3a4866c0.

[1] https://www.postgresql.org/message-id/CAKU4AWr1BmbQB4F7j22G%2BNS4dNuem6dKaUf%2B1BK8me61uBgqqg%40mail.gmail.com

Best Regards

Andy Fan

Attachment

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Ashutosh Bapat

Date:

13 May 2020, 12:04:17

On Fri, May 8, 2020 at 7:27 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:

>> +            else if (inner_is_onerow)
>> +            {
>> +                /* Since rows in innerrel can't be duplicated AND if
>> innerrel is onerow,
>> +                 * the join result will be onerow also as well. Note:
>> onerow implies
>> +                 * multi_nullvals = false.
>>
>> I don't understand this comment. Why is there only one row in the other
>> relation which can join to this row?
>
>
> I guess you may miss the onerow special case if I understand correctly.
> inner_is_onerow means something like "SELECT xx FROM t1 where uk = 1".
> innerrel can't be duplicated means:  t1.y = t2.pk;  so the finally result is onerow
> as well.  One of the overall query is  SELECT .. FROM t1, t2 where t2.y = t2.pk;
>
>
> I explained more about onerow in the v7 README.unqiuekey document, just copy
> it here.
For some reason this mail remained in my drafts without being sent.
Sending it now. Sorry.

My impression about the one row stuff, is that there is too much
special casing around it. We should somehow structure the UniqueKey
data so that one row unique keys come naturally rather than special
cased. E.g every column in such a case is unique in the result so
create as many UniqueKeys are the number of columns or create one
unique key with no column as you have done but handle it more
gracefully rather than spreading it all over the place.

Also, the amount of code that these patches changes seems to be much
larger than the feature's worth arguably. But it indicates that we are
modifying/adding more code than necessary. Some of that code can be
merged into existing code which does similar things as I have pointed
out in my previous comment.

Thanks for working on the expanded scope of the initial feature you
proposed. But it makes the feature more useful, I think.

--
Best Wishes,
Ashutosh Bapat

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

13 May 2020, 15:48:25

On Wed, May 13, 2020 at 8:04 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:

On Fri, May 8, 2020 at 7:27 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:

>> + else if (inner_is_onerow)
>> + {
>> + /* Since rows in innerrel can't be duplicated AND if
>> innerrel is onerow,
>> + * the join result will be onerow also as well. Note:
>> onerow implies
>> + * multi_nullvals = false.
>>
>> I don't understand this comment. Why is there only one row in the other
>> relation which can join to this row?
>
>
> I guess you may miss the onerow special case if I understand correctly.
> inner_is_onerow means something like "SELECT xx FROM t1 where uk = 1".
> innerrel can't be duplicated means: t1.y = t2.pk; so the finally result is onerow
> as well. One of the overall query is SELECT .. FROM t1, t2 where t2.y = t2.pk;
>
>
> I explained more about onerow in the v7 README.unqiuekey document, just copy
> it here.
For some reason this mail remained in my drafts without being sent.
Sending it now. Sorry.

My impression about the one row stuff, is that there is too much
special casing around it. We should somehow structure the UniqueKey
data so that one row unique keys come naturally rather than special
cased. E.g every column in such a case is unique in the result so
create as many UniqueKeys are the number of columns

This is the beginning state of the UniqueKey, later David suggested

this as an optimization[1], I buy-in the idea and later I found it mean

more than the original one [2], so I think onerow is needed actually.

or create one
unique key with no column as you have done but handle it more
gracefully rather than spreading it all over the place.

I think this is what I do now, but it is possible that I spread it more than

necessary, if so, please let me know. I maintained the README.uniquekey

carefully since v7 and improved a lot in v8, it may be a good place to check

it.

Also, the amount of code that these patches changes seems to be much
larger than the feature's worth arguably. But it indicates that we are
modifying/adding more code than necessary. Some of that code can be
merged into existing code which does similar things as I have pointed
out in my previous comment.

I have reused the code select_mergejoin_clause rather than maintaining my

own copies in v8. Thanks a lot about that suggestion. This happened mainly

because I didn't read enough code. I will go through more to see if I have similar

issues.

Thanks for working on the expanded scope of the initial feature you
proposed. But it makes the feature more useful, I think.

That's mainly because your suggestions are always insightful which makes me

willing to continue to work on it, so thank you all!

===

In fact, I was hesitated that how to reply an email when I send an new version

of the patch. One idea is I should explain clear what is the difference between Vn

and Vn-1. The other one is not many people read the Vn-1, so I would like to keep

the email simplified and keep the README clearly to save the reviewer's time.

Actually there are more changes in v8 than I stated above. for example: for the

UniqueKey on baserelation, we will reduce the expr from the UniqueKey if the

expr is a const. Unique on (A, B). query is SELECT b FROM t WHERE a = 1;

in v7, the UniqueKey is (a, b). In v8, the UniqueKey is (b) only. But since most

people still not start to read it, so I add such information to README rather than

echo the same in email thread. I will try more to understand how to communicate more

smooth. But any suggestion on this part is appreciated.

Best Regards

Andy Fan

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

13 May 2020, 16:02:42

Hi Ashutosh:

All your suggestions are followed except the UNION ALL one. I replied the reason

below. For the suggestions about partitioned table, looks lot of cases to handle, so

I summarize/enrich your idea in README and email thread, we can continue

to talk about that.

+
+ foreach(lc, unionrel->reltarget->exprs)
+ {
+ exprs = lappend(exprs, lfirst(lc));
+ colnos = lappend_int(colnos, i);
+ i++;
+ }

This should be only possible when it's not UNION ALL. We should add some assert
or protection for that.

OK, actually I called this function in generate_union_paths. which handle
UNION case only. I will add the Assert anyway.

Finally I found I have to add one more parameter to populate_unionrel_uniquekeys, and

the only usage of that parameter is used to Assert, so I didn't do that at last.

+
+ /* Fast path */
+ if (innerrel->uniquekeys == NIL || outerrel->uniquekeys == NIL)
+ return;
+
+ outer_is_onerow = relation_is_onerow(outerrel);
+ inner_is_onerow = relation_is_onerow(innerrel);
+
+ outerrel_ukey_ctx = initililze_uniquecontext_for_joinrel(joinrel,
outerrel);
+ innerrel_ukey_ctx = initililze_uniquecontext_for_joinrel(joinrel,
innerrel);
+
+ clause_list = gather_mergeable_joinclauses(joinrel, outerrel, innerrel,
+ restrictlist, jointype);

Something similar happens in select_mergejoin_clauses().

I didn't realized this before. I will refactor this code accordingly if necessary
after reading that.

I reused select_mergejoin_clauses and removed the duplicated code in uniquekeys.c

in v8.

At least from the
first reading of this patch, I get an impression that all these functions which
are going through clause lists and index lists should be merged into other
functions which go through these lists hunting for some information useful to
the optimizer.
+
+
+ if (innerrel_keeps_unique(root, outerrel, innerrel, clause_list, false))
+ {
+ foreach(lc, innerrel_ukey_ctx)
+ {
+ UniqueKeyContext ctx = (UniqueKeyContext)lfirst(lc);
+ if (!list_is_subset(ctx->uniquekey->exprs,
joinrel->reltarget->exprs))
+ {
+ /* The UniqueKey on baserel is not useful on the joinrel */

A joining relation need not be a base rel always, it could be a join rel as
well.

good catch.

Fixed.

+ ctx->useful = false;
+ continue;
+ }
+ if ((jointype == JOIN_LEFT || jointype == JOIN_FULL) &&
!ctx->uniquekey->multi_nullvals)
+ {
+ /* Change the multi_nullvals to true at this case */

Need a comment explaining this. Generally, I feel, this and other functions in
this file need good comments explaining the logic esp. "why" instead of "what".

Exactly.

Done in v8.

Will continue reviewing your new set of patches as time permits.

Thank you! Actually there is no big difference between v6 and v7 regarding the
UniqueKey part except 2 bug fix. However v7 has some more documents,
comments improvement and code refactor/split, which may be helpful
for review. You may try v7 next time if v8 has not come yet:)

v8 has come :)

Best Regards

Andy Fan

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

David Rowley

Date:

13 May 2020, 22:19:59

On Thu, 14 May 2020 at 03:48, Andy Fan <zhihui.fan1213@gmail.com> wrote:
> On Wed, May 13, 2020 at 8:04 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>> My impression about the one row stuff, is that there is too much
>> special casing around it. We should somehow structure the UniqueKey
>> data so that one row unique keys come naturally rather than special
>> cased. E.g every column in such a case is unique in the result so
>> create as many UniqueKeys are the number of columns
>
>
> This is the beginning state of the UniqueKey,  later David suggested
> this as an optimization[1], I buy-in the idea and later I found it mean
> more than the original one [2], so I think onerow is needed actually.

Having the "onerow" flag was not how I intended it to work.

Here's an example of how I thought it should work:

Assume t1 has UniqueKeys on {a}

SELECT DISTINCT a,b FROM t1;

Here the DISTINCT can be a no-op due to "a" being unique within t1. Or
more basically, {a} is a subset of {a,b}.

The code which does this is relation_has_uniquekeys_for(), which
contains the code:

+ if (list_is_subset(ukey->exprs, exprs))
+ return true;

In this case, ukey->exprs is {a} and exprs is {a,b}. So, if the
UniqueKey's exprs are a subset of, in this case, the DISTINCT exprs
then relation_has_uniquekeys_for() returns true. Basically
list_is_subset({a}, {a,b}), Answer: "Yes".

For the onerow stuff, if we can prove the relation returns only a
single row, e.g an aggregate without a GROUP BY, or there are
EquivalenceClasses with ec_has_const == true for each key of a unique
index, then why can't set just set the UniqueKeys to {}?  That would
mean the code to determine if we can avoid performing an explicit
DISTINCT operation would be called with list_is_subset({}, {a,b}),
which is also true, in fact, an empty set is a subset of any set. Why
is there a need to special case that fact?

In light of those thoughts, can you explain why you think we need to
keep the onerow flag?

David

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

14 May 2020, 02:38:44

On Thu, May 14, 2020 at 6:20 AM David Rowley <dgrowleyml@gmail.com> wrote:

On Thu, 14 May 2020 at 03:48, Andy Fan <zhihui.fan1213@gmail.com> wrote:
> On Wed, May 13, 2020 at 8:04 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
>> My impression about the one row stuff, is that there is too much
>> special casing around it. We should somehow structure the UniqueKey
>> data so that one row unique keys come naturally rather than special
>> cased. E.g every column in such a case is unique in the result so
>> create as many UniqueKeys are the number of columns
>
>
> This is the beginning state of the UniqueKey, later David suggested
> this as an optimization[1], I buy-in the idea and later I found it mean
> more than the original one [2], so I think onerow is needed actually.

Having the "onerow" flag was not how I intended it to work.

Thanks for the detailed explanation. So I think we do need to handle onerow

specially, (It means more things than adding each column as a UniqueKey).

but we don't need the onerow flag since we can tell it by ukey->exprs == NIL.

During the developer of this feature, I added some Asserts as double checking

for onerow and exprs. it helps me to find some special cases. like

SELECT FROM multirows union SELECT FROM multirows; where targetlist is NIL.

(I find the above case returns onerow as well just now). so onerow flag allows us

check this special things with more attention. Even this is not the original intention

but looks it is the one purpose now.

However I am feeling that removing onerow flag doesn't require much of code

changes. Almost all the special cases which are needed before are still needed

after that and all the functions based on that like relation_is_onerow

/add_uniquekey_onerow is still valid, we just need change the implementation.

so do you think we need to remove onerow flag totally?

Best Regards

Andy Fan

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

14 May 2020, 02:46:27

On Wed, May 13, 2020 at 11:48 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:

My impression about the one row stuff, is that there is too much
special casing around it. We should somehow structure the UniqueKey
data so that one row unique keys come naturally rather than special
cased. E.g every column in such a case is unique in the result so
create as many UniqueKeys are the number of columns

This is the beginning state of the UniqueKey, later David suggested
this as an optimization[1], I buy-in the idea and later I found it mean
more than the original one [2], so I think onerow is needed actually.

I just found I forget the links yesterday. Here is it.

[1] https://www.postgresql.org/message-id/CAApHDvqg%2BmQyxJnCizE%3DqJcBL90L%3DoFXTFyiwWWEaUnzG7Uc5Q%40mail.gmail.com

[2] https://www.postgresql.org/message-id/CAKU4AWrGrs0Vk5OrZmS1gbTA2ijDH18NHKnXZTPZNuupn%2B%2Bing%40mail.gmail.com

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Dmitry Dolgov

Date:

21 May 2020, 19:51:23

> On Tue, Apr 14, 2020 at 09:09:31PM +1200, David Rowley wrote:
>
> The infrastructure (knowing the unique properties of a RelOptInfo), as
> provided by the patch Andy has been working on, which is based on my
> rough prototype version, I believe should be used for the skip scans
> patch as well.

Hi,

Following our agreement about making skip scan patch to use UniqueKeys
implementation from this thread I've rebased index skip scan on first
two patches from v8 series [1] (if I understand correctly those two are
introducing the concept, and others are just using it). I would like to
clarify couple of things:

* It seems populate_baserel_uniquekeys, which actually sets uniquekeys,
  is called after create_index_paths, where index skip scan already
  needs to use them. Is it possible to call it earlier?

* Do I understand correctly, that a UniqueKey would be created in a
  simplest case only when an index is unique? This makes it different
  from what was implemented for index skip scan, since there UniqueKeys
  also represents potential to use non-unique index to facilitate search
  for unique values via skipping.

[1]: https://www.postgresql.org/message-id/CAKU4AWpOM3_J-B%3DwQtCeU1TGr89MhpJBBkv2he1tAeQz6i4XNw%40mail.gmail.com

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

David Rowley

Date:

21 May 2020, 20:40:17

On Fri, 22 May 2020 at 07:49, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> * It seems populate_baserel_uniquekeys, which actually sets uniquekeys,
>   is called after create_index_paths, where index skip scan already
>   needs to use them. Is it possible to call it earlier?

Seems reasonable. I originally put it after check_index_predicates().
We need to wait until at least then so we can properly build
UniqueKeys for partial indexes.

> * Do I understand correctly, that a UniqueKey would be created in a
>   simplest case only when an index is unique? This makes it different
>   from what was implemented for index skip scan, since there UniqueKeys
>   also represents potential to use non-unique index to facilitate search
>   for unique values via skipping.

The way I picture the two working together is that the core UniqueKey
patch adds UniqueKeys to RelOptInfos to allow us to have knowledge
about what they're unique on based on the base relation's unique
indexes.

For Skipscans, that patch must expand on UniqueKeys to teach Paths
about them. I imagine we'll set some required UniqueKeys during
standard_qp_callback() and then we'll try to create some Skip Scan
paths (which are marked with UniqueKeys) if the base relation does not
already have UniqueKeys that satisfy the required UniqueKeys that were
set during standard_qp_callback().  In the future, there may be other
reasons to create Skip Scan paths for certain rels, e.g if they're on
the inner side of a SEMI/ANTI join, it might be useful to try those
when planning joins.

David

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

David Rowley

Date:

21 May 2020, 20:51:45

On Thu, 14 May 2020 at 14:39, Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
> On Thu, May 14, 2020 at 6:20 AM David Rowley <dgrowleyml@gmail.com> wrote:
>> Having the "onerow" flag was not how I intended it to work.
>>
> Thanks for the detailed explanation.  So I think we do need to handle onerow
> specially, (It means more things than adding each column as a UniqueKey).
> but we don't need the onerow flag since we can tell it by ukey->exprs == NIL.
>
> During the developer of this feature,  I added some Asserts as double checking
> for onerow and exprs.  it helps me to find some special cases. like
> SELECT FROM multirows  union SELECT  FROM multirows; where targetlist is NIL.
> (I find the above case returns onerow as well just now).  so onerow flag allows us
> check this special things with more attention. Even this is not the original intention
> but looks it is the one purpose now.

But surely that special case should just go in
populate_unionrel_uniquekeys(). If the targetlist is empty, then add a
UniqueKey with an empty set of exprs.

> However I am feeling that removing onerow flag doesn't require much of code
> changes. Almost all the special cases which are needed before are still needed
> after that and all the functions based on that like relation_is_onerow
> /add_uniquekey_onerow is still valid, we just need change the implementation.
> so do you think we need to remove onerow flag totally?

Well, at the moment I'm not quite understanding why it's needed. If
it's not needed then we should remove it. If it turns out there is
some valid reason for it, then we should keep it.

David

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

21 May 2020, 23:39:38

On Fri, May 22, 2020 at 4:40 AM David Rowley <dgrowleyml@gmail.com> wrote:

On Fri, 22 May 2020 at 07:49, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> * It seems populate_baserel_uniquekeys, which actually sets uniquekeys,
> is called after create_index_paths, where index skip scan already
> needs to use them. Is it possible to call it earlier?

Seems reasonable. I originally put it after check_index_predicates().
We need to wait until at least then so we can properly build
UniqueKeys for partial indexes.

Looks a very valid reason, I will add this in the next version.

> * Do I understand correctly, that a UniqueKey would be created in a
> simplest case only when an index is unique? This makes it different
> from what was implemented for index skip scan, since there UniqueKeys
> also represents potential to use non-unique index to facilitate search
> for unique values via skipping.

The way I picture the two working together is that the core UniqueKey
patch adds UniqueKeys to RelOptInfos to allow us to have knowledge
about what they're unique on based on the base relation's unique
indexes.

For Skipscans, that patch must expand on UniqueKeys to teach Paths
about them. I imagine we'll set some required UniqueKeys during
standard_qp_callback() and then we'll try to create some Skip Scan
paths (which are marked with UniqueKeys) if the base relation does not
already have UniqueKeys that satisfy the required UniqueKeys that were
set during standard_qp_callback(). In the future, there may be other
reasons to create Skip Scan paths for certain rels, e.g if they're on
the inner side of a SEMI/ANTI join, it might be useful to try those
when planning joins.

Yes, In current implementation, we also add UniqueKey during create_xxx_paths,

xxx may be grouping/union. after the index skipscan patch, we can do the similar

things in create_indexskip_path.

Best Regards

Andy Fan

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

21 May 2020, 23:41:11

if I understand correctly those two are introducing the concept, and

others are just using it

You are understand it correctly.

Best Regards

Andy Fan

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

21 May 2020, 23:49:07

On Fri, May 22, 2020 at 4:52 AM David Rowley <dgrowleyml@gmail.com> wrote:

On Thu, 14 May 2020 at 14:39, Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
> On Thu, May 14, 2020 at 6:20 AM David Rowley <dgrowleyml@gmail.com> wrote:
>> Having the "onerow" flag was not how I intended it to work.
>>
> Thanks for the detailed explanation. So I think we do need to handle onerow
> specially, (It means more things than adding each column as a UniqueKey).
> but we don't need the onerow flag since we can tell it by ukey->exprs == NIL.
>
> During the developer of this feature, I added some Asserts as double checking
> for onerow and exprs. it helps me to find some special cases. like
> SELECT FROM multirows union SELECT FROM multirows; where targetlist is NIL.
> (I find the above case returns onerow as well just now). so onerow flag allows us
> check this special things with more attention. Even this is not the original intention
> but looks it is the one purpose now.

But surely that special case should just go in
populate_unionrel_uniquekeys(). If the targetlist is empty, then add a
UniqueKey with an empty set of exprs.

This is correct on this special case.

> However I am feeling that removing onerow flag doesn't require much of code
> changes. Almost all the special cases which are needed before are still needed
> after that and all the functions based on that like relation_is_onerow
> /add_uniquekey_onerow is still valid, we just need change the implementation.
> so do you think we need to remove onerow flag totally?

Well, at the moment I'm not quite understanding why it's needed. If
it's not needed then we should remove it. If it turns out there is
some valid reason for it, then we should keep it.

Currently I uses it to detect more special case which we can't image at first, we can

understand it as it used to debug/Assert purpose only. After the mainly code is

reviewed, that can be removed (based on the change is tiny).

Best Regards

Andy Fan

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Dmitry Dolgov

Date:

23 May 2020, 16:16:34

> On Fri, May 22, 2020 at 08:40:17AM +1200, David Rowley wrote:
>
> The way I picture the two working together is that the core UniqueKey
> patch adds UniqueKeys to RelOptInfos to allow us to have knowledge
> about what they're unique on based on the base relation's unique
> indexes.

Yep, I'm onboard with the idea.

> For Skipscans, that patch must expand on UniqueKeys to teach Paths
> about them.

That's already there.

> I imagine we'll set some required UniqueKeys during
> standard_qp_callback()

In standard_qp_callback, because pathkeys are computed at this point I
guess?

> and then we'll try to create some Skip Scan
> paths (which are marked with UniqueKeys) if the base relation does not
> already have UniqueKeys that satisfy the required UniqueKeys that were
> set during standard_qp_callback().

For a simple distinct query those UniqueKeys would be set based on
distinct clause. If I understand correctly, the very same is implemented
right now in create_distinct_paths, just after building all index paths,
so wouldn't it be just a duplication?

In general UniqueKeys in the skip scan patch were created from
distinctClause in build_index_paths (looks similar to what you've
described) and then based on them created index skip scan paths. So my
expectations were that the patch from this thread would work similar.

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

David Rowley

Date:

24 May 2020, 18:34:30

On Sun, 24 May 2020 at 04:14, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Fri, May 22, 2020 at 08:40:17AM +1200, David Rowley wrote:
> > I imagine we'll set some required UniqueKeys during
> > standard_qp_callback()
>
> In standard_qp_callback, because pathkeys are computed at this point I
> guess?

Yes. In particular, we set the pathkeys for DISTINCT clauses there.

> > and then we'll try to create some Skip Scan
> > paths (which are marked with UniqueKeys) if the base relation does not
> > already have UniqueKeys that satisfy the required UniqueKeys that were
> > set during standard_qp_callback().
>
> For a simple distinct query those UniqueKeys would be set based on
> distinct clause. If I understand correctly, the very same is implemented
> right now in create_distinct_paths, just after building all index paths,
> so wouldn't it be just a duplication?

I think we need to create the skip scan paths when we create the other
paths for base relations. We shouldn't be adjusting existing index
paths during create_distinct_paths().  The last code I saw for the
skip scans patch did something like if (IsA(path, IndexScanPath)) in
create_distinct_paths(), but that's only ever going to work when the
query is to a single relation. You'll never see IndexScanPaths in the
upper planner's paths when there are joins. You'd see join type paths
instead.  It is possible to make use of skip scans for DISTINCT when
the query has joins. We'd just need to ensure the join does not
duplicate the unique rows from the skip scanned relation.

> In general UniqueKeys in the skip scan patch were created from
> distinctClause in build_index_paths (looks similar to what you've
> described) and then based on them created index skip scan paths. So my
> expectations were that the patch from this thread would work similar.

The difference will be that you'd be setting some distinct_uniquekeys
in standard_qp_callback() to explicitly request that some skip scan
paths be created for the uniquekeys, whereas the patch here just does
not bother doing DISTINCT if the upper relation already has unique
keys that state that the DISTINCT is not required. The skip scans
patch should check if the RelOptInfo for the uniquekeys set in
standard_qp_callback() are already mentioned in the RelOptInfo's
uniquekeys.  If they are then there's no point in skip scanning as the
rel is already unique for the distinct_uniquekeys.

David

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Dmitry Dolgov

Date:

25 May 2020, 07:16:26

> On Mon, May 25, 2020 at 06:34:30AM +1200, David Rowley wrote:
>
> > For a simple distinct query those UniqueKeys would be set based on
> > distinct clause. If I understand correctly, the very same is implemented
> > right now in create_distinct_paths, just after building all index paths,
> > so wouldn't it be just a duplication?
>
> I think we need to create the skip scan paths when we create the other
> paths for base relations. We shouldn't be adjusting existing index
> paths during create_distinct_paths().  The last code I saw for the
> skip scans patch did something like if (IsA(path, IndexScanPath)) in
> create_distinct_paths()

It's not the case since the late March.

> > In general UniqueKeys in the skip scan patch were created from
> > distinctClause in build_index_paths (looks similar to what you've
> > described) and then based on them created index skip scan paths. So my
> > expectations were that the patch from this thread would work similar.
>
> The difference will be that you'd be setting some distinct_uniquekeys
> in standard_qp_callback() to explicitly request that some skip scan
> paths be created for the uniquekeys, whereas the patch here just does
> not bother doing DISTINCT if the upper relation already has unique
> keys that state that the DISTINCT is not required. The skip scans
> patch should check if the RelOptInfo for the uniquekeys set in
> standard_qp_callback() are already mentioned in the RelOptInfo's
> uniquekeys.  If they are then there's no point in skip scanning as the
> rel is already unique for the distinct_uniquekeys.

It sounds like it makes semantics of UniqueKey a bit more confusing,
isn't it? At the moment it says:

     Represents the unique properties held by a RelOptInfo.

With the proposed changes it would be "unique properties, that are held"
and "unique properties, that are requested", which are partially
duplicated, but stored in some different fields. From the skip scan
patch perspective it's probably doesn't make any difference, seems like
the implementation would be almost the same, just created UniqueKeys
would be of different type. But I'm afraid potentiall future users of
UniqueKeys could be easily confused.

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

David Rowley

Date:

05 June 2020, 00:26:15

On Mon, 25 May 2020 at 19:14, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Mon, May 25, 2020 at 06:34:30AM +1200, David Rowley wrote:
> > The difference will be that you'd be setting some distinct_uniquekeys
> > in standard_qp_callback() to explicitly request that some skip scan
> > paths be created for the uniquekeys, whereas the patch here just does
> > not bother doing DISTINCT if the upper relation already has unique
> > keys that state that the DISTINCT is not required. The skip scans
> > patch should check if the RelOptInfo for the uniquekeys set in
> > standard_qp_callback() are already mentioned in the RelOptInfo's
> > uniquekeys.  If they are then there's no point in skip scanning as the
> > rel is already unique for the distinct_uniquekeys.
>
> It sounds like it makes semantics of UniqueKey a bit more confusing,
> isn't it? At the moment it says:
>
>      Represents the unique properties held by a RelOptInfo.
>
> With the proposed changes it would be "unique properties, that are held"
> and "unique properties, that are requested", which are partially
> duplicated, but stored in some different fields. From the skip scan
> patch perspective it's probably doesn't make any difference, seems like
> the implementation would be almost the same, just created UniqueKeys
> would be of different type. But I'm afraid potentiall future users of
> UniqueKeys could be easily confused.

If there's some comment that says UniqueKeys are for RelOptInfos, then
perhaps that comment just needs to be expanded to mention the Path
uniqueness when we add the uniquekeys field to Path.

I think the main point of basing skip scans on top of this uniquekeys
patch is to ensure it's the right thing for the job. I don't think
it's realistic to be maintaining two different sets of infrastructure
which serve a very similar purpose. It's important we make UniqueKeys
general purpose enough to support future useful forms of optimisation.
Basing skip scans on it seems like a good exercise towards that. I'm
not expecting that we need to make zero changes here to allow it to
work well with skip scans.

David

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

05 June 2020, 02:36:39

On Mon, May 25, 2020 at 2:34 AM David Rowley <dgrowleyml@gmail.com> wrote:

On Sun, 24 May 2020 at 04:14, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Fri, May 22, 2020 at 08:40:17AM +1200, David Rowley wrote:
> > I imagine we'll set some required UniqueKeys during
> > standard_qp_callback()
>
> In standard_qp_callback, because pathkeys are computed at this point I
> guess?

Yes. In particular, we set the pathkeys for DISTINCT clauses there.

Actually I have some issues to understand from here, then try to read index

skip scan patch to fully understand what is the requirement, but that doesn't

get it so far[1]. So what is the "UniqueKeys" in "UniqueKeys during

standard_qp_callback()" and what is the "pathkeys" in "pathkeys are computed

at this point” means? I tried to think it as root->distinct_pathkeys, however I

didn't fully understand where root->distinct_pathkeys is used for as well.

[1] https://www.postgresql.org/message-id/CAKU4AWq%3DwWkAo-CDOQ5Ea6UwYvZCgb501w6iqU0rtnTT-zg6bQ%40mail.gmail.com

Best Regards

Andy Fan

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

David Rowley

Date:

05 June 2020, 02:57:09

On Fri, 5 Jun 2020 at 14:36, Andy Fan <zhihui.fan1213@gmail.com> wrote:
> On Mon, May 25, 2020 at 2:34 AM David Rowley <dgrowleyml@gmail.com> wrote:
>>
>> On Sun, 24 May 2020 at 04:14, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>> >
>> > > On Fri, May 22, 2020 at 08:40:17AM +1200, David Rowley wrote:
>> > > I imagine we'll set some required UniqueKeys during
>> > > standard_qp_callback()
>> >
>> > In standard_qp_callback, because pathkeys are computed at this point I
>> > guess?
>>
>> Yes. In particular, we set the pathkeys for DISTINCT clauses there.
>>
>
> Actually I have some issues to understand from here,  then try to read index
> skip scan patch to fully understand what is the requirement, but that doesn't
> get it so far[1].  So what  is the "UniqueKeys" in "UniqueKeys during
> standard_qp_callback()" and what is the "pathkeys" in "pathkeys are computed
> at this point” means?  I tried to think it as root->distinct_pathkeys,  however I
> didn't fully understand where root->distinct_pathkeys is used for as well.

In standard_qp_callback(), what we'll do with uniquekeys is pretty
much what we already do with pathkeys there. Basically pathkeys are
set there to have the planner attempt to produce a plan that satisfies
those pathkeys.  Notice at the end of standard_qp_callback() we set
the pathkeys according to the first upper planner operation that'll
need to make use of those pathkeys.  e.g, If there's a GROUP BY and a
DISTINCT in the query, then use the pathkeys for GROUP BY, since that
must occur before DISTINCT.  Likely uniquekeys will want to follow the
same rules there for the operations that can make use of paths with
uniquekeys, which in this case, I believe, will be the same as the
example I just mentioned for pathkeys, except we'll only be able to
support GROUP BY without any aggregate functions.

David

> [1] https://www.postgresql.org/message-id/CAKU4AWq%3DwWkAo-CDOQ5Ea6UwYvZCgb501w6iqU0rtnTT-zg6bQ%40mail.gmail.com

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

05 June 2020, 04:20:58

On Fri, Jun 5, 2020 at 10:57 AM David Rowley <dgrowleyml@gmail.com> wrote:

On Fri, 5 Jun 2020 at 14:36, Andy Fan <zhihui.fan1213@gmail.com> wrote:
> On Mon, May 25, 2020 at 2:34 AM David Rowley <dgrowleyml@gmail.com> wrote:
>>
>> On Sun, 24 May 2020 at 04:14, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>> >
>> > > On Fri, May 22, 2020 at 08:40:17AM +1200, David Rowley wrote:
>> > > I imagine we'll set some required UniqueKeys during
>> > > standard_qp_callback()
>> >
>> > In standard_qp_callback, because pathkeys are computed at this point I
>> > guess?
>>
>> Yes. In particular, we set the pathkeys for DISTINCT clauses there.
>>
>
> Actually I have some issues to understand from here, then try to read index
> skip scan patch to fully understand what is the requirement, but that doesn't
> get it so far[1]. So what is the "UniqueKeys" in "UniqueKeys during
> standard_qp_callback()" and what is the "pathkeys" in "pathkeys are computed
> at this point” means? I tried to think it as root->distinct_pathkeys, however I
> didn't fully understand where root->distinct_pathkeys is used for as well.

In standard_qp_callback(), what we'll do with uniquekeys is pretty
much what we already do with pathkeys there. Basically pathkeys are
set there to have the planner attempt to produce a plan that satisfies
those pathkeys. Notice at the end of standard_qp_callback() we set

the pathkeys according to the first upper planner operation that'll
need to make use of those pathkeys. e.g, If there's a GROUP BY and a
DISTINCT in the query, then use the pathkeys for GROUP BY, since that
must occur before DISTINCT.

Thanks for your explanation. Looks I understand now based on your comments.

Take root->group_pathkeys for example, the similar information also available in

root->parse->groupClauses but we do use of root->group_pathkeys with

pathkeys_count_contained_in function in many places, that is mainly because

the content between between the 2 is different some times, like the case in

pathkey_is_redundant.

Likely uniquekeys will want to follow the
same rules there for the operations that can make use of paths with
uniquekeys, which in this case, I believe, will be the same as the
example I just mentioned for pathkeys, except we'll only be able to
support GROUP BY without any aggregate functions.

All the places I want to use UniqueKey so far (like distinct, group by and others)

have an input_relation (RelOptInfo), and the UniqueKey information can be get

there. at the same time, all the pathkey in PlannerInfo is used for Upper planner

but UniqueKey may be used in current planner some time, like reduce_semianti_joins/

remove_useless_join, I am not sure if we must maintain uniquekey in PlannerInfo.

Best Regards

Andy Fan

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Dmitry Dolgov

Date:

06 June 2020, 09:17:51

> On Fri, Jun 05, 2020 at 12:26:15PM +1200, David Rowley wrote:
> On Mon, 25 May 2020 at 19:14, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> >
> > > On Mon, May 25, 2020 at 06:34:30AM +1200, David Rowley wrote:
> > > The difference will be that you'd be setting some distinct_uniquekeys
> > > in standard_qp_callback() to explicitly request that some skip scan
> > > paths be created for the uniquekeys, whereas the patch here just does
> > > not bother doing DISTINCT if the upper relation already has unique
> > > keys that state that the DISTINCT is not required. The skip scans
> > > patch should check if the RelOptInfo for the uniquekeys set in
> > > standard_qp_callback() are already mentioned in the RelOptInfo's
> > > uniquekeys.  If they are then there's no point in skip scanning as the
> > > rel is already unique for the distinct_uniquekeys.
> >
> > It sounds like it makes semantics of UniqueKey a bit more confusing,
> > isn't it? At the moment it says:
> >
> >      Represents the unique properties held by a RelOptInfo.
> >
> > With the proposed changes it would be "unique properties, that are held"
> > and "unique properties, that are requested", which are partially
> > duplicated, but stored in some different fields. From the skip scan
> > patch perspective it's probably doesn't make any difference, seems like
> > the implementation would be almost the same, just created UniqueKeys
> > would be of different type. But I'm afraid potentiall future users of
> > UniqueKeys could be easily confused.
>
> If there's some comment that says UniqueKeys are for RelOptInfos, then
> perhaps that comment just needs to be expanded to mention the Path
> uniqueness when we add the uniquekeys field to Path.

My concerns are more about having two different sets of distinct
uniquekeys:

* one prepared in standard_qp_callback for skip scan (I guess those
  should be added to PlannerInfo?)

* one in create_distinct_paths as per current implementation

with what seems to be similar content.

> I think the main point of basing skip scans on top of this uniquekeys
> patch is to ensure it's the right thing for the job. I don't think
> it's realistic to be maintaining two different sets of infrastructure
> which serve a very similar purpose. It's important we make UniqueKeys
> general purpose enough to support future useful forms of optimisation.
> Basing skip scans on it seems like a good exercise towards that. I'm
> not expecting that we need to make zero changes here to allow it to
> work well with skip scans.

Sure, no one suggests to have two ways of saying "this thing is unique".
I'm just trying to figure out how to make skip scan and uniquekeys play
together without having rough edges.

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

David Rowley

Date:

07 June 2020, 06:51:22

On Sat, 6 Jun 2020 at 21:15, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> My concerns are more about having two different sets of distinct
> uniquekeys:
>
> * one prepared in standard_qp_callback for skip scan (I guess those
>   should be added to PlannerInfo?)

Yes. Those must be set so that we know if and what we should try to
create Skip Scan Index paths for. Just like we'll create index paths
for PlannerInfo.query_pathkeys.

> * one in create_distinct_paths as per current implementation
>
> with what seems to be similar content.

I think we need to have UniqueKeys in RelOptInfo so we can describe
what a relation is unique by.  There's no point for example in
creating skip scan paths for a relation that's already unique on
whatever we might try to skip scan on. e.g someone does:

SELECT DISTINCT unique_and_indexed_column FROM tab;

Since there's a unique index on unique_and_indexed_column then we
needn't try to create a skipscan path for it.

However, the advantages of having UniqueKeys on the RelOptInfo goes a
little deeper than that.  We can make use of it anywhere where we
currently do relation_has_unique_index_for() for. Plus we get what
Andy wants and can skip useless DISTINCT operations when the result is
already unique on the distinct clause.  Sure we could carry all the
relation's unique properties around in Paths, but that's not the right
place. It's logically a property of the relation, not the path
specifically.  RelOptInfo is a good place to store the properties of
relations.

The idea of the meaning of uniquekeys within a path is that the path
is specifically making those keys unique.  We're not duplicating the
RelOptInfo's uniquekeys there.

If we have a table like:

CREATE TABLE tab (
   a INT PRIMARY KEY,
   b INT NOT NULL
);

CREATE INDEX tab_b_idx ON tab (b);

Then I'd expect a query such as: SELECT DISTINCT b FROM tab; to have
the uniquekeys for tab's RelOptInfo set to {a}, and the seqscan and
index scan paths uniquekey properties set to NULL, but the skipscan
index path uniquekeys for tab_b_idx set to {b}.  Then when we go
create the distinct paths Andy's work will see that there's no
RelOptInfo uniquekeys for the distinct clause, but the skip scan work
will loop over the unique_pathlist and find that we have a skipscan
path with the required uniquekeys, a.k.a {b}.

Does that make sense?

David

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Dmitry Dolgov

Date:

09 June 2020, 10:29:13

> On Sun, Jun 07, 2020 at 06:51:22PM +1200, David Rowley wrote:
>
> > * one in create_distinct_paths as per current implementation
> >
> > with what seems to be similar content.
>
> I think we need to have UniqueKeys in RelOptInfo so we can describe
> what a relation is unique by.  There's no point for example in
> creating skip scan paths for a relation that's already unique on
> whatever we might try to skip scan on. e.g someone does:
>
> SELECT DISTINCT unique_and_indexed_column FROM tab;
>
> Since there's a unique index on unique_and_indexed_column then we
> needn't try to create a skipscan path for it.
>
> However, the advantages of having UniqueKeys on the RelOptInfo goes a
> little deeper than that.  We can make use of it anywhere where we
> currently do relation_has_unique_index_for() for. Plus we get what
> Andy wants and can skip useless DISTINCT operations when the result is
> already unique on the distinct clause.  Sure we could carry all the
> relation's unique properties around in Paths, but that's not the right
> place. It's logically a property of the relation, not the path
> specifically.  RelOptInfo is a good place to store the properties of
> relations.
>
> The idea of the meaning of uniquekeys within a path is that the path
> is specifically making those keys unique.  We're not duplicating the
> RelOptInfo's uniquekeys there.
>
> If we have a table like:
>
> CREATE TABLE tab (
>    a INT PRIMARY KEY,
>    b INT NOT NULL
> );
>
> CREATE INDEX tab_b_idx ON tab (b);
>
> Then I'd expect a query such as: SELECT DISTINCT b FROM tab; to have
> the uniquekeys for tab's RelOptInfo set to {a}, and the seqscan and
> index scan paths uniquekey properties set to NULL, but the skipscan
> index path uniquekeys for tab_b_idx set to {b}.  Then when we go
> create the distinct paths Andy's work will see that there's no
> RelOptInfo uniquekeys for the distinct clause, but the skip scan work
> will loop over the unique_pathlist and find that we have a skipscan
> path with the required uniquekeys, a.k.a {b}.
>
> Does that make sense?

Yes, from this point of view it makes sense. I've already posted the
first version of index skip scan based on this implementation [1]. There
could be rought edges, but overall I hope we're on the same page.

[1]: https://www.postgresql.org/message-id/flat/20200609102247.jdlatmfyeecg52fi%40localhost

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

29 June 2020, 09:59:21

I just did another self-review about this patch and took some suggestions based

on the discussion above. The attached is the v9 version. When you check the

uniquekey patch, README.uniquekey should be a good place to start with.

Main changes in v9 includes:

1. called populate_baserel_uniquekeys after check_index_predicates.

2. removed the UniqueKey->onerow flag since we can tell it by exprs == NIL.

3. expression index code improvement.

4. code & comments refactoring.

As for the Index Skip Scan, I still have not merged the changes in the Index

Skip Scan patch[1]. We may need some addition for that, but probably not

need to modify the existing code. After we can finalize it, we can add it in

that patch. I will keep a close eye on it as well.

[1]: https://www.postgresql.org/message-id/flat/20200609102247.jdlatmfyeecg52fi%40localhost

Best Regards

Andy Fan

Attachment

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

19 July 2020, 03:03:26

Fixed a test case in v10.

Best Regards

Andy Fan

Attachment

RE: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Floris Van Nee

Date:

22 July 2020, 19:22:09

Hi Andy,

A small thing I found:

+static List *

+get_exprs_from_uniqueindex(IndexOptInfo *unique_index,

+ List *const_exprs,

+ List *const_expr_opfamilies,

+ Bitmapset *used_varattrs,

+ bool *useful,

+ bool *multi_nullvals)

…

+ indexpr_item = list_head(unique_index->indexprs);

+ for(c = 0; c < unique_index->ncolumns; c++)

+ {

I believe the for loop must be over unique_index->nkeycolumns, rather than columns. It shouldn’t include the extra non-key columns. This can currently lead to invalid memory accesses as well a few lines later when it does an array access of unique_index->opfamily[c] – this array only has nkeycolumns entries.

-Floris

From: Andy Fan <zhihui.fan1213@gmail.com>
Sent: Sunday 19 July 2020 5:03 AM
To: Dmitry Dolgov <9erthalion6@gmail.com>
Cc: David Rowley <dgrowleyml@gmail.com>; PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>; Tom Lane <tgl@sss.pgh.pa.us>; Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>; rushabh.lathia@gmail.com
Subject: Re: [PATCH] Keeps tracking the uniqueness with UniqueKey [External]

Fixed a test case in v10.

Best Regards

Andy Fan

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

03 August 2020, 22:59:50

Hi Floris:

On Thu, Jul 23, 2020 at 3:22 AM Floris Van Nee <florisvannee@optiver.com> wrote:

Hi Andy,

A small thing I found:

+static List *
+get_exprs_from_uniqueindex(IndexOptInfo *unique_index,
+                                                                                                List *const_exprs,
+                                                                                                List *const_expr_opfamilies,
+                                                                                                Bitmapset *used_varattrs,
+                                                                                                bool *useful,
+                                                                                                bool *multi_nullvals)
…
+             indexpr_item = list_head(unique_index->indexprs);
+             for(c = 0; c < unique_index->ncolumns; c++)
+             {

I believe the for loop must be over unique_index->nkeycolumns, rather than columns. It shouldn’t include the extra non-key columns. This can currently lead to invalid memory accesses as well a few lines later when it does an array access of unique_index->opfamily[c] – this array only has nkeycolumns entries.

You are correct, I would include this in the next version patch, Thank you

for this checking!

Andy Fan

Best Regards

From: Andy Fan <zhihui.fan1213@gmail.com>
Sent: Sunday 19 July 2020 5:03 AM
To: Dmitry Dolgov <9erthalion6@gmail.com>
Cc: David Rowley <dgrowleyml@gmail.com>; PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>; Tom Lane <tgl@sss.pgh.pa.us>; Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>; rushabh.lathia@gmail.com
Subject: Re: [PATCH] Keeps tracking the uniqueness with UniqueKey [External]

Fixed a test case in v10.

--
Best Regards
Andy Fan

Best Regards

Andy Fan

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Michael Paquier

Date:

07 September 2020, 07:22:16

On Tue, Aug 04, 2020 at 06:59:50AM +0800, Andy Fan wrote:
> You are correct, I would include this in the next version patch, Thank you
> for this checking!

Regression tests are failing with this patch set applied.  The CF bot
says so, and I can reproduce that locally as well.  Could you look at
that please?  I have switched the patch to "waiting on author".
--
Michael

Attachment

signature.asc

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

08 September 2020, 23:51:12

On Mon, Sep 7, 2020 at 3:22 PM Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Aug 04, 2020 at 06:59:50AM +0800, Andy Fan wrote:
> You are correct, I would include this in the next version patch, Thank you
> for this checking!

Regression tests are failing with this patch set applied. The CF bot
says so, and I can reproduce that locally as well. Could you look at
that please? I have switched the patch to "waiting on author".
--
Michael

Thank you Michael for checking it, I can reproduce the same locally after

rebasing to the latest master. The attached v11 has fixed it and includes

the fix Floris found.

The status of this patch is we are still in discussion about which data type should
UniqueKey->expr use. Both David [1] and I [2] shared some thinking about
EquivalenceClasses, but neither of us have decided on it. So I still didn't change

anything about that now. I can change it once we have decided on it.

[1] https://www.postgresql.org/message-id/CAApHDvoDMyw%3DhTuW-258yqNK4bhW6CpguJU_GZBh4x%2Brnoem3w%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAKU4AWqy3Uv67%3DPR8RXG6LVoO-cMEwfW_LMwTxHdGrnu%2Bcf%2BdA%40mail.gmail.com

Best Regards

Andy Fan

Attachment

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Dmitry Dolgov

Date:

07 October 2020, 13:55:32

> On Wed, Sep 09, 2020 at 07:51:12AM +0800, Andy Fan wrote:
>
> Thank you Michael for checking it, I can reproduce the same locally after
> rebasing to the latest master. The attached v11 has fixed it and includes
> the fix Floris found.
>
> The status of this patch is we are still in discussion about which data
> type should
> UniqueKey->expr use.  Both David [1] and I [2] shared some thinking about
> EquivalenceClasses, but neither of us have decided on it. So I still didn't
> change
> anything about that  now.   I can change it once we have decided on it.
>
> [1]
> https://www.postgresql.org/message-id/CAApHDvoDMyw%3DhTuW-258yqNK4bhW6CpguJU_GZBh4x%2Brnoem3w%40mail.gmail.com
>
> [2]
> https://www.postgresql.org/message-id/CAKU4AWqy3Uv67%3DPR8RXG6LVoO-cMEwfW_LMwTxHdGrnu%2Bcf%2BdA%40mail.gmail.com

Hi,

In the Index Skip Scan thread Peter mentioned couple of issues that I
believe need to be addressed here. In fact one about valgrind errors was
already fixed as far as I see (nkeycolumns instead of ncolumns), another
one was:

/code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:
In function ‘populate_baserel_uniquekeys’:
/code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:797:13:
warning: ‘expr’ may be used uninitialized in this function
[-Wmaybe-uninitialized]
  797 |   else if (!list_member(unique_index->rel->reltarget->exprs, expr))
      |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Other than that I wanted to ask what are the plans to proceed with this
patch? It's been a while since the question was raised in which format
to keep unique key expressions, and as far as I can see no detailed
suggestions or patch changes were proposed as a follow up. Obviously I
would love to see the first two preparation patches committed to avoid
dependencies between patches, and want to suggest an incremental
approach with simple format for start (what we have right now) with the
idea how to extend it in the future to cover more cases.

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

08 October 2020, 01:34:51

On Wed, Oct 7, 2020 at 9:55 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

> On Wed, Sep 09, 2020 at 07:51:12AM +0800, Andy Fan wrote:
>
> Thank you Michael for checking it, I can reproduce the same locally after
> rebasing to the latest master. The attached v11 has fixed it and includes
> the fix Floris found.
>
> The status of this patch is we are still in discussion about which data
> type should
> UniqueKey->expr use. Both David [1] and I [2] shared some thinking about
> EquivalenceClasses, but neither of us have decided on it. So I still didn't
> change
> anything about that now. I can change it once we have decided on it.
>
> [1]
> https://www.postgresql.org/message-id/CAApHDvoDMyw%3DhTuW-258yqNK4bhW6CpguJU_GZBh4x%2Brnoem3w%40mail.gmail.com
>
> [2]
> https://www.postgresql.org/message-id/CAKU4AWqy3Uv67%3DPR8RXG6LVoO-cMEwfW_LMwTxHdGrnu%2Bcf%2BdA%40mail.gmail.com

Hi,

In the Index Skip Scan thread Peter mentioned couple of issues that I
believe need to be addressed here. In fact one about valgrind errors was
already fixed as far as I see (nkeycolumns instead of ncolumns), another
one was:

/code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:
In function ‘populate_baserel_uniquekeys’:
/code/postgresql/patch/build/../source/src/backend/optimizer/path/uniquekeys.c:797:13:
warning: ‘expr’ may be used uninitialized in this function
[-Wmaybe-uninitialized]
797 | else if (!list_member(unique_index->rel->reltarget->exprs, expr))
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I can fix this warning in the next version, thanks for reporting it. It can be

fixed like below or just adjust the if-elseif-else pattern.

--- a/src/backend/optimizer/path/uniquekeys.c
+++ b/src/backend/optimizer/path/uniquekeys.c
@@ -760,6 +760,7 @@ get_exprs_from_uniqueindex(IndexOptInfo *unique_index,
{
/* Index on system column is not supported */
Assert(false);
+ expr = NULL; /* make compiler happy */

}

Other than that I wanted to ask what are the plans to proceed with this
patch? It's been a while since the question was raised in which format
to keep unique key expressions, and as far as I can see no detailed
suggestions or patch changes were proposed as a follow up. Obviously I
would love to see the first two preparation patches committed to avoid
dependencies between patches, and want to suggest an incremental
approach with simple format for start (what we have right now) with the
idea how to extend it in the future to cover more cases.

I think the hardest part of this series is commit 2, it probably needs lots of

dedicated time to review which would be the hardest part for the reviewers.

I don't have a good suggestion, however.

Best Regards

Andy Fan

RE: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

"Hou, Zhijie"

Date:

08 October 2020, 04:12:37

Hi

I have a look over this patch and find some typos in 0002.

1.Some typos about unique:
There are some spelling mistakes about "unique" in code comments and README.
Such as: "+However we define the UnqiueKey as below."

2.function name about initililze_uniquecontext_for_joinrel:
May be it should be initialize_ uniquecontext_for_joinrel.

3.some typos in comment:
+             * baserelation's basicrestrictinfo. so it must be in ON clauses.

I think it shoule be " basicrestrictinfo " => "baserestrictinfo".


Besides, I think list_copy can be used to simplify the following code.
(But It seems the type of expr is still in discussion, so this may has no impact )
+    List    *exprs = NIL;
...
+    foreach(lc, unionrel->reltarget->exprs)
+    {
+        exprs = lappend(exprs, lfirst(lc));
+    }

Best regards,

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Dmitry Dolgov

Date:

08 October 2020, 10:39:38

> On Thu, Oct 08, 2020 at 09:34:51AM +0800, Andy Fan wrote:
>
> > Other than that I wanted to ask what are the plans to proceed with this
> > patch? It's been a while since the question was raised in which format
> > to keep unique key expressions, and as far as I can see no detailed
> > suggestions or patch changes were proposed as a follow up. Obviously I
> > would love to see the first two preparation patches committed to avoid
> > dependencies between patches, and want to suggest an incremental
> > approach with simple format for start (what we have right now) with the
> > idea how to extend it in the future to cover more cases.
> >
>
> I think the hardest part of this series is commit 2,  it probably needs
> lots of
> dedicated time to review which would be the hardest part for the reviewers.
> I don't have a good suggestion, however.

Sure, and I would review the patch as well. But as far as I understand
the main issue is "how to store uniquekey expressions", and as long as
it is not decided, no additional review will move the patch forward I
guess.

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

12 October 2020, 02:32:39

On Thu, Oct 8, 2020 at 6:39 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

> On Thu, Oct 08, 2020 at 09:34:51AM +0800, Andy Fan wrote:
>
> > Other than that I wanted to ask what are the plans to proceed with this
> > patch? It's been a while since the question was raised in which format
> > to keep unique key expressions, and as far as I can see no detailed
> > suggestions or patch changes were proposed as a follow up. Obviously I
> > would love to see the first two preparation patches committed to avoid
> > dependencies between patches, and want to suggest an incremental
> > approach with simple format for start (what we have right now) with the
> > idea how to extend it in the future to cover more cases.
> >
>
> I think the hardest part of this series is commit 2, it probably needs
> lots of
> dedicated time to review which would be the hardest part for the reviewers.
> I don't have a good suggestion, however.

Sure, and I would review the patch as well.

Thank you very much!

But as far as I understand
the main issue is "how to store uniquekey expressions", and as long as
it is not decided, no additional review will move the patch forward I
guess.

I don't think so:) The patch may have other issues as well. For example,

logic error or duplicated code or cases needing improvement and so on.

Best Regards

Andy Fan

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

12 October 2020, 02:33:17

On Thu, Oct 8, 2020 at 12:12 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote:

Hi

I have a look over this patch and find some typos in 0002.

1.Some typos about unique:
There are some spelling mistakes about "unique" in code comments and README.
Such as: "+However we define the UnqiueKey as below."

2.function name about initililze_uniquecontext_for_joinrel:
May be it should be initialize_ uniquecontext_for_joinrel.

3.some typos in comment:
+ * baserelation's basicrestrictinfo. so it must be in ON clauses.

I think it shoule be " basicrestrictinfo " => "baserestrictinfo".

Besides, I think list_copy can be used to simplify the following code.
(But It seems the type of expr is still in discussion, so this may has no impact )
+ List *exprs = NIL;
...
+ foreach(lc, unionrel->reltarget->exprs)
+ {
+ exprs = lappend(exprs, lfirst(lc));
+ }

Best regards,

Thank you zhijie, I will fix them in next version.

Best Regards

Andy Fan

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

26 November 2020, 14:58:12

This patch has stopped moving for a while, any suggestion about

how to move on is appreciated.

Best Regards

Andy Fan

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Heikki Linnakangas

Date:

30 November 2020, 10:04:34

On 26/11/2020 16:58, Andy Fan wrote:
> This patch has stopped moving for a while,  any suggestion about
> how to move on is appreciated.

The question on whether UniqueKey.exprs should be a list of 
EquivalenceClasses or PathKeys is unresolved. I don't have an opinion on 
that, but I'd suggest that you pick one or the other and just go with 
it. If it turns out to be a bad choice, then we'll change it.

Quickly looking at the patches, there's one thing I think no one's 
mentioned yet, but looks really ugly to me:

> +        /* Make sure the path->parent point to current joinrel, can't update it in-place. */
> +        foreach(lc, outer_rel->pathlist)
> +        {
> +            Size sz = size_of_path(lfirst(lc));
> +            Path *path = palloc(sz);
> +            memcpy(path, lfirst(lc), sz);
> +            path->parent = joinrel;
> +            add_path(joinrel, path);
> +        }

Copying a Path and modifying it like that is not good, there's got to be 
a better way to do this. Perhaps wrap the original Paths in 
ProjectionPaths, where the ProjectionPath's parent is the joinrel and 
dummypp=true.

- Heikki

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Jesper Pedersen

Date:

30 November 2020, 14:30:19

Hi,

On 11/30/20 5:04 AM, Heikki Linnakangas wrote:
> On 26/11/2020 16:58, Andy Fan wrote:
>> This patch has stopped moving for a while,  any suggestion about
>> how to move on is appreciated.
>
> The question on whether UniqueKey.exprs should be a list of 
> EquivalenceClasses or PathKeys is unresolved. I don't have an opinion 
> on that, but I'd suggest that you pick one or the other and just go 
> with it. If it turns out to be a bad choice, then we'll change it.

In this case I think it is matter of deciding if we are going to use 
EquivalenceClasses or Exprs before going further; there has been work 
ongoing in this area for a while, so having a clear direction from a 
committer would be greatly appreciated.

Deciding would also help potential reviewers to give more feedback on 
the features implemented on top of the base.

Should there be a new thread with the minimum requirements in order to 
get closer ?

Best regards,
  Jesper

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Heikki Linnakangas

Date:

30 November 2020, 15:20:18

On 30/11/2020 16:30, Jesper Pedersen wrote:
> On 11/30/20 5:04 AM, Heikki Linnakangas wrote:
>> On 26/11/2020 16:58, Andy Fan wrote:
>>> This patch has stopped moving for a while,  any suggestion about
>>> how to move on is appreciated.
>>
>> The question on whether UniqueKey.exprs should be a list of
>> EquivalenceClasses or PathKeys is unresolved. I don't have an opinion
>> on that, but I'd suggest that you pick one or the other and just go
>> with it. If it turns out to be a bad choice, then we'll change it.
> 
> In this case I think it is matter of deciding if we are going to use
> EquivalenceClasses or Exprs before going further; there has been work
> ongoing in this area for a while, so having a clear direction from a
> committer would be greatly appreciated.

Plain Exprs are not good enough, because you need to know which operator 
the expression is unique on. Usually, it's the default = operator in the 
default btree opclass for the datatype, but it could be something else, too.

There's some precedence for PathKeys, as we generate PathKeys to 
represent the DISTINCT column in PlannerInfo->distinct_pathkeys. On the 
other hand, I've always found it confusing that we use PathKeys to 
represent DISTINCT and GROUP BY, which are not actually sort orderings. 
Perhaps it would  make sense to store EquivalenceClass+opfamily in 
UniqueKey, and also replace distinct_pathkeys and group_pathkeys with 
UniqueKeys.

That's just my 2 cents though, others more familiar with this planner 
code might have other opinions...

- Heikki

RE: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

"Hou, Zhijie"

Date:

01 December 2020, 05:45:35

Hi 

I look into the patch again and have some comments.

1.
+    Size oid_cmp_len = sizeof(Oid) * ind1->ncolumns;
+
+    return ind1->ncolumns == ind2->ncolumns &&
+        ind1->unique == ind2->unique &&
+        memcmp(ind1->indexkeys, ind2->indexkeys, sizeof(int) * ind1->ncolumns) == 0 &&
+        memcmp(ind1->opfamily, ind2->opfamily, oid_cmp_len) == 0 &&
+        memcmp(ind1->opcintype, ind2->opcintype, oid_cmp_len) == 0 &&
+        memcmp(ind1->sortopfamily, ind2->sortopfamily, oid_cmp_len) == 0 &&
+        equal(get_tlist_exprs(ind1->indextlist, true),
+              get_tlist_exprs(ind2->indextlist, true));

The length of sortopfamily,opfamily and opcintype seems ->nkeycolumns not ->ncolumns.
I checked function get_relation_info where init the IndexOptInfo.
(If there are more places where can change the length, please correct me)


2.

+    COPY_SCALAR_FIELD(ncolumns);
+    COPY_SCALAR_FIELD(nkeycolumns);
+    COPY_SCALAR_FIELD(unique);
+    COPY_SCALAR_FIELD(immediate);
+    /* We just need to know if it is NIL or not */
+    COPY_SCALAR_FIELD(indpred);
+    COPY_SCALAR_FIELD(predOK);
+    COPY_POINTER_FIELD(indexkeys, from->ncolumns * sizeof(int));
+    COPY_POINTER_FIELD(indexcollations, from->ncolumns * sizeof(Oid));
+    COPY_POINTER_FIELD(opfamily, from->ncolumns * sizeof(Oid));
+    COPY_POINTER_FIELD(opcintype, from->ncolumns * sizeof(Oid));
+    COPY_POINTER_FIELD(sortopfamily, from->ncolumns * sizeof(Oid));
+    COPY_NODE_FIELD(indextlist);

The same as 1.
Should use nkeycolumns if I am right.


3.
+    foreach(lc, newnode->indextlist)
+    {
+        TargetEntry *tle = lfirst_node(TargetEntry, lc);
+        /* Index on expression is ignored */
+        Assert(IsA(tle->expr, Var));
+        tle->expr = (Expr *) find_parent_var(appinfo, (Var *) tle->expr);
+        newnode->indexkeys[idx] = castNode(Var, tle->expr)->varattno;
+        idx++;
+    }

The count variable 'idx'  can be replaces by foreach_current_index().


Best regards,
houzj

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

05 December 2020, 15:10:28

Thank you Heikki for your attention.

On Mon, Nov 30, 2020 at 11:20 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 30/11/2020 16:30, Jesper Pedersen wrote:
> On 11/30/20 5:04 AM, Heikki Linnakangas wrote:
>> On 26/11/2020 16:58, Andy Fan wrote:
>>> This patch has stopped moving for a while, any suggestion about
>>> how to move on is appreciated.
>>
>> The question on whether UniqueKey.exprs should be a list of
>> EquivalenceClasses or PathKeys is unresolved. I don't have an opinion
>> on that, but I'd suggest that you pick one or the other and just go
>> with it. If it turns out to be a bad choice, then we'll change it.
>
> In this case I think it is matter of deciding if we are going to use
> EquivalenceClasses or Exprs before going further; there has been work
> ongoing in this area for a while, so having a clear direction from a
> committer would be greatly appreciated.

Plain Exprs are not good enough, because you need to know which operator
the expression is unique on. Usually, it's the default = operator in the
default btree opclass for the datatype, but it could be something else, too.

Actually I can't understand this, could you explain more? Based on my current

knowledge, when we run "SELECT DISTINCT a FROM t", we never care about

which operator to use for the unique.

There's some precedence for PathKeys, as we generate PathKeys to
represent the DISTINCT column in PlannerInfo->distinct_pathkeys. On the
other hand, I've always found it confusing that we use PathKeys to
represent DISTINCT and GROUP BY, which are not actually sort orderings.

OK, I have the same confusion now:)

Perhaps it would make sense to store EquivalenceClass+opfamily in
UniqueKey, and also replace distinct_pathkeys and group_pathkeys with
UniqueKeys.

I can understand why we need EquivalenceClass for UniqueKey, but I can't

understand why we need opfamily here.

For anyone who is interested with these patchsets, here is my plan about this

now. 1). I will try EquivalenceClass rather than Expr in UniqueKey and add opfamily

if needed. 2). I will start a new thread to continue this topic. The current thread is too long

which may scare some people who may have interest in it. 3). I will give up patch 5 & 6

for now. one reason I am not happy with the current implementation, and the other

reason is I want to make the patchset smaller to make the reviewer easier. I will not

give up them forever, after the main part of this patchset is committed, I will continue

with them in a new thread.

Thanks everyone for your input.

Best Regards

Andy Fan

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Heikki Linnakangas

Date:

05 December 2020, 18:40:33

On 05/12/2020 17:10, Andy Fan wrote:
> Actually I can't understand this, could you explain more?  Based on my 
> current
> knowledge,  when we run "SELECT DISTINCT a FROM t",  we never care about
> which operator to use for the unique.

SortGroupClause includes 'eqop' field, which determines the operator 
that the expression needs to made unique with. The syntax doesn't let 
you set it to anything else than the default btree opclass of the 
datatype, though. But you can specify it for ORDER BY, and we use 
SortGroupClauses to represent both sorting and grouping.

Also, if you use the same struct to also represent columns that you know 
to be unique, and not just the DISTINCT clause in the query, then you 
need the operator. For example, if you create a unique index on 
non-default opfamily.

>     There's some precedence for PathKeys, as we generate PathKeys to
>     represent the DISTINCT column in PlannerInfo->distinct_pathkeys. On the
>     other hand, I've always found it confusing that we use PathKeys to
>     represent DISTINCT and GROUP BY, which are not actually sort orderings.
> 
> 
> OK, I have the same confusion  now:)
> 
>     Perhaps it would  make sense to store EquivalenceClass+opfamily in
>     UniqueKey, and also replace distinct_pathkeys and group_pathkeys with
>     UniqueKeys.
> 
> 
> I can understand why we need EquivalenceClass for UniqueKey, but I can't
> understand why we need opfamily here.

Thinking a bit harder, I guess we don't. Because EquivalenceClass 
includes the operator family already, in the ec_opfamilies field.

> For anyone who is interested with these patchsets, here is my plan
> about this now.  1).  I will try EquivalenceClass rather than Expr in
> UniqueKey and add opfamily if needed. 2).  I will start a new thread
> to continue this topic. The current thread is too long which may
> scare some people who may have interest in it. 3). I will give up
> patch 5 & 6 for now.  one reason I am not happy with the current
> implementation, and the other reason is I want to make the patchset
> smaller to make the reviewer easier. I will not give up them forever,
> after the main part of this patchset is committed, I will continue 
> with them in a new thread. Thanks everyone for your input.
Sounds like a plan.

- Heikki

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Tom Lane

Date:

05 December 2020, 20:40:23

Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> I can understand why we need EquivalenceClass for UniqueKey, but I can't
>> understand why we need opfamily here.

> Thinking a bit harder, I guess we don't. Because EquivalenceClass 
> includes the operator family already, in the ec_opfamilies field.

No.  EquivalenceClasses only care about equality, which is why they
might potentially mention several opfamilies that share an equality
operator.  If you care about sort order, you *cannot* rely on an
EquivalenceClass to depict that.  Now, abstract uniqueness also only
cares about equality, but if you are going to implement it via sort-
and-unique then you need to settle on a sort order.

I agree we are overspecifying DISTINCT by settling on a sort operator at
parse time, rather than considering all the possibilities at plan time.
But given that opfamilies sharing equality are mostly a hypothetical
use-case, I'm not in a big hurry to fix it.  Before we had ASC/DESC
indexes, there was a real use-case for making a "reverse sort" opclass,
with the same equality as the type's regular opclass but the opposite sort
order.  But that's ancient history now, and I've seen few other plausible
use-cases.

I have not been following this thread closely enough to understand
why we need a new "UniqueKeys" data structure at all.  But if the
motivation is only to remove this overspecification, I humbly suggest
that it ain't worth the trouble.

            regards, tom lane

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

06 December 2020, 03:38:55

Thank you Tom and Heikki for your input.

On Sun, Dec 6, 2020 at 4:40 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Heikki Linnakangas <hlinnaka@iki.fi> writes:
>> I can understand why we need EquivalenceClass for UniqueKey, but I can't
>> understand why we need opfamily here.

> Thinking a bit harder, I guess we don't. Because EquivalenceClass
> includes the operator family already, in the ec_opfamilies field.

No. EquivalenceClasses only care about equality, which is why they
might potentially mention several opfamilies that share an equality
operator. If you care about sort order, you *cannot* rely on an
EquivalenceClass to depict that. Now, abstract uniqueness also only
cares about equality, but if you are going to implement it via sort-
and-unique then you need to settle on a sort order.

I think UniqueKey only cares about equality. Even DISTINCT / groupBy

can be implemented with sort, but UniqueKey only care about the result

of DISTINCT/GROUPBY, so it doesn't matter IIUC.

I agree we are overspecifying DISTINCT by settling on a sort operator at
parse time, rather than considering all the possibilities at plan time.
But given that opfamilies sharing equality are mostly a hypothetical
use-case, I'm not in a big hurry to fix it. Before we had ASC/DESC
indexes, there was a real use-case for making a "reverse sort" opclass,
with the same equality as the type's regular opclass but the opposite sort
order. But that's ancient history now, and I've seen few other plausible
use-cases.

I have not been following this thread closely enough to understand
why we need a new "UniqueKeys" data structure at all.

Currently the UniqueKey is defined as a List of Expr, rather than EquivalenceClasses.

A complete discussion until now can be found at [1] (The messages I replied to also

care a lot and the information is completed). This patch has stopped at this place for

a while, I'm planning to try EquivalenceClasses, but any suggestion would be welcome.

But if the
motivation is only to remove this overspecification, I humbly suggest
that it ain't worth the trouble.

regards, tom lane

[1] https://www.postgresql.org/message-id/CAKU4AWqy3Uv67%3DPR8RXG6LVoO-cMEwfW_LMwTxHdGrnu%2Bcf%2BdA%40mail.gmail.com

Best Regards

Andy Fan

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Jesper Pedersen

Date:

07 December 2020, 08:15:51

Hi,

On 12/5/20 10:38 PM, Andy Fan wrote:
> Currently the UniqueKey is defined as a List of Expr, rather than
> EquivalenceClasses.
> A complete discussion until now can be found at [1] (The messages I replied
> to also
> care a lot and the information is completed). This patch has stopped at
> this place for
> a while,  I'm planning to try EquivalenceClasses,  but any suggestion would
> be welcome.
> 
> 

Unfortunately I think we need a RfC style patch of both versions in 
their minimum implementation.

Hopefully this will make it easier for one or more committers to decide 
on the right direction since they can do a side-by-side comparison of 
the two solutions.

Just my $0.02.

Thanks for working on this Andy !

Best regards,
  Jesper

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

07 December 2020, 12:14:56

On Mon, Dec 7, 2020 at 4:16 PM Jesper Pedersen <jesper.pedersen@redhat.com> wrote:

Hi,

On 12/5/20 10:38 PM, Andy Fan wrote:
> Currently the UniqueKey is defined as a List of Expr, rather than
> EquivalenceClasses.
> A complete discussion until now can be found at [1] (The messages I replied
> to also
> care a lot and the information is completed). This patch has stopped at
> this place for
> a while, I'm planning to try EquivalenceClasses, but any suggestion would
> be welcome.
>
>

Unfortunately I think we need a RfC style patch of both versions in
their minimum implementation.

Hopefully this will make it easier for one or more committers to decide
on the right direction since they can do a side-by-side comparison of
the two solutions.

I do get the exact same idea. Actually I have made EquivalenceClasses

works with baserel last weekend and then I realized it is hard to compare

the 2 situations without looking into the real/Poc code, even for very

experienced people. I will submit a new patch after I get the partitioned

relation, subquery works. Hope I can make it in one week.

Just my $0.02.

Thanks for working on this Andy !

Best regards,
Jesper

Best Regards

Andy Fan

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Ashutosh Bapat

Date:

07 December 2020, 12:27:22

On Sun, Dec 6, 2020 at 9:09 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>>
>> I have not been following this thread closely enough to understand
>> why we need a new "UniqueKeys" data structure at all.
>
>
> Currently the UniqueKey is defined as a List of Expr, rather than EquivalenceClasses.
> A complete discussion until now can be found at [1] (The messages I replied to also
> care a lot and the information is completed). This patch has stopped at this place for
> a while,  I'm planning to try EquivalenceClasses,  but any suggestion would be welcome.
>
>>
>> But if the
>> motivation is only to remove this overspecification, I humbly suggest
>> that it ain't worth the trouble.

AFAIK, the simple answer is we need some way to tell that certain
expressions together form a unique key for a given relation. E.g.
group by clause forms a unique key for the output of GROUP BY.
Pathkeys have a stronger requirement that the relation is ordered on
that expression, which may not be the case with uniqueness e.g. output
of GROUP BY produced by hash grouping. To me it's Pathkeys - ordering,
so we could use Pathkeys with reduced strength. But that might affect
a lot of places which depend upon stronger pathkeys.

-- 
Best Wishes,
Ashutosh Bapat

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

David Rowley

Date:

09 December 2020, 06:13:48

On Sun, 6 Dec 2020 at 04:10, Andy Fan <zhihui.fan1213@gmail.com> wrote:
> For anyone who is interested with these patchsets, here is my plan about this
> now.  1).  I will try EquivalenceClass rather than Expr in UniqueKey and add opfamily
> if needed.

I agree that we should be storing them in EquivalenceClasses. Apart
from what was mentioned already it also allow the optimisation to work
in cases like:

create table t (a int not null unique, b int);
select distinct b from t where a = b;

David

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Masahiko Sawada

Date:

22 January 2021, 13:14:40

Hi Andy,

On Mon, Dec 7, 2020 at 9:15 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
>
>
> On Mon, Dec 7, 2020 at 4:16 PM Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
>>
>> Hi,
>>
>> On 12/5/20 10:38 PM, Andy Fan wrote:
>> > Currently the UniqueKey is defined as a List of Expr, rather than
>> > EquivalenceClasses.
>> > A complete discussion until now can be found at [1] (The messages I replied
>> > to also
>> > care a lot and the information is completed). This patch has stopped at
>> > this place for
>> > a while,  I'm planning to try EquivalenceClasses,  but any suggestion would
>> > be welcome.
>> >
>> >
>>
>> Unfortunately I think we need a RfC style patch of both versions in
>> their minimum implementation.
>>
>> Hopefully this will make it easier for one or more committers to decide
>> on the right direction since they can do a side-by-side comparison of
>> the two solutions.
>>
>
> I do get the exact same idea.  Actually I have made EquivalenceClasses
> works with baserel last weekend and then I realized it is hard to compare
> the 2 situations without looking into the real/Poc code, even for very
> experienced people.  I will submit a new patch after I get the partitioned
> relation, subquery works.  Hope I can make it in one week.

Status update for a commitfest entry.

Are you planning to submit a new patch? Or is there any blocker for
this work? This patch entry on CF app has been in state Waiting on
Author for a while. If there is any update on that, please reflect on
CF app.

Regards,


--
Masahiko Sawada
EDB:  https://www.enterprisedb.com/

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

24 January 2021, 10:26:33

Hi Masahiko:

On Fri, Jan 22, 2021 at 9:15 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi Andy,

On Mon, Dec 7, 2020 at 9:15 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
>
>
> On Mon, Dec 7, 2020 at 4:16 PM Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
>>
>> Hi,
>>
>> On 12/5/20 10:38 PM, Andy Fan wrote:
>> > Currently the UniqueKey is defined as a List of Expr, rather than
>> > EquivalenceClasses.
>> > A complete discussion until now can be found at [1] (The messages I replied
>> > to also
>> > care a lot and the information is completed). This patch has stopped at
>> > this place for
>> > a while, I'm planning to try EquivalenceClasses, but any suggestion would
>> > be welcome.
>> >
>> >
>>
>> Unfortunately I think we need a RfC style patch of both versions in
>> their minimum implementation.
>>
>> Hopefully this will make it easier for one or more committers to decide
>> on the right direction since they can do a side-by-side comparison of
>> the two solutions.
>>
>
> I do get the exact same idea. Actually I have made EquivalenceClasses
> works with baserel last weekend and then I realized it is hard to compare
> the 2 situations without looking into the real/Poc code, even for very
> experienced people. I will submit a new patch after I get the partitioned
> relation, subquery works. Hope I can make it in one week.

Status update for a commitfest entry.

Are you planning to submit a new patch? Or is there any blocker for
this work? This patch entry on CF app has been in state Waiting on
Author for a while. If there is any update on that, please reflect on
CF app.

I agree that the current status is "Waiting on author", and no block issue for others.

I plan to work on this in 1 month. I have to get my current urgent case completed first.

Sorry for the delay action and thanks for asking.

Best Regards

Andy Fan (https://www.aliyun.com/)

Re: [PATCH] Keeps tracking the uniqueness with UniqueKey

From

Andy Fan

Date:

11 February 2021, 02:05:07

On Sun, Jan 24, 2021 at 6:26 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:

Hi Masahiko:

On Fri, Jan 22, 2021 at 9:15 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
Hi Andy,

On Mon, Dec 7, 2020 at 9:15 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
>
>
> On Mon, Dec 7, 2020 at 4:16 PM Jesper Pedersen <jesper.pedersen@redhat.com> wrote:
>>
>> Hi,
>>
>> On 12/5/20 10:38 PM, Andy Fan wrote:
>> > Currently the UniqueKey is defined as a List of Expr, rather than
>> > EquivalenceClasses.
>> > A complete discussion until now can be found at [1] (The messages I replied
>> > to also
>> > care a lot and the information is completed). This patch has stopped at
>> > this place for
>> > a while, I'm planning to try EquivalenceClasses, but any suggestion would
>> > be welcome.
>> >
>> >
>>
>> Unfortunately I think we need a RfC style patch of both versions in
>> their minimum implementation.
>>
>> Hopefully this will make it easier for one or more committers to decide
>> on the right direction since they can do a side-by-side comparison of
>> the two solutions.
>>
>
> I do get the exact same idea. Actually I have made EquivalenceClasses
> works with baserel last weekend and then I realized it is hard to compare
> the 2 situations without looking into the real/Poc code, even for very
> experienced people. I will submit a new patch after I get the partitioned
> relation, subquery works. Hope I can make it in one week.

Status update for a commitfest entry.

Are you planning to submit a new patch? Or is there any blocker for
this work? This patch entry on CF app has been in state Waiting on
Author for a while. If there is any update on that, please reflect on
CF app.

I agree that the current status is "Waiting on author", and no block issue for others.
I plan to work on this in 1 month. I have to get my current urgent case completed first.
Sorry for the delay action and thanks for asking.

I'd start to continue this work today. At the same time, I will split the multi-patch series

into some dedicated small chunks for easier review. The first one is just for adding a

notnullattrs in RelOptInfo struct, in thread [1].

https://www.postgresql.org/message-id/flat/CAKU4AWpQjAqJwQ2X-aR9g3%2BZHRzU1k8hNP7A%2B_mLuOv-n5aVKA%40mail.gmail.com

Best Regards

Andy Fan (https://www.aliyun.com/)