Thread: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)

Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)

From
Andy Fan
Date:
Hi:

This patch is the first patch in UniqueKey patch series[1], since I need to revise 
that series many times but the first one would be not that often, so I'd like to
submit this one for review first so that I don't need to maintain it again and again.

v1-0001-Introduce-notnullattrs-field-in-RelOptInfo-to-ind.patch

Introduce notnullattrs field in RelOptInfo to indicate which attr are not null
in current query. The not null is judged by checking pg_attribute and query's
restrictinfo. The info is only maintained at Base RelOptInfo and Partition's
RelOptInfo level so far.


Any thoughts?

--
Best Regards

Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)

From
Andy Fan
Date:

On Wed, Feb 10, 2021 at 11:18 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:
Hi:

This patch is the first patch in UniqueKey patch series[1], since I need to revise 
that series many times but the first one would be not that often, so I'd like to
submit this one for review first so that I don't need to maintain it again and again.

v1-0001-Introduce-notnullattrs-field-in-RelOptInfo-to-ind.patch

Introduce notnullattrs field in RelOptInfo to indicate which attr are not null
in current query. The not null is judged by checking pg_attribute and query's
restrictinfo. The info is only maintained at Base RelOptInfo and Partition's
RelOptInfo level so far.


Any thoughts?

--
Best Regards

Add the missed patch..

--
Best Regards
Attachment

Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)

From
Ashutosh Bapat
Date:
Can this information be part of PathTarget structure and hence part of
RelOptInfo::reltarget, so that it can be extended to join, group and
other kinds of RelOptInfo in future? In fact, it might be easy to do
that in this patch itself.

On Wed, Feb 10, 2021 at 8:57 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
>
> On Wed, Feb 10, 2021 at 11:18 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>>
>> Hi:
>>
>> This patch is the first patch in UniqueKey patch series[1], since I need to revise
>> that series many times but the first one would be not that often, so I'd like to
>> submit this one for review first so that I don't need to maintain it again and again.
>>
>> v1-0001-Introduce-notnullattrs-field-in-RelOptInfo-to-ind.patch
>>
>> Introduce notnullattrs field in RelOptInfo to indicate which attr are not null
>> in current query. The not null is judged by checking pg_attribute and query's
>> restrictinfo. The info is only maintained at Base RelOptInfo and Partition's
>> RelOptInfo level so far.
>>
>>
>> Any thoughts?
>>
>> [1] https://www.postgresql.org/message-id/CAKU4AWr1BmbQB4F7j22G%2BNS4dNuem6dKaUf%2B1BK8me61uBgqqg%40mail.gmail.com
>> --
>> Best Regards
>> Andy Fan (https://www.aliyun.com/)
>
>
> Add the missed patch..
>
> --
> Best Regards
> Andy Fan (https://www.aliyun.com/)



-- 
Best Wishes,
Ashutosh Bapat



Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)

From
Tom Lane
Date:
Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> writes:
> Can this information be part of PathTarget structure and hence part of
> RelOptInfo::reltarget, so that it can be extended to join, group and
> other kinds of RelOptInfo in future?

Why would that be better than keeping it in RelOptInfo?

            regards, tom lane



Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)

From
David Rowley
Date:
On Wed, 10 Feb 2021 at 16:18, Andy Fan <zhihui.fan1213@gmail.com> wrote:
> v1-0001-Introduce-notnullattrs-field-in-RelOptInfo-to-ind.patch
>
> Introduce notnullattrs field in RelOptInfo to indicate which attr are not null
> in current query. The not null is judged by checking pg_attribute and query's
> restrictinfo. The info is only maintained at Base RelOptInfo and Partition's
> RelOptInfo level so far.
>
>
> Any thoughts?

I'm not that happy with what exactly the definition is of
RelOptInfo.notnullattrs.

The comment for the field says:
+ /* Not null attrs, start from -FirstLowInvalidHeapAttributeNumber */

So you could expect someone to assume that these are a Bitmapset of
attnums for all columns in the relation marked as NOT NULL.  However,
that's not true since you use find_nonnullable_vars() to chase down
quals that filter out NULL values and you mark those too.

The reason I don't really like this is that it really depends where
you want to use RelOptInfo.notnullattrs.  If someone wants to use it
to optimise something before the base quals are evaluated then they
might be unhappy that they found some NULLs.

I think you either need to explain in detail what the field means or
separate out the two meanings somehow.

David



Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)

From
Andy Fan
Date:
Thank you all, friends!

On Fri, Feb 12, 2021 at 9:02 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Wed, 10 Feb 2021 at 16:18, Andy Fan <zhihui.fan1213@gmail.com> wrote:
> v1-0001-Introduce-notnullattrs-field-in-RelOptInfo-to-ind.patch
>
> Introduce notnullattrs field in RelOptInfo to indicate which attr are not null
> in current query. The not null is judged by checking pg_attribute and query's
> restrictinfo. The info is only maintained at Base RelOptInfo and Partition's
> RelOptInfo level so far.
>
>
> Any thoughts?

I'm not that happy with what exactly the definition is of
RelOptInfo.notnullattrs.

The comment for the field says:
+ /* Not null attrs, start from -FirstLowInvalidHeapAttributeNumber */

So you could expect someone to assume that these are a Bitmapset of
attnums for all columns in the relation marked as NOT NULL.  However,
that's not true since you use find_nonnullable_vars() to chase down
quals that filter out NULL values and you mark those too.


The comment might be unclear,  but the behavior is on purpose. I want
to find more cases which can make the attr NOT NULL, all of them are
useful for UniqueKey stuff.  

 
The reason I don't really like this is that it really depends where
you want to use RelOptInfo.notnullattrs.  If someone wants to use it
to optimise something before the base quals are evaluated then they
might be unhappy that they found some NULLs.


Do you mean the notnullattrs is not set correctly before the base quals are
evaluated?  I think we have lots of data structures which are set just after some
stage.  but notnullattrs is special because it is set at more than 1 stage.  However
I'm doubtful it is unacceptable, Some fields ever change their meaning at different
stages like Var->varno.  If a user has a misunderstanding on it, it probably will find it
at the testing stage.  
 
I think you either need to explain in detail what the field means or
separate out the two meanings somehow.


Agreed.   Besides the not null comes from 2 places (metadata and quals), it also
means it is based on the relation, rather than the RelTarget.  for sample:  A is not
null,  but SELECT  return_null_udf(A)  FROM t,   return_null_udf is NULL.  I think
this is not well documented as well.  How about just change the documents as:

1.  /* Not null attrs, start from -FirstLowInvalidHeapAttributeNumber, the NOT NULL
      * comes from pg_attribtes and quals at different planning stages.
      */

or

2.  /* Not null attrs, start from -FirstLowInvalidHeapAttributeNumber, the NOT NULL
      * comes from pg_attribtes and quals at different planning stages. And it just means
      * the base attr rather than RelOptInfo->reltarget. 
      */

I don't like to separate them into 2 fields because it may make the usage harder a
bit as well. 
--
Best Regards

Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)

From
Andy Fan
Date:


On Thu, Feb 11, 2021 at 9:09 PM Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> wrote:
Can this information be part of PathTarget structure and hence part of
RelOptInfo::reltarget, so that it can be extended to join, group and
other kinds of RelOptInfo in future?

I think you want to expand this field in a more generic way.  For example:
SELECT udf(a)  FROM t WHERE  a is not null;   In current implementation, I only
knows a is not null, nothing about if udf(a) is null or not. And we can't present anything
for joinrel as well since it is just attno. 

At the same time,  looks we can't tell if UDF(A) is null  even if the UDF is strict and 
A is not null?  
 
In fact, it might be easy to do that in this patch itself.

Actually I can't think out the method:)  


On Wed, Feb 10, 2021 at 8:57 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
>
> On Wed, Feb 10, 2021 at 11:18 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>>
>> Hi:
>>
>> This patch is the first patch in UniqueKey patch series[1], since I need to revise
>> that series many times but the first one would be not that often, so I'd like to
>> submit this one for review first so that I don't need to maintain it again and again.
>>
>> v1-0001-Introduce-notnullattrs-field-in-RelOptInfo-to-ind.patch
>>
>> Introduce notnullattrs field in RelOptInfo to indicate which attr are not null
>> in current query. The not null is judged by checking pg_attribute and query's
>> restrictinfo. The info is only maintained at Base RelOptInfo and Partition's
>> RelOptInfo level so far.
>>
>>
>> Any thoughts?
>>
>> [1] https://www.postgresql.org/message-id/CAKU4AWr1BmbQB4F7j22G%2BNS4dNuem6dKaUf%2B1BK8me61uBgqqg%40mail.gmail.com
>> --
>> Best Regards
>> Andy Fan (https://www.aliyun.com/)
>
>
> Add the missed patch..
>
> --
> Best Regards
> Andy Fan (https://www.aliyun.com/)



--
Best Wishes,
Ashutosh Bapat


--
Best Regards

Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)

From
Ashutosh Bapat
Date:
On Thu, Feb 11, 2021 at 8:22 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> writes:
> > Can this information be part of PathTarget structure and hence part of
> > RelOptInfo::reltarget, so that it can be extended to join, group and
> > other kinds of RelOptInfo in future?
>
> Why would that be better than keeping it in RelOptInfo?
>
>                         regards, tom lane

We have all the expressions relevant to a given relation (simple,
join, group whatever) in Pathtarget. We could remember notnullness of
attributes of a simple relation in RelOptInfo. But IMO non/nullness of
the TLEs of a relation is more useful that attributes and thus
associate those in the PathTarget which is used to produce TLEs. That
way we could use this infra in more general ways.

-- 
Best Wishes,
Ashutosh Bapat



Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)

From
Tom Lane
Date:
Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> writes:
> On Thu, Feb 11, 2021 at 8:22 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Ashutosh Bapat <ashutosh.bapat.oss@gmail.com> writes:
>>> Can this information be part of PathTarget structure and hence part of
>>> RelOptInfo::reltarget, so that it can be extended to join, group and
>>> other kinds of RelOptInfo in future?

>> Why would that be better than keeping it in RelOptInfo?

> We have all the expressions relevant to a given relation (simple,
> join, group whatever) in Pathtarget. We could remember notnullness of
> attributes of a simple relation in RelOptInfo. But IMO non/nullness of
> the TLEs of a relation is more useful that attributes and thus
> associate those in the PathTarget which is used to produce TLEs. That
> way we could use this infra in more general ways.

That argument seems nearly vacuous to me, because for pretty much any
expression that isn't a base-relation Var, the answer will have to be
"don't know".  Meanwhile, there are clear costs to keeping such info
in PathTarget, namely having to copy it around.  Another issue with
keeping it in PathTarget is that I'm not convinced it'll be readily
available where you need it: most places that would be interested in
making such proofs are only looking at expression trees.

Now there is one angle that *might* become easier if the info were in
PathTarget, namely that it could be simpler and more reliable to mark
nullable output columns of an outer join as being nullable (even if
they came from not-null base columns).  However, as I've muttered
about elsewhere, I'd prefer to deal with that can of worms by altering
the representation of the Vars themselves.  Again, if you're looking
at a WHERE clause, it's not real clear how you would find a relevant
PathTarget.

            regards, tom lane



Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)

From
David Rowley
Date:
On Fri, 12 Feb 2021 at 15:18, Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
> On Fri, Feb 12, 2021 at 9:02 AM David Rowley <dgrowleyml@gmail.com> wrote:
>> The reason I don't really like this is that it really depends where
>> you want to use RelOptInfo.notnullattrs.  If someone wants to use it
>> to optimise something before the base quals are evaluated then they
>> might be unhappy that they found some NULLs.
>>
>
> Do you mean the notnullattrs is not set correctly before the base quals are
> evaluated?  I think we have lots of data structures which are set just after some
> stage.  but notnullattrs is special because it is set at more than 1 stage.  However
> I'm doubtful it is unacceptable, Some fields ever change their meaning at different
> stages like Var->varno.  If a user has a misunderstanding on it, it probably will find it
> at the testing stage.

You're maybe focusing too much on your use case for notnullattrs. It
only cares about NULLs in the result for each query level.

.... thinks of an example...

OK, let's say I decided that COUNT(*) is faster than COUNT(id) so
decided that I might like to write a patch which rewrite the query to
use COUNT(*) when it was certain that "id" could not contain NULLs.

The query is:

SELECT p.partid, p.partdesc,COUNT(s.saleid) FROM part p LEFT OUTER
JOIN sales s ON p.partid = s.partid GROUP BY p.partid;

sale.saleid is marked as NOT NULL in pg_attribute.  As the writer of
the patch, I checked the comment for notnullattrs and it says "Not
null attrs, start from -FirstLowInvalidHeapAttributeNumber", so I
should be ok to assume since sales.saleid is marked in notnullattrs
that I can rewrite the query?!

The documentation about the RelOptInfo.notnullattrs needs to be clear
what exactly it means. I'm not saying your representation of how to
record NOT NULL in incorrect. I'm saying that you need to be clear
what exactly is being recorded in that field.

If you want it to mean "attribute marked here cannot output NULL
values at this query level", then you should say something along those
lines.

However, having said that, because this is a Bitmapset of
pg_attribute.attnums, it's only possible to record Vars from base
relations.  It does not seem like you have any means to record
attributes that are normally NULLable, but cannot produce NULL values
due to a strict join qual.

e.g: SELECT t.nullable FROM t INNER JOIN j ON t.nullable = j.something;

I'd expect the RelOptInfo for t not to contain a bit for the
"nullable" column, but there's no way to record the fact that the join
RelOptInfo for {t,j} cannot produce a NULL for that column. It might
be quite useful to know that for the UniqueKeys patch.

I know there's another discussion here between Ashutosh and Tom about
PathTarget's and Vars.   I had the Var idea too once myself [1] but it
was quickly shot down.  Tom's reasoning there in [1] seems legit.  I
guess we'd need some sort of planner version of Var and never confuse
it with the Parse version of Var.  That sounds like quite a big
project which would have quite a lot of code churn. I'm not sure how
acceptable it would be to have Var represent both these things.  It
gets complex when you do equal(var1, var2) and expect that to return
true when everything matches apart from the notnull field. We
currently have this issue with the "location" field and we even have a
special macro which just ignores those in equalfuncs.c.  I imagine not
many people would like to expand that to other fields.

It would be good to agree on the correct representation for Vars that
cannot produce NULLs so that we don't shut the door on classes of
optimisation that require something other than what you need for your
case.

David

[1] https://www.postgresql.org/message-id/flat/14678.1401639369%40sss.pgh.pa.us#d726d397f86755b64bb09d0c487f975f



Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)

From
Andy Fan
Date:


On Tue, Feb 16, 2021 at 12:01 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 12 Feb 2021 at 15:18, Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
> On Fri, Feb 12, 2021 at 9:02 AM David Rowley <dgrowleyml@gmail.com> wrote:
>> The reason I don't really like this is that it really depends where
>> you want to use RelOptInfo.notnullattrs.  If someone wants to use it
>> to optimise something before the base quals are evaluated then they
>> might be unhappy that they found some NULLs.
>>
>
> Do you mean the notnullattrs is not set correctly before the base quals are
> evaluated?  I think we have lots of data structures which are set just after some
> stage.  but notnullattrs is special because it is set at more than 1 stage.  However
> I'm doubtful it is unacceptable, Some fields ever change their meaning at different
> stages like Var->varno.  If a user has a misunderstanding on it, it probably will find it
> at the testing stage.

You're maybe focusing too much on your use case for notnullattrs. It
only cares about NULLs in the result for each query level.

.... thinks of an example...

OK, let's say I decided that COUNT(*) is faster than COUNT(id) so
decided that I might like to write a patch which rewrite the query to
use COUNT(*) when it was certain that "id" could not contain NULLs.

The query is:

SELECT p.partid, p.partdesc,COUNT(s.saleid) FROM part p LEFT OUTER
JOIN sales s ON p.partid = s.partid GROUP BY p.partid;

sale.saleid is marked as NOT NULL in pg_attribute.  As the writer of
the patch, I checked the comment for notnullattrs and it says "Not
null attrs, start from -FirstLowInvalidHeapAttributeNumber", so I
should be ok to assume since sales.saleid is marked in notnullattrs
that I can rewrite the query?!

The documentation about the RelOptInfo.notnullattrs needs to be clear
what exactly it means. I'm not saying your representation of how to
record NOT NULL in incorrect. I'm saying that you need to be clear
what exactly is being recorded in that field.

If you want it to mean "attribute marked here cannot output NULL
values at this query level", then you should say something along those
lines.

I think I get what you mean. You are saying notnullattrs is only correct
at the *current* stage, namely set_rel_size.  It would not be true after
that,  but the data is still there.  That would cause some confusion.  I admit
that is something I didn't realize before.  I checked other fields of RelOptInfo,
looks no one filed works like this, so I am not really happy with this design
now.  I'm OK with saying more things along these lines.  That can be done
we all understand each other well.  Any better design is welcome as well. 
I think the  "Var represents null stuff" is good, until I see your comments below. 



However, having said that, because this is a Bitmapset of
pg_attribute.attnums, it's only possible to record Vars from base
relations.  It does not seem like you have any means to record
attributes that are normally NULLable, but cannot produce NULL values
due to a strict join qual.

e.g: SELECT t.nullable FROM t INNER JOIN j ON t.nullable = j.something;

I'd expect the RelOptInfo for t not to contain a bit for the
"nullable" column, but there's no way to record the fact that the join
RelOptInfo for {t,j} cannot produce a NULL for that column. It might
be quite useful to know that for the UniqueKeys patch.


The current patch can detect t.nullable is not null correctly. That
is done by find_nonnullable_vars(qual) and deconstruct_recure stage. 
 
I know there's another discussion here between Ashutosh and Tom about
PathTarget's and Vars.   I had the Var idea too once myself [1] but it
was quickly shot down.  Tom's reasoning there in [1] seems legit.  I
guess we'd need some sort of planner version of Var and never confuse
it with the Parse version of Var.  That sounds like quite a big
project which would have quite a lot of code churn. I'm not sure how
acceptable it would be to have Var represent both these things.  It
gets complex when you do equal(var1, var2) and expect that to return
true when everything matches apart from the notnull field. We
currently have this issue with the "location" field and we even have a
special macro which just ignores those in equalfuncs.c.  I imagine not
many people would like to expand that to other fields.


Thanks for sharing this. 
 
It would be good to agree on the correct representation for Vars that
cannot produce NULLs so that we don't shut the door on classes of
optimisation that require something other than what you need for your
case.


Agreed.   The simplest way is just adding some comments.  If go a
step further, how about just reset the notnullattrs when it is nullable
later like outer join?  I have added this logic in the attached patch.
(comment for the notnullattrs is still not touched).  I think we only
need to handle this in build_join_rel stage.  With the v2 commit 2,
notnullattrs might be unset too early, but if the value is there, then
it is correct. 

--
Best Regards
Attachment

Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)

From
Andy Fan
Date:


On Tue, Feb 16, 2021 at 10:03 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:


On Tue, Feb 16, 2021 at 12:01 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 12 Feb 2021 at 15:18, Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
> On Fri, Feb 12, 2021 at 9:02 AM David Rowley <dgrowleyml@gmail.com> wrote:
>> The reason I don't really like this is that it really depends where
>> you want to use RelOptInfo.notnullattrs.  If someone wants to use it
>> to optimise something before the base quals are evaluated then they
>> might be unhappy that they found some NULLs.
>>
>
> Do you mean the notnullattrs is not set correctly before the base quals are
> evaluated?  I think we have lots of data structures which are set just after some
> stage.  but notnullattrs is special because it is set at more than 1 stage.  However
> I'm doubtful it is unacceptable, Some fields ever change their meaning at different
> stages like Var->varno.  If a user has a misunderstanding on it, it probably will find it
> at the testing stage.

You're maybe focusing too much on your use case for notnullattrs. It
only cares about NULLs in the result for each query level.

.... thinks of an example...

OK, let's say I decided that COUNT(*) is faster than COUNT(id) so
decided that I might like to write a patch which rewrite the query to
use COUNT(*) when it was certain that "id" could not contain NULLs.

The query is:

SELECT p.partid, p.partdesc,COUNT(s.saleid) FROM part p LEFT OUTER
JOIN sales s ON p.partid = s.partid GROUP BY p.partid;

sale.saleid is marked as NOT NULL in pg_attribute.  As the writer of
the patch, I checked the comment for notnullattrs and it says "Not
null attrs, start from -FirstLowInvalidHeapAttributeNumber", so I
should be ok to assume since sales.saleid is marked in notnullattrs
that I can rewrite the query?!

The documentation about the RelOptInfo.notnullattrs needs to be clear
what exactly it means. I'm not saying your representation of how to
record NOT NULL in incorrect. I'm saying that you need to be clear
what exactly is being recorded in that field.

If you want it to mean "attribute marked here cannot output NULL
values at this query level", then you should say something along those
lines.

I think I get what you mean. You are saying notnullattrs is only correct
at the *current* stage, namely set_rel_size.  It would not be true after
that,  but the data is still there.  That would cause some confusion.  I admit
that is something I didn't realize before.  I checked other fields of RelOptInfo,
looks no one filed works like this, so I am not really happy with this design
now.  I'm OK with saying more things along these lines.  That can be done
we all understand each other well.  Any better design is welcome as well. 
I think the  "Var represents null stuff" is good, until I see your comments below. 



However, having said that, because this is a Bitmapset of
pg_attribute.attnums, it's only possible to record Vars from base
relations.  It does not seem like you have any means to record
attributes that are normally NULLable, but cannot produce NULL values
due to a strict join qual.

e.g: SELECT t.nullable FROM t INNER JOIN j ON t.nullable = j.something;

I'd expect the RelOptInfo for t not to contain a bit for the
"nullable" column, but there's no way to record the fact that the join
RelOptInfo for {t,j} cannot produce a NULL for that column. It might
be quite useful to know that for the UniqueKeys patch.


The current patch can detect t.nullable is not null correctly. That
is done by find_nonnullable_vars(qual) and deconstruct_recure stage. 
 
I know there's another discussion here between Ashutosh and Tom about
PathTarget's and Vars.   I had the Var idea too once myself [1] but it
was quickly shot down.  Tom's reasoning there in [1] seems legit.  I
guess we'd need some sort of planner version of Var and never confuse
it with the Parse version of Var.  That sounds like quite a big
project which would have quite a lot of code churn. I'm not sure how
acceptable it would be to have Var represent both these things.  It
gets complex when you do equal(var1, var2) and expect that to return
true when everything matches apart from the notnull field. We
currently have this issue with the "location" field and we even have a
special macro which just ignores those in equalfuncs.c.  I imagine not
many people would like to expand that to other fields.


Thanks for sharing this. 
 
It would be good to agree on the correct representation for Vars that
cannot produce NULLs so that we don't shut the door on classes of
optimisation that require something other than what you need for your
case.


Agreed.   The simplest way is just adding some comments.  If go a
step further, how about just reset the notnullattrs when it is nullable
later like outer join?  I have added this logic in the attached patch.
(comment for the notnullattrs is still not touched).  I think we only
need to handle this in build_join_rel stage. 

 .. 
With the v2 commit 2,
notnullattrs might be unset too early, but if the value is there, then
it is correct. 


This looks bad as well.  How about adding an extra field in RelOptInfo for the
outer join case.  For example:

@@ -710,8 +710,14 @@ typedef struct RelOptInfo
        PlannerInfo *subroot;           /* if subquery */
        List       *subplan_params; /* if subquery */
        int                     rel_parallel_workers;   /* wanted number of parallel workers */
-       /* Not null attrs, start from -FirstLowInvalidHeapAttributeNumber */
+       /*
+        * Not null attrs, start from -FirstLowInvalidHeapAttributeNumber. The nullness
+        * might be changed after outer join, So we need to consult with leftouter_relids
+        * before using it.
+        */
        Bitmapset               *notnullattrs;
+       /* A list of Relids which will be a outer rel when join with this relation. */
+       List    *leftouter_relids;

        /* Information about foreign tables and foreign joins */
        Oid                     serverid;               /* identifies server for the table or join */

leftout_relids should be able to be filled with root->join_info_list.  If we go with this
direction,  not sure leftouter_relids should be a List or not since I even can't think
out a query which can have more than one relids for a relation. 

--
Best Regards

Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)

From
Andy Fan
Date:


On Tue, Feb 16, 2021 at 12:01 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 12 Feb 2021 at 15:18, Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
> On Fri, Feb 12, 2021 at 9:02 AM David Rowley <dgrowleyml@gmail.com> wrote:
>> The reason I don't really like this is that it really depends where
>> you want to use RelOptInfo.notnullattrs.  If someone wants to use it
>> to optimise something before the base quals are evaluated then they
>> might be unhappy that they found some NULLs.
>>
>
> Do you mean the notnullattrs is not set correctly before the base quals are
> evaluated?  I think we have lots of data structures which are set just after some
> stage.  but notnullattrs is special because it is set at more than 1 stage.  However
> I'm doubtful it is unacceptable, Some fields ever change their meaning at different
> stages like Var->varno.  If a user has a misunderstanding on it, it probably will find it
> at the testing stage.

You're maybe focusing too much on your use case for notnullattrs. It
only cares about NULLs in the result for each query level.

.... thinks of an example...

OK, let's say I decided that COUNT(*) is faster than COUNT(id) so
decided that I might like to write a patch which rewrite the query to
use COUNT(*) when it was certain that "id" could not contain NULLs.

The query is:

SELECT p.partid, p.partdesc,COUNT(s.saleid) FROM part p LEFT OUTER
JOIN sales s ON p.partid = s.partid GROUP BY p.partid;

sale.saleid is marked as NOT NULL in pg_attribute.  As the writer of
the patch, I checked the comment for notnullattrs and it says "Not
null attrs, start from -FirstLowInvalidHeapAttributeNumber", so I
should be ok to assume since sales.saleid is marked in notnullattrs
that I can rewrite the query?!

The documentation about the RelOptInfo.notnullattrs needs to be clear
what exactly it means. I'm not saying your representation of how to
record NOT NULL in incorrect. I'm saying that you need to be clear
what exactly is being recorded in that field.

If you want it to mean "attribute marked here cannot output NULL
values at this query level", then you should say something along those
lines.

However, having said that, because this is a Bitmapset of
pg_attribute.attnums, it's only possible to record Vars from base
relations.  It does not seem like you have any means to record
attributes that are normally NULLable, but cannot produce NULL values
due to a strict join qual.

e.g: SELECT t.nullable FROM t INNER JOIN j ON t.nullable = j.something;

I'd expect the RelOptInfo for t not to contain a bit for the
"nullable" column, but there's no way to record the fact that the join
RelOptInfo for {t,j} cannot produce a NULL for that column. It might
be quite useful to know that for the UniqueKeys patch.

I checked again and found I do miss the check on JoinExpr->quals.  I have
fixed it in v3 patch. Thanks for the review! 

In the attached v3,  commit 1 is the real patch, and commit 2 is just add 
some logs to help local testing.  notnull.sql/notnull.out is the test case for 
this patch, both commit 2 and notnull.* are not intended to be committed
at last.

Besides the above fix in v3, I changed the comments alongs the notnullattrs
as below and added a true positive helper function is_var_nullable. 

+       /*
+        * Not null attrs, the values are calculated by looking into pg_attribute and quals
+        * However both cases are not reliable in some outer join cases. So when
+        * we want to check if a Var is nullable, function is_var_nullable is a good
+        * place to start with, which is true positive.
+        */
+       Bitmapset               *notnullattrs;

I also found separating the two meanings is unnecessary since both of them
are not reliable in the outer join case and we are just wanting which attr is
nullable, no match how we know it. The below example shows why not-null-by-qual
is not readable as well (at least with current implementation) 

create table n1(a int, b int not null);
create table n2(a int, b int not null);
create table n3(a int, b int not null);

select * from n1 left join n2 on n1.a = n2.a full join n3 on n2.a = n3.a;

In this case, when we check (n1 left join n2 on n1.a = n2.a) ,  we know
n1.a is not nullable.  However after full join with n3, it changed. 


I know there's another discussion here between Ashutosh and Tom about
PathTarget's and Vars.   I had the Var idea too once myself [1] but it
was quickly shot down.  Tom's reasoning there in [1] seems legit.  I
guess we'd need some sort of planner version of Var and never confuse
it with the Parse version of Var.  That sounds like quite a big
project which would have quite a lot of code churn. I'm not sure how
acceptable it would be to have Var represent both these things.  It
gets complex when you do equal(var1, var2) and expect that to return
true when everything matches apart from the notnull field. We
currently have this issue with the "location" field and we even have a
special macro which just ignores those in equalfuncs.c.  I imagine not
many people would like to expand that to other fields.

It would be good to agree on the correct representation for Vars that
cannot produce NULLs so that we don't shut the door on classes of
optimisation that require something other than what you need for your
case.

David

[1] https://www.postgresql.org/message-id/flat/14678.1401639369%40sss.pgh.pa.us#d726d397f86755b64bb09d0c487f975f


--
Best Regards
Attachment


On Tue, Feb 16, 2021 at 12:01 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 12 Feb 2021 at 15:18, Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
> On Fri, Feb 12, 2021 at 9:02 AM David Rowley <dgrowleyml@gmail.com> wrote:
>> The reason I don't really like this is that it really depends where
>> you want to use RelOptInfo.notnullattrs.  If someone wants to use it
>> to optimise something before the base quals are evaluated then they
>> might be unhappy that they found some NULLs.
>>
>
> Do you mean the notnullattrs is not set correctly before the base quals are
> evaluated?  I think we have lots of data structures which are set just after some
> stage.  but notnullattrs is special because it is set at more than 1 stage.  However
> I'm doubtful it is unacceptable, Some fields ever change their meaning at different
> stages like Var->varno.  If a user has a misunderstanding on it, it probably will find it
> at the testing stage.

You're maybe focusing too much on your use case for notnullattrs. It
only cares about NULLs in the result for each query level.

.... thinks of an example...

OK, let's say I decided that COUNT(*) is faster than COUNT(id) so
decided that I might like to write a patch which rewrite the query to
use COUNT(*) when it was certain that "id" could not contain NULLs.

The query is:

SELECT p.partid, p.partdesc,COUNT(s.saleid) FROM part p LEFT OUTER
JOIN sales s ON p.partid = s.partid GROUP BY p.partid;

sale.saleid is marked as NOT NULL in pg_attribute.  As the writer of
the patch, I checked the comment for notnullattrs and it says "Not
null attrs, start from -FirstLowInvalidHeapAttributeNumber", so I
should be ok to assume since sales.saleid is marked in notnullattrs
that I can rewrite the query?!

The documentation about the RelOptInfo.notnullattrs needs to be clear
what exactly it means. I'm not saying your representation of how to
record NOT NULL in incorrect. I'm saying that you need to be clear
what exactly is being recorded in that field.

If you want it to mean "attribute marked here cannot output NULL
values at this query level", then you should say something along those
lines.

However, having said that, because this is a Bitmapset of
pg_attribute.attnums, it's only possible to record Vars from base
relations.  It does not seem like you have any means to record
attributes that are normally NULLable, but cannot produce NULL values
due to a strict join qual.

e.g: SELECT t.nullable FROM t INNER JOIN j ON t.nullable = j.something;

I'd expect the RelOptInfo for t not to contain a bit for the
"nullable" column, but there's no way to record the fact that the join
RelOptInfo for {t,j} cannot produce a NULL for that column. It might
be quite useful to know that for the UniqueKeys patch.


I read your comments again and find I miss your point before. So I'd summarize
the my current understanding to make sure we are in the same place for further
working.

I want to define a notnullattrs on RelOptInfo struct. The not nullable
may comes from catalog definition or quals on the given query. For example:

CREATE TABLE t(a INT NOT NULL, nullable INT);
SELECT * FROM t; ==>  a is not null for sure by definition.
SELECT * FROM t WHERE nullable > 3; ==> nullable is not null as well by qual.

However the thing becomes complex with the below 2 cases.

1. SELECT * FROM t INNER JOIN j on t.nullable = q.b;
We know t.b will not be null **finally**. But the current plan may something
like this:

                QUERY PLAN
------------------------------------------
 Merge Join
   Merge Cond: (t.nullable = j.something)
   ->  Sort
         Sort Key: t.nullable
         ->  Seq Scan on t
   ->  Sort
         Sort Key: j.something
         ->  Seq Scan on j
(8 rows)

which means the Path "Seq Scan on t" still contains some null values. At least,
we should not assume t.nullable is "not nullable" the base relation stage.

2. SELECT t.a FROM j LEFT JOIN t ON t.b = t.a;
Even the t.a is not null by definition, but it may have null **finally** due to
the outer join.

My current patch doesn't handle the 2 cases well since t.nullable is marked as
NOT NULL for both cases.
 
I know there's another discussion here between Ashutosh and Tom about
PathTarget's and Vars.   I had the Var idea too once myself [1] but it
was quickly shot down.  Tom's reasoning there in [1] seems legit.  I
guess we'd need some sort of planner version of Var and never confuse
it with the Parse version of Var.  That sounds like quite a big
project which would have quite a lot of code churn. I'm not sure how
acceptable it would be to have Var represent both these things.  It
gets complex when you do equal(var1, var2) and expect that to return
true when everything matches apart from the notnull field. We
currently have this issue with the "location" field and we even have a
special macro which just ignores those in equalfuncs.c.  I imagine not
many people would like to expand that to other fields.

It would be good to agree on the correct representation for Vars that
cannot produce NULLs so that we don't shut the door on classes of
optimisation that require something other than what you need for your
case.


Looks we have to maintain not null on the general RelOptInfo level rather than Base
RelOptInfo.  But I don't want to teach Var about the notnull so far.  The reasons are: 1).
We need to maintain the Planner version and Parser version due to the VIEW case. 
2). We have to ignore the extra part  for equal(Var, Var) . 3). Var is usually shared among
different RelOptInfo. which means we have to maintain different copies for this purpose IIUC.

I assume we want to know if a Var is nullable with a function like. 
is_var_notnullable(Var *var,  Relids relids).  If so, we can define the data as below:

struct RelOptInfo {

Bitmapset** notnullattrs;
..
}; 

After this we can implement the function as: 

bool
is_var_notnullable(Var* var, Relids relids)
{
  RelOptInfo *rel = find_rel_by_relids(reldis);
  return bms_is_member(var->varattno, rel->notnullattrs[var->varno]);
}

Probably we can make some hackers to reduce the notnullattrs's memory usage
overhead.

Any thoughts?

 
David

[1] https://www.postgresql.org/message-id/flat/14678.1401639369%40sss.pgh.pa.us#d726d397f86755b64bb09d0c487f975f


--
Best Regards


I assume we want to know if a Var is nullable with a function like. 
is_var_notnullable(Var *var,  Relids relids).  If so, we can define the data as below:

struct RelOptInfo {

Bitmapset** notnullattrs;
..
}; 

After this we can implement the function as: 

bool
is_var_notnullable(Var* var, Relids relids)
{
  RelOptInfo *rel = find_rel_by_relids(reldis);
  return bms_is_member(var->varattno, rel->notnullattrs[var->varno]);
}

Probably we can make some hackers to reduce the notnullattrs's memory usage
overhead.


To be more precise,  to make the rel->notnullattrs shorter, we can do the following methods:
1). Relids only has single element, we can always use a 1-len array rather than rel->varno
elements. 2).  For multi-elements relids, we use the max(varno) as the length of rel->notnullattrs.
3). For some cases,  the notnullattrs of a baserel is not changed in later stages, we can just
reuse the same Bitmapset * in later stages.   

--
Best Regards
However the thing becomes complex with the below 2 cases.

1. SELECT * FROM t INNER JOIN j on t.nullable = q.b;
We know t.b will be not null **finally**. But the current plan may something
like this:

                QUERY PLAN
------------------------------------------
 Merge Join
   Merge Cond: (t.nullable = j.something)
   ->  Sort
         Sort Key: t.nullable
         ->  Seq Scan on t
   ->  Sort
         Sort Key: j.something
         ->  Seq Scan on j
(8 rows)

which means the Path "Seq Scan on t" still contains some null values. At least,
we should not assume t.nullable is "not nullable" the base relation stage.

2. SELECT t.a FROM j LEFT JOIN t ON t.b = t.a;
Even the t.a is not null by definition, but it may have null **finally** due to
the outer join.

The above 2 cases have been addressed by defining the notnullattrs on
every RelOptInfo, and maintaining them on every join. However,  per offline
discussion with David, IIUC,  there is a more case to think about. 

CREATE TABLE t (a INT, b INT);
SELECT * FROM t WHERE a = 1 and b = 2;

We know b is not null after we evaluate the qual b = 2,  but it may still
nullable when we just evaluate a = 1;  

I prefer to not handle it by saying the semantics of notnullattrs is correct
after we evaluate all the quals on its RelOptInfo. 

 
It would be good to agree on the correct representation for Vars that
cannot produce NULLs so that we don't shut the door on classes of
optimisation that require something other than what you need for your
case.


Looks we have to maintain not null on the general RelOptInfo level rather than Base
RelOptInfo.  But I don't want to teach Var about the notnull so far.  The reasons are: 1).
We need to maintain the Planner version and Parser version due to the VIEW case. 
2). We have to ignore the extra part  for equal(Var, Var) . 3). Var is usually shared among
different RelOptInfo. which means we have to maintain different copies for this purpose IIUC.

I assume we want to know if a Var is nullable with a function like. 
is_var_notnullable(Var *var,  Relids relids).  If so, we can define the data as below:

struct RelOptInfo {

Bitmapset** notnullattrs;
..
}; 

After this we can implement the function as: 

/*
 * is_var_notnullable
 *   Check if the var is nullable for a given RelOptIno after
 * all the quals on it have been evaluated. 
 * 
 * var is the var to check,  relids is the ids of a RelOptInfo
 * we will check on. 
 */
bool
is_var_notnullable(Var* var, Relids relids)
{
  RelOptInfo *rel = find_rel_by_relids(reldis);
  return bms_is_member(var->varattno, rel->notnullattrs[var->varno]);
}

Do you think this is a reasonable solution? 

 
bool
is_var_notnullable(Var* var, Relids relids)
{
  RelOptInfo *rel = find_rel_by_relids(reldis);
  return bms_is_member(var->varattno, rel->notnullattrs[var->varno]);
}

Probably we can make some hackers to reduce the notnullattrs's memory usage
overhead.

--
Best Regards
Hi:
 I'd start to work on UniqueKey again, it would be great that we can target it
 to PG 15. The attached patch is just for the notnull_attrs. Since we can't say
 a column is nullable or not without saying in which resultset, so I think attaching
it to RelOptInfo is unavoidable. Here is how my patch works.

@@ -686,6 +686,12 @@ typedef struct RelOptInfo
  /* default result targetlist for Paths scanning this relation */
  struct PathTarget *reltarget; /* list of Vars/Exprs, cost, width */

+ Bitmapset **notnull_attrs; /* The attno which is not null after evaluating
+  * all the quals on this relation, for baserel,
+  * the len would always 1. and for others the array
+  * index is relid from relids.
+  */
+

For baserel, it records the notnull attrs as a bitmapset and stores it to
RelOptInfo->notnull_attrs[0].  As for the joinrel, suppose the relids is {1,3,
5}, then the notnull_attrs[1/3/5] will be used to store notnull_attrs Bitmapset
for relation 1,3,5 separately. I don't handle this stuff for all kinds of upper
relation and subquery so far since UniqueKey doesn't rely on it and looks
more stuff should be handled there.

The patch also included some debug messages in set_baserel/joinrel_notnullattrs
and attached the test.sql for easier review. Any feedback is welcome and hope 
this implementation would not block UniqueKey stuff.

Attachment

Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)

From
David Rowley
Date:
On Sun, 4 Jul 2021 at 02:08, Andy Fan <zhihui.fan1213@gmail.com> wrote:
>  I'd start to work on UniqueKey again, it would be great that we can target it
>  to PG 15. The attached patch is just for the notnull_attrs. Since we can't say
>  a column is nullable or not without saying in which resultset, so I think attaching
> it to RelOptInfo is unavoidable. Here is how my patch works.

I'd also like to see this work progress for PG15.  My current thoughts
are that Tom as mentioned another way to track nullability inside Var.
It would be a fairly big task to do that.

Tom, I'm wondering if you might get a chance to draw up a design for
what you've got in mind with this?  I assume adding a new field in
Var, but I'm drawing a few blanks on how things might work for equal()
when one Var has the field set and another does not.

David



David Rowley <dgrowleyml@gmail.com> writes:
> Tom, I'm wondering if you might get a chance to draw up a design for
> what you've got in mind with this?  I assume adding a new field in
> Var, but I'm drawing a few blanks on how things might work for equal()
> when one Var has the field set and another does not.

As I said before, it hasn't progressed much past the handwaving stage,
but it does seem like it's time to get it done.  I doubt I'll have any
cycles for it during the commitfest, but maybe I can devote a block of
time during August.

            regards, tom lane





On Tue, Jul 6, 2021 at 9:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
David Rowley <dgrowleyml@gmail.com> writes:
> Tom, I'm wondering if you might get a chance to draw up a design for
> what you've got in mind with this?  I assume adding a new field in
> Var, but I'm drawing a few blanks on how things might work for equal()
> when one Var has the field set and another does not.

As I said before, it hasn't progressed much past the handwaving stage,
but it does seem like it's time to get it done.  I doubt I'll have any
cycles for it during the commitfest, but maybe I can devote a block of
time during August.

                        regards, tom lane

Looking forward to watching this change closely, thank you both David and Tom!
But I still don't understand what the faults my way have , do you mind telling the
details? 

--
Best Regards

Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)

From
David Rowley
Date:
On Wed, 7 Jul 2021 at 13:04, Andy Fan <zhihui.fan1213@gmail.com> wrote:
> Looking forward to watching this change closely, thank you both David and Tom!
> But I still don't understand what the faults my way have , do you mind telling the
> details?

The problem is that we don't need 6 different ways to determine if a
Var can be NULL or not.  You're proposing to add a method using
Bitmapsets and Tom has some proposing ideas around tracking
nullability in Vars.  We don't need both.

It seems to me that having it in Var allows us to have a much finer
gradient about where exactly a Var can be NULL.

For example: SELECT nullablecol FROM tab WHERE nullablecol = <value>;

If the equality operator is strict then the nullablecol can be NULL in
the WHERE clause but not in the SELECT list. Tom's idea should allow
us to determine both of those things but your idea cannot tell them
apart, so, in theory at least, Tom's idea seems better to me.

David





For example: SELECT nullablecol FROM tab WHERE nullablecol = <value>;

If the equality operator is strict then the nullablecol can be NULL in
the WHERE clause but not in the SELECT list. Tom's idea should allow
us to determine both of those things but your idea cannot tell them
apart, so, in theory at least, Tom's idea seems better to me.

David

That's really something I can't do,  thanks for the explanation. 

--
Best Regards
On Tue, Jul 6, 2021 at 5:34 PM David Rowley <dgrowleyml@gmail.com> wrote:
>
> On Sun, 4 Jul 2021 at 02:08, Andy Fan <zhihui.fan1213@gmail.com> wrote:
> >  I'd start to work on UniqueKey again, it would be great that we can target it
> >  to PG 15. The attached patch is just for the notnull_attrs. Since we can't say
> >  a column is nullable or not without saying in which resultset, So I think attaching
> > it to RelOptInfo is unavoidable. Here is how my patch works.
>
> I'd also like to see this work progress for PG15.


Thank you David!

I am re-designing/implementing the UniqueKey, but it is better to have
a design review as soon as possible. This writing is for that. To make the
review easier, I also uploaded my in-completed patch (Correct, runable
with testcase).

Main changes are:
1. Use EC instead of expr, to cover more UniqueKey case.
2. Redesign the UniqueKey as below:

@@ -246,6 +246,7 @@ struct PlannerInfo
* subquery outputs */

List   *eq_classes; /* list of active EquivalenceClasses */
+ List   *unique_exprs; /* List of unique expr */

  bool ec_merging_done; /* set true once ECs are canonical */

+typedef struct UniqueKey
+{
+ NodeTag type;
+ Bitmapset *unique_expr_indexes;
+ bool multi_nulls;
+} UniqueKey;
+

PlannerInfo.unique_exprs is a List of unique exprs.  Unique Exprs is a set of
EquivalenceClass. for example:

CREATE TABLE T1(A INT NOT NULL, B INT NOT NULL, C INT,  pk INT primary key);
CREATE UNIQUE INDEX ON t1(a, b);

SELECT DISTINCT * FROM T1 WHERE a = c;

Then we would have PlannerInfo.unique_exprs as below
[
[EC(a, c), EC(b)],
[EC(pk)]
]

RelOptInfo(t1) would have 2 UniqueKeys.
UniqueKey1 {unique_expr_indexes=bms{0}, multinull=false]
UniqueKey2 {unique_expr_indexes=bms{1}, multinull=false]

The design will benefit many table joins cases. For example, 10 tables
join. Each table has a primary key (a, b).  Then we would have a UniqueKey like
this.

JoinRel{1,2,3,4} -  {t1.a, t1.b, t2.a, t2.b, t3.a, t3.b t4.a t4.b}
JoinRel{1,2,3,4, 5} -  {t1.a, t1.b, t2.a, t2.b, t3.a, t3.b t4.a t4.b t5.a t5.b}

This would be memory consuming and building such UniqueKey is CPU consuming as
well. With the new design, we can store it as

PlannerInfo.unique_exprs =
[
[t1.a, t1.b], -- EC is ignored in document.
[t2.a, t2.b],
[t3.a, t3.b],
[t4.a, t4.b],
[t5.a, t5.b],
[t6.a, t6.b],
[t7.a, t7.b],
[t8.a, t8.b],
[t9.a, t9.b],
[t10.a, t10.b],
]

JoinRel{1,2,3,4} -  Bitmapset{0,1,2,3} -- one bitmapword.
JoinRel{1,2,3,4,5} -  Bitmapset{0,1,2,3,4} -- one bitmapword.

3. Define a new SingleRow node and use it in joinrel as well.

+typedef struct SingleRow
+{
+ NodeTag type;
+ Index relid;
+} SingleRow;

SELECT * FROM t1, t2 WHERE t2.pk = 1;

PlannerInfo.unique_exprs
[
[t1.a, t1.b],
SingleRow{relid=2}
]

JoinRel{t1} - Bitmapset{0}
JoinRel{t2} - Bitmapset{1}
JoinRelt{1, 2} Bitmapset{0, 1} -- SingleRow will never be expanded to dedicated
exprs.

4. Cut the useless UniqueKey totally on the baserel stage based on
   root->distinct_pathkey.  If we want to use it anywhere else, I think this
   design is OK as well. for example: group by UniqueKey.

5. Implemented the relation_is_distinct_for(root, rel, distinct_pathkey)
   effectively.  Here I used distinct_pathkey rather than
   Query->distinctClause.

Since I implemented the EC in PlannerInfo.unique_exprs point to the
PathKey.pk_eqclass, so we can compare the address directly with '=', rather than
equal(a, b). (since qual would check the address as well, so even I use equal,
the performance is good as well). SingleRow is handled as well for this case.

You can check the more details in the attached patch.  Any feedback is welcome.

-- 
Best Regards
Andy Fan (https://www.aliyun.com/)

Attachment

Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)

From
Ranier Vilela
Date:

>You can check the more details in the attached patch. Any feedback is welcome.

I have tiny comments about your patch:

1. name of file is uniquekey.c?

+ * pathkeys.c
+ *  Utilities for maintaining uniquekey.

2. Variable "PathKey *pathkey" at function: add_uniquekey_for_uniqueindex, can have scope reduced.

+ indexpr_item = list_head(unique_index->indexprs);
+ for (c = 0; c < unique_index->nkeycolumns; c++)
+ {
+ PathKey *pathkey;

3. Variable int c = 0, has a redundant initialization at function: add_uniquekey_for_uniqueindex.

4. Has one word with misspelled?

"/* We can't *guarantee* an FuncExpr will not return NULLs */"

4. Variable int i = -1, has a redudant initialization at function: uniquekey_contains_in

5. __attribute__ ((unused)) at function: build_composited_uniquekey, is incompatible with msvc.

6. Postgres uses a newline after variables declarations.

regards,

Ranier Vilela


> 4. Cut the useless UniqueKey totally on the baserel stage based on
>    root->distinct_pathkey.  If we want to use it anywhere else, I think this
>    design is OK as well. for example: group by UniqueKey.
>

The intention of this is I want to cut off the useless UniqueKey ASAP. In the
previous patch, I say "if the unique_exprs not exists in root->distinct_paths,
then it is useless". However This looks only works for single rel. As for the
joinrel, we have to maintain the UniqueKey on mergeable join clause for the case
like below.

SELECT DISTINCT t1.pk FROM t1, t2 WHERE t1.a = t2.pk;
or
SELECT DISTINCT t1.pk FROM t1 left join t2 on t1.a = t2.pk;

In this case, t2.pk isn't shown in distinct_pathkey, but it is still useful at
the join stage and not useful after joining.

So how can we maintain the UniqueKey like t2.pk?
1). If t2.pk exists in root->eq_classes, keep it.
2). If t2.pk doesn't exist in RelOptInfo->reltarget after joining, discard it.

Step 1 is not so bad since we have RelOptInfo.eclass_indexes. However step 2
looks pretty boring since we have to check on every RelOptInfo and we may have
lots of RelOptInfo.

Any suggestions on this?

-- 
Best Regards
Andy Fan (https://www.aliyun.com/)



On Tue, Jul 13, 2021 at 5:55 PM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
> > 4. Cut the useless UniqueKey totally on the baserel stage based on
> >    root->distinct_pathkey.  If we want to use it anywhere else, I think this
> >    design is OK as well. for example: group by UniqueKey.
> >
>
> The intention of this is I want to cut off the useless UniqueKey ASAP. In the
> previous patch, I say "if the unique_exprs not exists in root->distinct_paths,
> then it is useless". However This looks only works for single rel. As for the
> joinrel, we have to maintain the UniqueKey on mergeable join clause for the case
> like below.
>
> SELECT DISTINCT t1.pk FROM t1, t2 WHERE t1.a = t2.pk;
> or
> SELECT DISTINCT t1.pk FROM t1 left join t2 on t1.a = t2.pk;
>
> In this case, t2.pk isn't shown in distinct_pathkey, but it is still useful at
> the join stage and not useful after joining.
>
> So how can we maintain the UniqueKey like t2.pk?
> 1). If t2.pk exists in root->eq_classes, keep it.
> 2). If t2.pk doesn't exist in RelOptInfo->reltarget after joining, discard it.
>
> Step 1 is not so bad since we have RelOptInfo.eclass_indexes. However step 2
> looks pretty boring since we have to check on every RelOptInfo and we may have
> lots of RelOptInfo.
>
> Any suggestions on this?
>

Just a function like truncate_useless_pathkey would be OK. For that we need
to handle uniquekey_useful_for_merging and uniquekey_useful_for_distinct.

-- 
Best Regards
Andy Fan (https://www.aliyun.com/)



Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)

From
Andy Fan
Date:
Hi:

I have finished the parts for baserel, joinrel, subquery, distinctrel. I think
the hardest ones have been verified.  Here is the design overview.

1. Use EC instead of expr to cover more UniqueKey cases.
2. Redesign the UniqueKey as below:

@@ -246,6 +246,7 @@ struct PlannerInfo

List   *eq_classes; /* list of active EquivalenceClasses */
+ List   *unique_exprs; /* List of unique expr */

  bool ec_merging_done; /* set true once ECs are canonical */

+typedef struct UniqueKey
+{
+ NodeTag type;
+ Bitmapset *unique_expr_indexes;
+ bool multi_nulls;
+} UniqueKey;
+

PlannerInfo.unique_exprs is a List of unique exprs.  Unique Exprs is a set of
EquivalenceClass. for example:

CREATE TABLE T1(A INT NOT NULL, B INT NOT NULL, C INT,  pk INT primary key);
CREATE UNIQUE INDEX ON t1(a, b);

SELECT DISTINCT * FROM T1 WHERE a = c;

Then we would have PlannerInfo.unique_exprs as below
[
[EC(a, c), EC(b)],
[EC(pk)]
]

RelOptInfo(t1) would have 2 UniqueKeys.
UniqueKey1 {unique_expr_indexes=bms{0}, multinull=false]
UniqueKey2 {unique_expr_indexes=bms{1}, multinull=false]

The design will benefit many table joins cases. For instance a 10- tables join,
each table has a primary key (a, b).  Then we would have a UniqueKey like
this.

JoinRel{1,2,3,4} -  {t1.a, t1.b, t2.a, t2.b, t3.a, t3.b t4.a t4.b}
JoinRel{1,2,3,4,5} -  {t1.a, t1.b, t2.a, t2.b, t3.a, t3.b t4.a t4.b t5.a t5.b}

When more tables are joined, the list would be longer and longer, build the list
consumes both CPU cycles and memory.

With the method as above, we can store it as:

root->unique_exprs =  /* All the UniqueKey is stored once */
[
[t1.a, t1.b], -- EC is ignored in document.
[t2.a, t2.b],
[t3.a, t3.b],
[t4.a, t4.b],
[t5.a, t5.b],
[t6.a, t6.b],
[t7.a, t7.b],
[t8.a, t8.b],
[t9.a, t9.b],
[t10.a, t10.b],
]

JoinRel{1,2,3,4} -  Bitmapset{0,1,2,3} -- one bitmapword.
JoinRel{1,2,3,4,5} -  Bitmapset{0,1,2,3,4} -- one bitmapword.

The member in the bitmap is the index of root->unique_exprs rather than
root->eq_classes because we need to store the SingleRow case in
root->unqiue_exprs as well.

3. Define a new SingleRow node and use it in joinrel as well.

+typedef struct SingleRow
+{
+ NodeTag type;
+ Index relid;
+} SingleRow;

SELECT * FROM t1, t2 WHERE t2.pk = 1;

root->unique_exprs
[
[t1.a, t1.b],
SingleRow{relid=2}
]

JoinRel{t1} - Bitmapset{0}
JoinRel{t2} - Bitmapset{1}

JoinRelt{1, 2} Bitmapset{0, 1} -- SingleRow will never be expanded to dedicated
exprs.

4. Interesting UniqueKey to remove the Useless UniqueKey as soon as possible.

The overall idea is similar with PathKey, I distinguish the UniqueKey for
distinct purpose and useful_for_merging purpose.

SELECT DISTINCT pk FROM  t; -- OK, maintain it all the time, just like
root->query_pathkey.

SELECT DISTINCT t2.c FROM t1, t2 WHERE t1.d = t2.pk; -- T2's UniqueKey PK is
use before t1 join t2, but not useful after it.

Currently the known issue I didn't pass the "interesting UniqueKey" info to
subquery well [2], I'd like to talk more about this when we discuss about
UnqiueKey on subquery part.


5. relation_is_distinct_for

Now I design the function as

+ bool
+ relation_is_distinct_for(PlannerInfo *root, RelOptInfo *rel, List
  *distinct_pathkey)

It is "List *distinct_pathkey", rather than "List *exprs". The reason pathkey
has EC in it, and all the UniqueKey has EC as well. if so, the subset-like
checking is very effective.  As for the distinct/group as no-op case, we have
pathkey all the time. The only drawback of it is some clauses are not-sortable,
in this case, the root->distinct_pathkey and root->group_pathkey is not
set. However it should be rare by practice, so I ignore this part for
now. After all, I can have a relation_is_disticnt_for version for Exprs. I just
not implemented it so far.

6. EC overhead in UnqiueKey & UNION case.

Until now I didn't create any new EC for the UniqueKey case, I just used the
existing ones. However I can't handle the case like

SELECT a, b FROM t1
UNION
SELECT x, y FROM t2;

In this case, there is no EC created with existing code. and I don't want to
create them for the UniqueKey case as well.  so my plan is just not to handle
the case for UNION.

Since we need some big effort from the reviewer, I split the patch into many
smaller chunks.

Patch 1 / Patch 2:  I just split some code which UniqueKey uses but not very
interrelated. Splitting them out to reduce the core patch size.
Patch 3:  still the notnull stuff.  This one doesn't play a big role overall,
even if the design is changed at last, we can just modify some small stuff. IMO,
I don't think it is a blocker issue to move on.
Patch 4:  Support the UniqueKey for baserel.
Patch 5: Support the UniqueKey for joinrel.
Patch 6: Support the UniqueKey for some upper relation, like distinctrel,
groupby rel.

I'd suggest moving on like this:
1. Have an overall review to see if any outstanding issues the patch has.
2. If not, we can review and commit patch 1 & patch 2 to reduce the patch size.
3. Decide which method to take for not null stuff. IMO Tom's method
would be a bit
   complex and the benefit is not very clear to me[1].  So the choices
   include: a). move on UniqueKey stuff until Tom's method is ready. b). Move
   on the UniqueKey with my notnull way, and changes to Tom's method when
   necessary. I prefer method b).
4. Review & Commit the UniqueKey for BaseRel part.
5. Review & Commit the UniqueKey for JoinRel part.
6. Review & Commit the UniqueKey for SubQuery part *without* the Interesting
   UniqueKey well handled.
7. Review & Commit the UniqueKey for SubQuery part *with* the Interesting
   UniqueKey well handled.
8. Discuss about the UniqueKey on partitioned rel case and develop/review/commit
   it.
9. Apply UniqueKey stuff on more user cases rather than DISTINCT.

What do you think about this?

[1] https://www.postgresql.org/message-id/CAApHDvo5b2pYX%2BdbPy%2BysGBSarezRSfXthX32TZNFm0wPdfKGQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAKU4AWo6-%3D9mg3UQ5UJhGCMw6wyTPyPGgV5oh6dFvwEN%3D%2Bhb_w%40mail.gmail.com


Thanks

Attachment

Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)

From
Zhihong Yu
Date:


On Sun, Aug 15, 2021 at 7:33 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:
Hi:

I have finished the parts for baserel, joinrel, subquery, distinctrel. I think
the hardest ones have been verified.  Here is the design overview.

1. Use EC instead of expr to cover more UniqueKey cases.
2. Redesign the UniqueKey as below:

@@ -246,6 +246,7 @@ struct PlannerInfo

List   *eq_classes; /* list of active EquivalenceClasses */
+ List   *unique_exprs; /* List of unique expr */

  bool ec_merging_done; /* set true once ECs are canonical */

+typedef struct UniqueKey
+{
+ NodeTag type;
+ Bitmapset *unique_expr_indexes;
+ bool multi_nulls;
+} UniqueKey;
+

PlannerInfo.unique_exprs is a List of unique exprs.  Unique Exprs is a set of
EquivalenceClass. for example:

CREATE TABLE T1(A INT NOT NULL, B INT NOT NULL, C INT,  pk INT primary key);
CREATE UNIQUE INDEX ON t1(a, b);

SELECT DISTINCT * FROM T1 WHERE a = c;

Then we would have PlannerInfo.unique_exprs as below
[
[EC(a, c), EC(b)],
[EC(pk)]
]

RelOptInfo(t1) would have 2 UniqueKeys.
UniqueKey1 {unique_expr_indexes=bms{0}, multinull=false]
UniqueKey2 {unique_expr_indexes=bms{1}, multinull=false]

The design will benefit many table joins cases. For instance a 10- tables join,
each table has a primary key (a, b).  Then we would have a UniqueKey like
this.

JoinRel{1,2,3,4} -  {t1.a, t1.b, t2.a, t2.b, t3.a, t3.b t4.a t4.b}
JoinRel{1,2,3,4,5} -  {t1.a, t1.b, t2.a, t2.b, t3.a, t3.b t4.a t4.b t5.a t5.b}

When more tables are joined, the list would be longer and longer, build the list
consumes both CPU cycles and memory.

With the method as above, we can store it as:

root->unique_exprs =  /* All the UniqueKey is stored once */
[
[t1.a, t1.b], -- EC is ignored in document.
[t2.a, t2.b],
[t3.a, t3.b],
[t4.a, t4.b],
[t5.a, t5.b],
[t6.a, t6.b],
[t7.a, t7.b],
[t8.a, t8.b],
[t9.a, t9.b],
[t10.a, t10.b],
]

JoinRel{1,2,3,4} -  Bitmapset{0,1,2,3} -- one bitmapword.
JoinRel{1,2,3,4,5} -  Bitmapset{0,1,2,3,4} -- one bitmapword.

The member in the bitmap is the index of root->unique_exprs rather than
root->eq_classes because we need to store the SingleRow case in
root->unqiue_exprs as well.

3. Define a new SingleRow node and use it in joinrel as well.

+typedef struct SingleRow
+{
+ NodeTag type;
+ Index relid;
+} SingleRow;

SELECT * FROM t1, t2 WHERE t2.pk = 1;

root->unique_exprs
[
[t1.a, t1.b],
SingleRow{relid=2}
]

JoinRel{t1} - Bitmapset{0}
JoinRel{t2} - Bitmapset{1}

JoinRelt{1, 2} Bitmapset{0, 1} -- SingleRow will never be expanded to dedicated
exprs.

4. Interesting UniqueKey to remove the Useless UniqueKey as soon as possible.

The overall idea is similar with PathKey, I distinguish the UniqueKey for
distinct purpose and useful_for_merging purpose.

SELECT DISTINCT pk FROM  t; -- OK, maintain it all the time, just like
root->query_pathkey.

SELECT DISTINCT t2.c FROM t1, t2 WHERE t1.d = t2.pk; -- T2's UniqueKey PK is
use before t1 join t2, but not useful after it.

Currently the known issue I didn't pass the "interesting UniqueKey" info to
subquery well [2], I'd like to talk more about this when we discuss about
UnqiueKey on subquery part.


5. relation_is_distinct_for

Now I design the function as

+ bool
+ relation_is_distinct_for(PlannerInfo *root, RelOptInfo *rel, List
  *distinct_pathkey)

It is "List *distinct_pathkey", rather than "List *exprs". The reason pathkey
has EC in it, and all the UniqueKey has EC as well. if so, the subset-like
checking is very effective.  As for the distinct/group as no-op case, we have
pathkey all the time. The only drawback of it is some clauses are not-sortable,
in this case, the root->distinct_pathkey and root->group_pathkey is not
set. However it should be rare by practice, so I ignore this part for
now. After all, I can have a relation_is_disticnt_for version for Exprs. I just
not implemented it so far.

6. EC overhead in UnqiueKey & UNION case.

Until now I didn't create any new EC for the UniqueKey case, I just used the
existing ones. However I can't handle the case like

SELECT a, b FROM t1
UNION
SELECT x, y FROM t2;

In this case, there is no EC created with existing code. and I don't want to
create them for the UniqueKey case as well.  so my plan is just not to handle
the case for UNION.

Since we need some big effort from the reviewer, I split the patch into many
smaller chunks.

Patch 1 / Patch 2:  I just split some code which UniqueKey uses but not very
interrelated. Splitting them out to reduce the core patch size.
Patch 3:  still the notnull stuff.  This one doesn't play a big role overall,
even if the design is changed at last, we can just modify some small stuff. IMO,
I don't think it is a blocker issue to move on.
Patch 4:  Support the UniqueKey for baserel.
Patch 5: Support the UniqueKey for joinrel.
Patch 6: Support the UniqueKey for some upper relation, like distinctrel,
groupby rel.

I'd suggest moving on like this:
1. Have an overall review to see if any outstanding issues the patch has.
2. If not, we can review and commit patch 1 & patch 2 to reduce the patch size.
3. Decide which method to take for not null stuff. IMO Tom's method
would be a bit
   complex and the benefit is not very clear to me[1].  So the choices
   include: a). move on UniqueKey stuff until Tom's method is ready. b). Move
   on the UniqueKey with my notnull way, and changes to Tom's method when
   necessary. I prefer method b).
4. Review & Commit the UniqueKey for BaseRel part.
5. Review & Commit the UniqueKey for JoinRel part.
6. Review & Commit the UniqueKey for SubQuery part *without* the Interesting
   UniqueKey well handled.
7. Review & Commit the UniqueKey for SubQuery part *with* the Interesting
   UniqueKey well handled.
8. Discuss about the UniqueKey on partitioned rel case and develop/review/commit
   it.
9. Apply UniqueKey stuff on more user cases rather than DISTINCT.

What do you think about this?

[1] https://www.postgresql.org/message-id/CAApHDvo5b2pYX%2BdbPy%2BysGBSarezRSfXthX32TZNFm0wPdfKGQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAKU4AWo6-%3D9mg3UQ5UJhGCMw6wyTPyPGgV5oh6dFvwEN%3D%2Bhb_w%40mail.gmail.com


Thanks
Hi,
For v3-0005-Support-UniqueKey-on-JoinRel.patch :

+static void populate_joinrel_composited_uniquekey(PlannerInfo *root,

populate_joinrel_composited_uniquekey -> populate_joinrel_composite_uniquekey (without the trailing d for composite)

For populate_joinrel_uniquekeys():

+       foreach(lc, outerrel->uniquekeys)
+       {
...
+           return;

Should the remaining unique keys be considered ?

For populate_joinrel_uniquekey_for_rel():

+       else if (bms_equal(r->right_relids, rel->relids) && r->left_ec != NULL)
+       {
+           other_ecs = lappend(other_ecs, r->right_ec);
+           other_relids = bms_add_members(other_relids, r->left_relids);

It seems the append to other_ecs is the same as the one for the `bms_equal(r->left_relids, rel->relids) && r->right_ec != NULL` case. Is this correct ?

Cheers

Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)

From
Andy Fan
Date:
Hi Zhihong,

On Mon, Aug 16, 2021 at 12:35 AM Zhihong Yu <zyu@yugabyte.com> wrote:
>
>
>
> On Sun, Aug 15, 2021 at 7:33 AM Andy Fan <zhihui.fan1213@gmail.com> wrote:
>>
>> Hi:
>>
>> I have finished the parts for baserel, joinrel, subquery, distinctrel. I think
>> the hardest ones have been verified.  Here is the design overview.
>>
>> 1. Use EC instead of expr to cover more UniqueKey cases.
>> 2. Redesign the UniqueKey as below:
>>
>> @@ -246,6 +246,7 @@ struct PlannerInfo
>>
>> List   *eq_classes; /* list of active EquivalenceClasses */
>> + List   *unique_exprs; /* List of unique expr */
>>
>>   bool ec_merging_done; /* set true once ECs are canonical */
>>
>> +typedef struct UniqueKey
>> +{
>> + NodeTag type;
>> + Bitmapset *unique_expr_indexes;
>> + bool multi_nulls;
>> +} UniqueKey;
>> +
>>
>> PlannerInfo.unique_exprs is a List of unique exprs.  Unique Exprs is a set of
>> EquivalenceClass. for example:
>>
>> CREATE TABLE T1(A INT NOT NULL, B INT NOT NULL, C INT,  pk INT primary key);
>> CREATE UNIQUE INDEX ON t1(a, b);
>>
>> SELECT DISTINCT * FROM T1 WHERE a = c;
>>
>> Then we would have PlannerInfo.unique_exprs as below
>> [
>> [EC(a, c), EC(b)],
>> [EC(pk)]
>> ]
>>
>> RelOptInfo(t1) would have 2 UniqueKeys.
>> UniqueKey1 {unique_expr_indexes=bms{0}, multinull=false]
>> UniqueKey2 {unique_expr_indexes=bms{1}, multinull=false]
>>
>> The design will benefit many table joins cases. For instance a 10- tables join,
>> each table has a primary key (a, b).  Then we would have a UniqueKey like
>> this.
>>
>> JoinRel{1,2,3,4} -  {t1.a, t1.b, t2.a, t2.b, t3.a, t3.b t4.a t4.b}
>> JoinRel{1,2,3,4,5} -  {t1.a, t1.b, t2.a, t2.b, t3.a, t3.b t4.a t4.b t5.a t5.b}
>>
>> When more tables are joined, the list would be longer and longer, build the list
>> consumes both CPU cycles and memory.
>>
>> With the method as above, we can store it as:
>>
>> root->unique_exprs =  /* All the UniqueKey is stored once */
>> [
>> [t1.a, t1.b], -- EC is ignored in document.
>> [t2.a, t2.b],
>> [t3.a, t3.b],
>> [t4.a, t4.b],
>> [t5.a, t5.b],
>> [t6.a, t6.b],
>> [t7.a, t7.b],
>> [t8.a, t8.b],
>> [t9.a, t9.b],
>> [t10.a, t10.b],
>> ]
>>
>> JoinRel{1,2,3,4} -  Bitmapset{0,1,2,3} -- one bitmapword.
>> JoinRel{1,2,3,4,5} -  Bitmapset{0,1,2,3,4} -- one bitmapword.
>>
>> The member in the bitmap is the index of root->unique_exprs rather than
>> root->eq_classes because we need to store the SingleRow case in
>> root->unqiue_exprs as well.
>>
>> 3. Define a new SingleRow node and use it in joinrel as well.
>>
>> +typedef struct SingleRow
>> +{
>> + NodeTag type;
>> + Index relid;
>> +} SingleRow;
>>
>> SELECT * FROM t1, t2 WHERE t2.pk = 1;
>>
>> root->unique_exprs
>> [
>> [t1.a, t1.b],
>> SingleRow{relid=2}
>> ]
>>
>> JoinRel{t1} - Bitmapset{0}
>> JoinRel{t2} - Bitmapset{1}
>>
>> JoinRelt{1, 2} Bitmapset{0, 1} -- SingleRow will never be expanded to dedicated
>> exprs.
>>
>> 4. Interesting UniqueKey to remove the Useless UniqueKey as soon as possible.
>>
>> The overall idea is similar with PathKey, I distinguish the UniqueKey for
>> distinct purpose and useful_for_merging purpose.
>>
>> SELECT DISTINCT pk FROM  t; -- OK, maintain it all the time, just like
>> root->query_pathkey.
>>
>> SELECT DISTINCT t2.c FROM t1, t2 WHERE t1.d = t2.pk; -- T2's UniqueKey PK is
>> use before t1 join t2, but not useful after it.
>>
>> Currently the known issue I didn't pass the "interesting UniqueKey" info to
>> subquery well [2], I'd like to talk more about this when we discuss about
>> UnqiueKey on subquery part.
>>
>>
>> 5. relation_is_distinct_for
>>
>> Now I design the function as
>>
>> + bool
>> + relation_is_distinct_for(PlannerInfo *root, RelOptInfo *rel, List
>>   *distinct_pathkey)
>>
>> It is "List *distinct_pathkey", rather than "List *exprs". The reason pathkey
>> has EC in it, and all the UniqueKey has EC as well. if so, the subset-like
>> checking is very effective.  As for the distinct/group as no-op case, we have
>> pathkey all the time. The only drawback of it is some clauses are not-sortable,
>> in this case, the root->distinct_pathkey and root->group_pathkey is not
>> set. However it should be rare by practice, so I ignore this part for
>> now. After all, I can have a relation_is_disticnt_for version for Exprs. I just
>> not implemented it so far.
>>
>> 6. EC overhead in UnqiueKey & UNION case.
>>
>> Until now I didn't create any new EC for the UniqueKey case, I just used the
>> existing ones. However I can't handle the case like
>>
>> SELECT a, b FROM t1
>> UNION
>> SELECT x, y FROM t2;
>>
>> In this case, there is no EC created with existing code. and I don't want to
>> create them for the UniqueKey case as well.  so my plan is just not to handle
>> the case for UNION.
>>
>> Since we need some big effort from the reviewer, I split the patch into many
>> smaller chunks.
>>
>> Patch 1 / Patch 2:  I just split some code which UniqueKey uses but not very
>> interrelated. Splitting them out to reduce the core patch size.
>> Patch 3:  still the notnull stuff.  This one doesn't play a big role overall,
>> even if the design is changed at last, we can just modify some small stuff. IMO,
>> I don't think it is a blocker issue to move on.
>> Patch 4:  Support the UniqueKey for baserel.
>> Patch 5: Support the UniqueKey for joinrel.
>> Patch 6: Support the UniqueKey for some upper relation, like distinctrel,
>> groupby rel.
>>
>> I'd suggest moving on like this:
>> 1. Have an overall review to see if any outstanding issues the patch has.
>> 2. If not, we can review and commit patch 1 & patch 2 to reduce the patch size.
>> 3. Decide which method to take for not null stuff. IMO Tom's method
>> would be a bit
>>    complex and the benefit is not very clear to me[1].  So the choices
>>    include: a). move on UniqueKey stuff until Tom's method is ready. b). Move
>>    on the UniqueKey with my notnull way, and changes to Tom's method when
>>    necessary. I prefer method b).
>> 4. Review & Commit the UniqueKey for BaseRel part.
>> 5. Review & Commit the UniqueKey for JoinRel part.
>> 6. Review & Commit the UniqueKey for SubQuery part *without* the Interesting
>>    UniqueKey well handled.
>> 7. Review & Commit the UniqueKey for SubQuery part *with* the Interesting
>>    UniqueKey well handled.
>> 8. Discuss about the UniqueKey on partitioned rel case and develop/review/commit
>>    it.
>> 9. Apply UniqueKey stuff on more user cases rather than DISTINCT.
>>
>> What do you think about this?
>>
>> [1] https://www.postgresql.org/message-id/CAApHDvo5b2pYX%2BdbPy%2BysGBSarezRSfXthX32TZNFm0wPdfKGQ%40mail.gmail.com
>> [2]
https://www.postgresql.org/message-id/CAKU4AWo6-%3D9mg3UQ5UJhGCMw6wyTPyPGgV5oh6dFvwEN%3D%2Bhb_w%40mail.gmail.com
>>
>>
>> Thanks
>
> Hi,
> For v3-0005-Support-UniqueKey-on-JoinRel.patch :
>
> +static void populate_joinrel_composited_uniquekey(PlannerInfo *root,
>
> populate_joinrel_composited_uniquekey -> populate_joinrel_composite_uniquekey (without the trailing d for composite)
>
> For populate_joinrel_uniquekeys():
>
> +       foreach(lc, outerrel->uniquekeys)
> +       {
> ...
> +           return;
>
> Should the remaining unique keys be considered ?
>
> For populate_joinrel_uniquekey_for_rel():
>
> +       else if (bms_equal(r->right_relids, rel->relids) && r->left_ec != NULL)
> +       {
> +           other_ecs = lappend(other_ecs, r->right_ec);
> +           other_relids = bms_add_members(other_relids, r->left_relids);
>
> It seems the append to other_ecs is the same as the one for the `bms_equal(r->left_relids, rel->relids) &&
r->right_ec!= NULL` case. Is this correct ?
 
>

Correct, both of them are bugs, I will fix them in the next version,
including the "tailing d".
Thanks for your review!

-- 
Best Regards
Andy Fan (https://www.aliyun.com/)



Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)

From
Andy Fan
Date:
On Tue, Jul 6, 2021 at 9:14 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> David Rowley <dgrowleyml@gmail.com> writes:
> > Tom, I'm wondering if you might get a chance to draw up a design for
> > what you've got in mind with this?  I assume adding a new field in
> > Var, but I'm drawing a few blanks on how things might work for equal()
> > when one Var has the field set and another does not.
>
> As I said before, it hasn't progressed much past the handwaving stage,
> but it does seem like it's time to get it done.  I doubt I'll have any
> cycles for it during the commitfest, but maybe I can devote a block of
> time during August.
>
>                         regards, tom lane

Hi Tom:  do you get a chance to work on this? Looks like we have to fix
this one before we can move on to the uniquekey stuff.

-- 
Best Regards
Andy Fan



Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)

From
Andy Fan
Date:
Unknown why people have so little interest in this, AFAICS, there are more great
usages of UniqueKey rather than the 'marking-distinct-as-noop'. The most
exciting usage for me is it is helpful for JoinRel's pathkey.

Take an example of this:

SELECT .. FROM t1 JOIN t2 ON t1.any_column = t2.uniquekey;
SELECT .. FROM t1 LEFT JOIN t2 ON t1.any_column = t2.uniquekey;

Suppose before the join, t1 has a pathkey X,  t2 has PathKey y. Then
(t1.X, t2.Y) is
ordered as well for JoinRel(t1, t2).  Then the pathkey of JoinRel(t1,
t2)  has a lot
of usage again.  Currently after the joining, only the outer join's
pathkey is maintained.

As for the extra planning cost of this, it looks like our current
infrastructure can support it
well since we know all the information when we generate the pathkey
for the Join Path.

-- 
Best Regards
Andy Fan (https://www.aliyun.com/)



Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)

From
Dmitry Dolgov
Date:
> On Wed, Jul 07, 2021 at 01:20:24PM +1200, David Rowley wrote:
> On Wed, 7 Jul 2021 at 13:04, Andy Fan <zhihui.fan1213@gmail.com> wrote:
> > Looking forward to watching this change closely, thank you both David and Tom!
> > But I still don't understand what the faults my way have , do you mind telling the
> > details?
>
> The problem is that we don't need 6 different ways to determine if a
> Var can be NULL or not.  You're proposing to add a method using
> Bitmapsets and Tom has some proposing ideas around tracking
> nullability in Vars.  We don't need both.
>
> It seems to me that having it in Var allows us to have a much finer
> gradient about where exactly a Var can be NULL.
>
> For example: SELECT nullablecol FROM tab WHERE nullablecol = <value>;
>
> If the equality operator is strict then the nullablecol can be NULL in
> the WHERE clause but not in the SELECT list. Tom's idea should allow
> us to determine both of those things but your idea cannot tell them
> apart, so, in theory at least, Tom's idea seems better to me.

Hi,

This patch still occupies some place in my head, so I would like to ask few
questions to see where it's going:

* From the last emails in this thread I gather that the main obstacle from the
  design side of things is functionality around figuring out if a Var could be
  NULL or not, and everyone is waiting for a counterproposal about how to do
  that better. Is that correct?

* Is this thread only about notnullattrs field in RelOptInfo, or about the
  UniqueKeys patch series after all? The title indicates the first one, but the
  last posted patch series included everything as far as I can see.

* Putting my archaeologist's hat on, I've tried to find out what this
  alternative proposal was about. The result findings are scattered through the
  archives -- which proves that it's a hard topic indeed -- and participants of
  this thread are probably more aware about them than I am. The most detailed
  handwaving I found in the thread [1], with an idea to introduce NullableVar
  wrapper created by parser, is that it? It makes more clear why such approach
  could be more beneficial than a new field in RelOptInfo. And if the thread is
  only about the notnullattrs, I guess it would be indeed enough to object.

* Now, how essential is notnullattrs functionality for the UniqueKeys patch
  series? From what I understand, it's being used to set multi_nulls field of
  every UniqueKey to indicate whether this key could produce NULL or not (which
  means no guaranties about uniqueness could be provided). Is there a way to
  limit the scope of the patch series and introduce UniqueKeys without require
  multi_nulls at all, or (again, in some limited situations) fetch necessary
  information somehow on the fly e.g. only from catcache without introducing
  any new infrastructure?

[1]: https://www.postgresql.org/message-id/25142.1580847861%40sss.pgh.pa.us



Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)

From
Andy Fan
Date:
Hi Dmitry:

On Wed, Nov 17, 2021 at 11:20 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Wed, Jul 07, 2021 at 01:20:24PM +1200, David Rowley wrote:
> > On Wed, 7 Jul 2021 at 13:04, Andy Fan <zhihui.fan1213@gmail.com> wrote:
> > > Looking forward to watching this change closely, thank you both David and Tom!
> > > But I still don't understand what the faults my way have , do you mind telling the
> > > details?
> >
> > The problem is that we don't need 6 different ways to determine if a
> > Var can be NULL or not.  You're proposing to add a method using
> > Bitmapsets and Tom has some proposing ideas around tracking
> > nullability in Vars.  We don't need both.
> >
> > It seems to me that having it in Var allows us to have a much finer
> > gradient about where exactly a Var can be NULL.
> >
> > For example: SELECT nullablecol FROM tab WHERE nullablecol = <value>;
> >
> > If the equality operator is strict then the nullablecol can be NULL in
> > the WHERE clause but not in the SELECT list. Tom's idea should allow
> > us to determine both of those things but your idea cannot tell them
> > apart, so, in theory at least, Tom's idea seems better to me.
>
> Hi,
>
> This patch still occupies some place in my head, so I would like to ask few
> questions to see where it's going:

Thanks for that!

> * From the last emails in this thread I gather that the main obstacle from the
>   design side of things is functionality around figuring out if a Var could be
>   NULL or not, and everyone is waiting for a counterproposal about how to do
>   that better. Is that correct?

That is correct.

> * Is this thread only about notnullattrs field in RelOptInfo, or about the
>   UniqueKeys patch series after all? The title indicates the first one, but the
>   last posted patch series included everything as far as I can see.

This thread is talking about the path series after all. Not null maintenance
is the first step of the UniqueKey patch.  If the not null issue can't
be addressed,
the overall UniqueKey patch would be hopeless.

> * Putting my archaeologist's hat on, I've tried to find out what this
>   alternative proposal was about. The result findings are scattered through the
>   archives -- which proves that it's a hard topic indeed -- and participants of
>   this thread are probably more aware about them than I am. The most detailed
>   handwaving I found in the thread [1], with an idea to introduce NullableVar
>   wrapper created by parser, is that it? It makes more clear why such approach
>   could be more beneficial than a new field in RelOptInfo. And if the thread is
>   only about the notnullattrs, I guess it would be indeed enough to object.
>
> * Now, how essential is notnullattrs functionality for the UniqueKeys patch
>   series? From what I understand, it's being used to set multi_nulls field of
>   every UniqueKey to indicate whether this key could produce NULL or not (which
>   means no guaranties about uniqueness could be provided). Is there a way to
>   limit the scope of the patch series and introduce UniqueKeys without require
>   multi_nulls at all, or (again, in some limited situations) fetch necessary
>   information somehow on the fly e.g. only from catcache without introducing
>   any new infrastructure?
>

_I_ think there is no way to bypass that.  I guess Tom has a bigger plan on
Var (not only for notnull), but no time to invest in them so far.  If
that is the case,
personally I think we can go ahead with my method first and continue the review
process.


-- 
Best Regards
Andy Fan



Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)

From
Greg Stark
Date:
On Mon, 22 Nov 2021 at 02:14, Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
> _I_ think there is no way to bypass that.  I guess Tom has a bigger plan on
> Var (not only for notnull), but no time to invest in them so far.  If
> that is the case,
> personally I think we can go ahead with my method first and continue the review
> process.

This discussion has gone on for two years now and meandered into
different directions. There have been a number of interesting
proposals and patches in that time but it's not clear now what patch
is even under consideration and what questions remain for it. And I
think this message from last November is the last comment on it so I
wonder if it's reached a bit of an impasse.

I think I would suggest starting a fresh thread with a patch distilled
from the previous discussions. Then once that's settled repeat with
additional patches, keeping the discussion focused just on the current
change.

Personally I think these kinds of optimizations are important because
they're what allow people to use SQL without micro-optimizing each
query.

-- 
greg