Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series) - Mailing list pgsql-hackers

From Andy Fan
Subject Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)
Date
Msg-id CAKU4AWp3WKyrMKNdg46BvQRD7xkNL9UsLLcLhd5ao=FSbnaN_Q@mail.gmail.com
Whole thread Raw
In response to Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)  (David Rowley <dgrowleyml@gmail.com>)
Responses Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)  (Andy Fan <zhihui.fan1213@gmail.com>)
Re: Keep notnullattrs in RelOptInfo (Was part of UniqueKey patch series)  (Andy Fan <zhihui.fan1213@gmail.com>)
List pgsql-hackers


On Tue, Feb 16, 2021 at 12:01 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 12 Feb 2021 at 15:18, Andy Fan <zhihui.fan1213@gmail.com> wrote:
>
> On Fri, Feb 12, 2021 at 9:02 AM David Rowley <dgrowleyml@gmail.com> wrote:
>> The reason I don't really like this is that it really depends where
>> you want to use RelOptInfo.notnullattrs.  If someone wants to use it
>> to optimise something before the base quals are evaluated then they
>> might be unhappy that they found some NULLs.
>>
>
> Do you mean the notnullattrs is not set correctly before the base quals are
> evaluated?  I think we have lots of data structures which are set just after some
> stage.  but notnullattrs is special because it is set at more than 1 stage.  However
> I'm doubtful it is unacceptable, Some fields ever change their meaning at different
> stages like Var->varno.  If a user has a misunderstanding on it, it probably will find it
> at the testing stage.

You're maybe focusing too much on your use case for notnullattrs. It
only cares about NULLs in the result for each query level.

.... thinks of an example...

OK, let's say I decided that COUNT(*) is faster than COUNT(id) so
decided that I might like to write a patch which rewrite the query to
use COUNT(*) when it was certain that "id" could not contain NULLs.

The query is:

SELECT p.partid, p.partdesc,COUNT(s.saleid) FROM part p LEFT OUTER
JOIN sales s ON p.partid = s.partid GROUP BY p.partid;

sale.saleid is marked as NOT NULL in pg_attribute.  As the writer of
the patch, I checked the comment for notnullattrs and it says "Not
null attrs, start from -FirstLowInvalidHeapAttributeNumber", so I
should be ok to assume since sales.saleid is marked in notnullattrs
that I can rewrite the query?!

The documentation about the RelOptInfo.notnullattrs needs to be clear
what exactly it means. I'm not saying your representation of how to
record NOT NULL in incorrect. I'm saying that you need to be clear
what exactly is being recorded in that field.

If you want it to mean "attribute marked here cannot output NULL
values at this query level", then you should say something along those
lines.

However, having said that, because this is a Bitmapset of
pg_attribute.attnums, it's only possible to record Vars from base
relations.  It does not seem like you have any means to record
attributes that are normally NULLable, but cannot produce NULL values
due to a strict join qual.

e.g: SELECT t.nullable FROM t INNER JOIN j ON t.nullable = j.something;

I'd expect the RelOptInfo for t not to contain a bit for the
"nullable" column, but there's no way to record the fact that the join
RelOptInfo for {t,j} cannot produce a NULL for that column. It might
be quite useful to know that for the UniqueKeys patch.


I read your comments again and find I miss your point before. So I'd summarize
the my current understanding to make sure we are in the same place for further
working.

I want to define a notnullattrs on RelOptInfo struct. The not nullable
may comes from catalog definition or quals on the given query. For example:

CREATE TABLE t(a INT NOT NULL, nullable INT);
SELECT * FROM t; ==>  a is not null for sure by definition.
SELECT * FROM t WHERE nullable > 3; ==> nullable is not null as well by qual.

However the thing becomes complex with the below 2 cases.

1. SELECT * FROM t INNER JOIN j on t.nullable = q.b;
We know t.b will not be null **finally**. But the current plan may something
like this:

                QUERY PLAN
------------------------------------------
 Merge Join
   Merge Cond: (t.nullable = j.something)
   ->  Sort
         Sort Key: t.nullable
         ->  Seq Scan on t
   ->  Sort
         Sort Key: j.something
         ->  Seq Scan on j
(8 rows)

which means the Path "Seq Scan on t" still contains some null values. At least,
we should not assume t.nullable is "not nullable" the base relation stage.

2. SELECT t.a FROM j LEFT JOIN t ON t.b = t.a;
Even the t.a is not null by definition, but it may have null **finally** due to
the outer join.

My current patch doesn't handle the 2 cases well since t.nullable is marked as
NOT NULL for both cases.
 
I know there's another discussion here between Ashutosh and Tom about
PathTarget's and Vars.   I had the Var idea too once myself [1] but it
was quickly shot down.  Tom's reasoning there in [1] seems legit.  I
guess we'd need some sort of planner version of Var and never confuse
it with the Parse version of Var.  That sounds like quite a big
project which would have quite a lot of code churn. I'm not sure how
acceptable it would be to have Var represent both these things.  It
gets complex when you do equal(var1, var2) and expect that to return
true when everything matches apart from the notnull field. We
currently have this issue with the "location" field and we even have a
special macro which just ignores those in equalfuncs.c.  I imagine not
many people would like to expand that to other fields.

It would be good to agree on the correct representation for Vars that
cannot produce NULLs so that we don't shut the door on classes of
optimisation that require something other than what you need for your
case.


Looks we have to maintain not null on the general RelOptInfo level rather than Base
RelOptInfo.  But I don't want to teach Var about the notnull so far.  The reasons are: 1).
We need to maintain the Planner version and Parser version due to the VIEW case. 
2). We have to ignore the extra part  for equal(Var, Var) . 3). Var is usually shared among
different RelOptInfo. which means we have to maintain different copies for this purpose IIUC.

I assume we want to know if a Var is nullable with a function like. 
is_var_notnullable(Var *var,  Relids relids).  If so, we can define the data as below:

struct RelOptInfo {

Bitmapset** notnullattrs;
..
}; 

After this we can implement the function as: 

bool
is_var_notnullable(Var* var, Relids relids)
{
  RelOptInfo *rel = find_rel_by_relids(reldis);
  return bms_is_member(var->varattno, rel->notnullattrs[var->varno]);
}

Probably we can make some hackers to reduce the notnullattrs's memory usage
overhead.

Any thoughts?

 
David

[1] https://www.postgresql.org/message-id/flat/14678.1401639369%40sss.pgh.pa.us#d726d397f86755b64bb09d0c487f975f


--
Best Regards

pgsql-hackers by date:

Previous
From: Noah Misch
Date:
Subject: Re: public schema default ACL
Next
From: Floris Van Nee
Date:
Subject: RE: non-HOT update not looking at FSM for large tuple update