Re: [PATCH] Keeps tracking the uniqueness with UniqueKey - Mailing list pgsql-hackers

From David Rowley
Subject Re: [PATCH] Keeps tracking the uniqueness with UniqueKey
Date
Msg-id CAApHDvr_c5EQ3gBvogSTNgm-FOZ=5sfH4snWTe=+tEQ2Eg_=xA@mail.gmail.com
Whole thread Raw
In response to Re: [PATCH] Keeps tracking the uniqueness with UniqueKey  (Andy Fan <zhihui.fan1213@gmail.com>)
List pgsql-hackers
On Wed, 1 Apr 2020 at 13:45, Andy Fan <zhihui.fan1213@gmail.com> wrote:
>> 5. I think you should be performing a bms_del_member during join
>> removal rather than removing this Assert()
>>
>> - Assert(bms_equal(rel->relids, root->all_baserels));
>>
>> FWIW, it's far from perfect that you've needed to delay the left join
>> removal, but I do understand why you've done it. It's also far from
>> perfect that you're including removed relations in the
>> total_table_pages calculation. c6e4133fae1 took some measures to
>> improve this calculation and this is making it worse again.
>>
> Since the removed relation depends on the UniqueKey which has to be
> calculated after  total_table_pages calculation in current code, so that's
> something I must do.  But if the relation is not removable,  there is no waste
> at all.  If it is removable,  such gain will much higher than the loss.  I'm
> not sure this should be a concern.

The reason join removals was done so early in planning before was to
save the planner from having to do additional work for relations which
were going to be removed later.  For example, building path lists.

> Actually looks the current remove_useless_join has some limits which can't
> remove a joinrel,  I still didn't figure out why.  In the past we have some limited
> ability to detect the unqiueness after join, so that's would be ok.  Since  we have
> such ability now,  this may be another opportunity to improve the join_is_removable
> function, but I'd not like put such thing in this patch.

Yeah, there's certainly more left join shapes that we could remove.
e.g when the left join relation is not a singleton rel.  We shouldn't
do anything to purposefully block additional join removals as a result
of adding UniqueKeys, but likely shouldn't go to any trouble to make
additional ones work. That can be done later.

> Since you said "far from perfect" twice for this point and I only get one reason (we
> may plan a node which we removed later),  do I miss the other one?

a) additional planning work by not removing the join sooner. b) wrong
total page calculation.

In theory b) could be fixed by subtracting the removed join rels pages
after we remove it, but unfortunately, there's no point since we've
built the paths by that time already and we really only use the value
to determine how much IO is going to be random vs sequential, which is
determined during set_base_rel_pathlists()

>> d. not_null_cols should not be a Relids type, it should be Bitmapset.
>>
> If it is a Bitmapset,  we have to pass it with "&" usually.  is it our practice?

Well, a Bitmapset pointer.   Relids is saved for range table indexes.
Storing anything else in there is likely to lead to confusion.

>> 12. The comment in the code below is not true. The List contains
>> Lists, of which contain UniqueKeys
>>
>> List    *uniquekeys; /* List of UniqueKey */
>>
> It is a list of UniqueKey,  the UniqueKey can have a list of exprs.

Hmm, so this is what I called a UniqueKeySet in the original patch.
I'm a bit divided by that change. With PathKeys, technically you can
make use of a Path with a given set of PathKeys if you only require
some leading subset of those keys.  That's not the case for
UniqueKeys, it's all or nothing, so perhaps having the singular name
is better than the plural name I gave it. However, I'm not certain.

(Really PathKey does not seem like a great name in the first place
since it has nothing to do with keys)

>> 13. I'm having trouble parsing the final sentence in:
>>
>> + * can only guarantee the uniqueness without considering the null values. This
>> + * field is necessary for remove_useless_join & reduce_unique_semijions since
>> + * these cases don't care about the null values.
>>
>> Why is the field which stores the nullability of the key required for
>> code that does not care about the nullability of the key?
>>
> The guarantee is introduced to for the following cases:
>
> create table t1 (a int primary key, b int);
> create table t2 (a int primary key, b int);
> select .. from t1,  (select b from t2 group by t2) t2 ..;
>
> -- b is nullable.  so t2(b) can't be a normal UniqueKey (which means b may have some
> duplicated rows)
> create unique index t2_uk_b on t2(b);
>
> -- the left join still can be removed since t2.b is a unique index and the nullable
> doesn't matter here.
> select t1.* from t1 left join t2 on (t1.b = t2.b);
>
> do you think we have can do some optimization in this case? I don't understand
> your question well.

OK, so by "don't care", you mean, don't duplicate NULL values.  I
assumed you had meant that it does not matter either way, as in: don't
mind if there are NULL values or not. It might be best to have a go at
changing the wording to be more explicit to what you mean there.



pgsql-hackers by date:

Previous
From: 曾文旌
Date:
Subject: Re: [Proposal] Global temporary tables
Next
From: Prabhat Sahu
Date:
Subject: Re: [Proposal] Global temporary tables