Re: UniqueKey v2 - Mailing list pgsql-hackers

From Andy Fan
Subject Re: UniqueKey v2
Date
Msg-id 87h6f1kqr6.fsf@163.com
Whole thread Raw
In response to Re: UniqueKey v2  (Antonin Houska <ah@cybertec.at>)
Responses Re: UniqueKey v2
List pgsql-hackers
Antonin Houska <ah@cybertec.at> writes:

>> Could you make the reason clearer for adding 'List *opfamily_lists;'
>> into UniqueKey?  You said "This is needed to create ECs in the parent
>> query if the upper relation represents a subquery." but I didn't get the
>> it. Since we need to maintain the UniqueKey in the many places, I'd like
>> to keep it as simple as possbile. Of course, anything essentical should
>> be added for sure. 
>
> If unique keys are generated for a subquery output, they also need to be
> created for the corresponding relation in the upper query ("sub" in the
> following example):

OK.
>
> select * from tab1 left join (select * from tab2) sub;
>
> However, to create an unique key for "sub", you need an EC for each expression
> of the key.

OK.
> And to create an EC, you in turn need the list of operator
> families.

I'm thinking if we need to "create" any EC. Can you find out a user case
where the outer EC is missed and the UniqueKey is still interesting? I
don't have an example now.  

convert_subquery_pathkeys has a similar sistuation and has the following
codes:

                outer_ec =
                    get_eclass_for_sort_expr(root,
                                             (Expr *) outer_var,
                                             sub_eclass->ec_opfamilies,
                                             sub_member->em_datatype,
                                             sub_eclass->ec_collation,
                                             0,
                                             rel->relids,
                                             NULL,
                                             false);

                /*
                 * If we don't find a matching EC, sub-pathkey isn't
                 * interesting to the outer query
                 */
                if (outer_ec)
                    best_pathkey =
                        make_canonical_pathkey(root,
                                               outer_ec,
                                               sub_pathkey->pk_opfamily,
                                               sub_pathkey->pk_strategy,
                                               sub_pathkey->pk_nulls_first);
            }

> Even if the parent query already had ECs for the columns of "sub" which are
> contained in the unique key, you need to make sure that those ECs are
> "compatible" with the ECs of the subquery which generated the unique key. That
> is, if an EC of the subquery considers certain input values equal, the EC of
> the parent query must also be able to determine if they are equal or not.
>
>> > * RelOptInfo.notnullattrs
>> >
>> >   My understanding is that this field contains the set attributes whose
>> >   uniqueness is guaranteed by the unique key. They are acceptable because they
>> >   are either 1) not allowed to be NULL due to NOT NULL constraint or 2) NULL
>> >   value makes the containing row filtered out, so the row cannot break
>> >   uniqueness of the output. Am I right?
>> >
>> >   If so, I suggest to change the field name to something less generic, maybe
>> >   'uniquekey_attrs' or 'uniquekey_candidate_attrs', and adding a comment that
>> >   more checks are needed before particular attribute can actually be used in
>> >   UniqueKey.
>> 
>> I don't think so, UniqueKey is just one of the places to use this
>> not-null property, see 3af704098 for the another user case of it. 
>> 
>> (Because of 3af704098, we should leverage notnullattnums somehow in this
>> patch, which will be included in the next version as well).
>
> In your patch you modify 'notnullattrs' in add_base_clause_to_rel(), but that
> does not happen to 'notnullattnums' in the current master branch. Thus I think
> that 'notnullattrs' is specific to the unique keys feature, so the field name
> should be less generic.

OK.

>> >
>> > * uniquekey_useful_for_merging()
>> >
>> >   How does uniqueness relate to merge join? In README.uniquekey you seem to
>> >   point out that a single row is always sorted, but I don't think this
>> >   function is related to that fact. (Instead, I'd expect that pathkeys are
>> >   added to all paths for a single-row relation, but I'm not sure you do that
>> >   in the current version of the patch.)
>> 
>> The merging is for "mergejoinable join clauses", see function
>> eclass_useful_for_merging. Usually I think it as operator "t1.a = t2.a";
>
> My question is: why is the uniqueness important specifically to merge join? I
> understand that join evaluation can be more efficient if we know that one
> input relation is unique (i.e. we only scan that relation until we find the
> first match), but this is not specific to merge join.

So the answer is the "merging" in uniquekey_useful_for_merging() has
nothing with merge join. 

>> > * is_uniquekey_useful_afterjoin()
>> >
>> >   Now that my patch (0004) allows propagation of the unique keys from a
>> >   subquery to the upper query, I was wondering if the UniqueKey structure
>> >   needs the 'use_for_distinct field' I mean we should now propagate the unique
>> >   keys to the parent query whether the subquery has DISTINCT clause or not. I
>> >   noticed that the field is checked by is_uniquekey_useful_afterjoin(), so I
>> >   changed the function to always returned true. However nothing changed in the
>> >   output of regression tests (installcheck). Do you insist that the
>> >   'use_for_distinct' field is needed?
>
> I miss your answer to this comment.

After we considers the uniquekey from subquery, 'use_for_distinct' field
is not needed.

>> >   ** What does the 'multi' word in the function name mean?
>> 
>> multi means multiple, I thought we use this short name in the many
>> places, for ex bt_multi_page_stats after a quick search. 
>
> Why not simply uniquekey_contains_nulls() ?

> Actually I wouldn't say that an instance of UniqueKey contains any value (NULL
> or NOT NULL) because it describes the whole relation rather than particular
> row. I consider UniqueKey to be a set of expressions. How about
> uniquekey_expression_nullable() ?

uniquekey_expression_nullable() is a better name, I will use it in the
next version.

IIUC, we have reached to the agreement based on your latest response for
the most of the questions. Please point me if I missed anything.  

>> >   Of course one problem is that the number of combinations can grow
>> >   exponentially as new relations are joined.
>> 
>> Yes, that's why "rule 1" needed and "How to reduce the overhead" in
>> UniqueKey.README is introduced. 
>
> What if we are interested in unique keys of a subquery, but the subquery has
> no DISTINCT clause?

I agree we should remove the prerequisite of "use_for_distinct". 

-- 
Best Regards
Andy Fan




pgsql-hackers by date:

Previous
From: Andy Fan
Date:
Subject: Re: First draft of PG 17 release notes
Next
From: Erik Wienhold
Date:
Subject: Underscore in positional parameters?