Re: UniqueKey v2 - Mailing list pgsql-hackers
From | Antonin Houska |
---|---|
Subject | Re: UniqueKey v2 |
Date | |
Msg-id | 3739.1715594150@antos Whole thread Raw |
In response to | Re: UniqueKey v2 (Andy Fan <zhihuifan1213@163.com>) |
Responses |
Re: UniqueKey v2
Re: UniqueKey v2 |
List | pgsql-hackers |
Andy Fan <zhihuifan1213@163.com> wrote: > > * I think that, before EC is considered suitable for an UK, its ec_opfamilies > > field needs to be checked. I try to do that in > > find_ec_position_matching_expr(), see 0004. > > Could you make the reason clearer for adding 'List *opfamily_lists;' > into UniqueKey? You said "This is needed to create ECs in the parent > query if the upper relation represents a subquery." but I didn't get the > it. Since we need to maintain the UniqueKey in the many places, I'd like > to keep it as simple as possbile. Of course, anything essentical should > be added for sure. If unique keys are generated for a subquery output, they also need to be created for the corresponding relation in the upper query ("sub" in the following example): select * from tab1 left join (select * from tab2) sub; However, to create an unique key for "sub", you need an EC for each expression of the key. And to create an EC, you in turn need the list of operator families. Even if the parent query already had ECs for the columns of "sub" which are contained in the unique key, you need to make sure that those ECs are "compatible" with the ECs of the subquery which generated the unique key. That is, if an EC of the subquery considers certain input values equal, the EC of the parent query must also be able to determine if they are equal or not. > > * RelOptInfo.notnullattrs > > > > My understanding is that this field contains the set attributes whose > > uniqueness is guaranteed by the unique key. They are acceptable because they > > are either 1) not allowed to be NULL due to NOT NULL constraint or 2) NULL > > value makes the containing row filtered out, so the row cannot break > > uniqueness of the output. Am I right? > > > > If so, I suggest to change the field name to something less generic, maybe > > 'uniquekey_attrs' or 'uniquekey_candidate_attrs', and adding a comment that > > more checks are needed before particular attribute can actually be used in > > UniqueKey. > > I don't think so, UniqueKey is just one of the places to use this > not-null property, see 3af704098 for the another user case of it. > > (Because of 3af704098, we should leverage notnullattnums somehow in this > patch, which will be included in the next version as well). In your patch you modify 'notnullattrs' in add_base_clause_to_rel(), but that does not happen to 'notnullattnums' in the current master branch. Thus I think that 'notnullattrs' is specific to the unique keys feature, so the field name should be less generic. > > > > * uniquekey_useful_for_merging() > > > > How does uniqueness relate to merge join? In README.uniquekey you seem to > > point out that a single row is always sorted, but I don't think this > > function is related to that fact. (Instead, I'd expect that pathkeys are > > added to all paths for a single-row relation, but I'm not sure you do that > > in the current version of the patch.) > > The merging is for "mergejoinable join clauses", see function > eclass_useful_for_merging. Usually I think it as operator "t1.a = t2.a"; My question is: why is the uniqueness important specifically to merge join? I understand that join evaluation can be more efficient if we know that one input relation is unique (i.e. we only scan that relation until we find the first match), but this is not specific to merge join. > > * is_uniquekey_useful_afterjoin() > > > > Now that my patch (0004) allows propagation of the unique keys from a > > subquery to the upper query, I was wondering if the UniqueKey structure > > needs the 'use_for_distinct field' I mean we should now propagate the unique > > keys to the parent query whether the subquery has DISTINCT clause or not. I > > noticed that the field is checked by is_uniquekey_useful_afterjoin(), so I > > changed the function to always returned true. However nothing changed in the > > output of regression tests (installcheck). Do you insist that the > > 'use_for_distinct' field is needed? I miss your answer to this comment. > > * uniquekey_contains_multinulls() > > > > ** Instead of calling the function when trying to use the UK, how about > > checking the ECs when considering creation of the UK? If the tests fail, > > just don't create the UK. > > I don't think so since we maintain the UniqueKey from bottom to top, you > can double check if my reason is appropriate. > > CREATE TABLE t1(a int); > CREATE INDEX ON t1(a); > > SELECT distinct t1.a FROM t1 JOIN t2 using(a); > > We need to create the UniqueKey on the baserel for t1 and the NULL > values is filtered out in the joinrel. so we have to creating it with > allowing NULL values first. ok > > ** What does the 'multi' word in the function name mean? > > multi means multiple, I thought we use this short name in the many > places, for ex bt_multi_page_stats after a quick search. Why not simply uniquekey_contains_nulls() ? Actually I wouldn't say that an instance of UniqueKey contains any value (NULL or NOT NULL) because it describes the whole relation rather than particular row. I consider UniqueKey to be a set of expressions. How about uniquekey_expression_nullable() ? > > > > > > * Combining the UKs > > > > IMO this is the most problematic part of the patch. You call > > populate_joinrel_uniquekeys() for the same join multiple times, > > Why do you think so? The below code is called in "make_join_rel" Consider join of tables "a", "b" and "c". My understanding is that make_join_rel() is called once with rel1={a} and rel2={b join c}, then with rel1={a join b} and rel2={c}, etc. I wanted to say that each call should produce the same set of unique keys. I need to check this part more in detail. > Is your original question is about populate_joinrel_uniquekey_for_rel > rather than populate_joinrel_uniquekeys? We have the below codes: > > outeruk_still_valid = populate_joinrel_uniquekey_for_rel(root, joinrel, outerrel, > innerrel, restrictlist); > inneruk_still_valid = populate_joinrel_uniquekey_for_rel(root, joinrel, innerrel, > outerrel, restrictlist); > > This is mainly because of the following theory. Quoted from > README.uniquekey. Let's called this as "rule 1". > > """ > How to maintain the uniquekey? > ------------------------------- > .. From the join relation, it is maintained with two rules: > > - the uniquekey in one side is still unique if it can't be duplicated > after the join. for example: > > SELECT t1.pk FROM t1 JOIN t2 ON t1.a = t2.pk; > UniqueKey on t1: t1.pk > UniqueKey on t1 Join t2: t1.pk > """ > > AND the blow codes: > > > if (outeruk_still_valid || inneruk_still_valid) > > /* > * the uniquekey on outers or inners have been added into joinrel so > * the combined uniuqekey from both sides is not needed. > */ > return; > > > We don't create the component uniquekey if any one side of the boths > sides is unique already. For example: > > "(t1.id) in joinrel(t1, t2) is unique" OR "(t2.id) in joinrel is > unique", there is no need to create component UniqueKey (t1.id, t2.id); ok, I need to check more in detail how this part works. > > > > Of course one problem is that the number of combinations can grow > > exponentially as new relations are joined. > > Yes, that's why "rule 1" needed and "How to reduce the overhead" in > UniqueKey.README is introduced. What if we are interested in unique keys of a subquery, but the subquery has no DISTINCT clause? > > > > 2) Check if the join relation is single-row > > > > I in order to get rid of the dependency on 'restrictlist', I think you can > > use ECs. Consider a query from your regression tests: > > > > CREATE TABLE uk_t (id int primary key, a int not null, b int not null, c int, d int, e int); > > > > SELECT distinct t1.d FROM uk_t t1 JOIN uk_t t2 ON t1.e = t2.id and t1.id = 1; > > > > The idea here seems to be that no more than one row of t1 matches the query > > clauses. Therefore, if t2(id) is unique, the clause t1.e=t2.id ensures that > > no more than one row of t2 matches the query (because t1 cannot provide the > > clause with more than one input value of 'e'). And therefore, the join also > > produces at most one row. > > You are correct and IMO my current code are able to tell it is a single > row as well. > > 1. Since t1.id = 1, so t1 is single row, so t1.d is unqiuekey as a > consequence. > 2. Given t2.id is unique, t1.e = t2.id so t1's unqiuekey can be kept > after the join because of rule 1 on joinrel. and t1 is singlerow, so the > joinrel is singlerow as well. > > I'm interested with "get rid of the dependency on 'restrictlist', I > think you can use ECs.", let's see what we can improve. > > > > My theory is that relation is single-row if it has an UK such that each of > > its ECs meets at least one of the following conditions: > > > > a) contains a constant > > True. > > > > b) contains a column of a relation which has already been proven single-row. > > True, not sure if it is easy to tell. > > > b) is referenced by an UK of a relation which has already been proven > > single-row. > > I can't follow here... This is similar to EC containing a constant: if an EC is used by a single-row UK, all its member can only have a single value. > > > > I think that in the example above, an EC {t1.e, t2.id} should exist. So when > > checking whether 't2' is single-row, the condition b) cam be used: the UK of > > 't2' should reference the EC {t1.e, t2.id}, which in turn contains the > > column t1.e. And 't1' is unique because its EC meets the condition a). (Of > > course you need to check em_jdomain before you use particular EM.) > > I think the existing rule 1 for joinrel works well with the singlerow > case naturally, what can be improved if we add the theory you suggested > here? This is still the explanation of the idea how to mark join unique key as a single-row separately from the other logic. As noted above, I need to learn more about the unique keys of a join. -- Antonin Houska Web: https://www.cybertec-postgresql.com
pgsql-hackers by date: