Re: Is there value in having optimizer stats for joins/foreignkeys? - Mailing list pgsql-hackers
| From | Corey Huinker |
|---|---|
| Subject | Re: Is there value in having optimizer stats for joins/foreignkeys? |
| Date | |
| Msg-id | CADkLM=fRCC3m3DJPPgYEHOJy-+932Q4RMm3Vqz7BQ_JPFASPWg@mail.gmail.com Whole thread Raw |
| In response to | Re: Is there value in having optimizer stats for joins/foreignkeys? (Alexandra Wang <alexandra.wang.oss@gmail.com>) |
| List | pgsql-hackers |
I have indeed started by implementing MCV statistics for joins,
because I have not found a case for joins that would benefit only from
ndistinct or functional dependency stats that MCV stats wouldn't help.
That was a big question I had, and I agree that we should only add statistics as uses for them become apparent.
In my POC patch, I've made the following catalog changes:- Add stxotherrel (oid) and stxjoinkeys (int2vector) fields to pg_statistic_ext
- Use the existing stxkeys (int2vector) to store the stats object attributes of stxotherrel
- Create pg_statistic_ext_otherrel_index on (stxrelid, stxotherrel)
- Add stxdjoinmcv (pg_join_mcv_list) to pg_statistic_ext_data
I like all these changes. Maybe "outer" rel rather than "other" rel, but it really doesn't matter this early on.
To use them, we can let the planner detect patterns like this:/*
* JoinStatsMatch - Information about a detected join pattern
* Used internally to track what was matched in a join+filter pattern
*/
typedef struct JoinStatsMatch
{
Oid target_rel; /* table OID of the estimation target */
AttrNumber targetrel_joinkey; /* target_rel's join column */
Oid other_rel; /* table OID of the filter source */
AttrNumber otherrel_joinkey; /* other_rel's join column */
List *filter_attnums; /* list of AttrNumbers for filter columns in other_rel */
List *filter_values; /* list of Datum constant values being filtered */
Oid collation; /* collation for comparisons */
/* Additional info to avoid duplicate work */
List *join_rinfos; /* list of join clause RestrictInfos */
RestrictInfo *filter_rinfo; /* the filter clause RestrictInfo */
} JoinStatsMatch;and add the detection logic in clauselist_selectivity_ext() and get_foreign_key_join_selectivity().Statistics collection indeed needs the most thinking. For the
purpose of a POC, I added MCV join stats collection as part of ANALYZE
of one table (stxrel in pg_statistic_ext). I can do this because MCV
join stats are somewhat asymmetric. It allows me to have a target
table (referencing table for foreign key join) to ANALYZE, and we can
use the already collected MCVs of the joinkey column on the target
table to query the rows in the other table. This greatly mitigates
performance impact compared to actually joining two tables. However,
if we are to support more complex joins or other types of join stats
such as ndistinct or functional dependency, I found it hard to define
who's the target table (referencing table) and who's the other table
(referenced table) outside of the foreign key join scenario. So I
think for those more complex cases eventually we may as well
perform the joins and collect the join stats separately. Alvaro
Herrera suggested offline that we could have a dedicated autovacuum
command option for collecting the join statistics.
I agree, we have to perform the joins, as the rows collected are inherently dependent and skewed, and yes, it's going to be a maintenance overhead hit.
Initially I thought this could be mitigated somewhat by retaining rowsamples for each table, but those row samples would be independent of the joins, and the values we need are inherently dependent.
I have experimented with two ways to define the join statistics:1. Use CREATE STATISTICS:CREATE STATISTICS [ [ IF NOT EXISTS ] statistics_name ] [ ( mcv ) ] ON { table_name1.column_name1 }, { table_name1.column_name2 } [, ...] FROM table_name1 JOIN table_name2 ON table_name1.column_name3 = table_name2.column_name4
We'll need to support aliases because there could be a self-join :(
Examples:-- Create join MCV statistics on a single filter column (keyword)CREATE STATISTICS movie_keyword_keyword_join_stats (mcv)ON k.keywordFROM movie_keyword mk JOIN keyword k ON (mk.keyword_id = k.id);ANALYZE movie_keyword;-- Create join MCV statistics on multiple filter columns (keyword + phonetic_code):CREATE STATISTICS movie_keyword_keyword_multicols_join_stats (mcv)ON k.keyword, k.phonetic_codeFROM movie_keyword mk JOIN keyword k ON (mk.keyword_id = k.id);ANALYZE movie_keyword;
This is where the existing CREATE STATISTICS syntax does not serve our purposes well. We definitely want MCV stats for both of those k.* columns with the skew of values that join to move_keyword on that defined foreign key, but we'd end up getting the _combinations_ of keyword, phonetic_code, which we don't necessarily care about.
We might want an alternate syntax
CREATE STATISTICS movie_keyword_keyword_multicols_join_stats (mcv)
ON keyword, phonetic_code
ON keyword, phonetic_code
[FROM movie_keyword]
USING movie_keyword_fk;
USING movie_keyword_fk;
2. Auto join stats creation for Foreign Key + Functional Dependency statsInitially, I did not implement the CREATE TABLE STATISTICS command to
create the join stats. Instead, I’ve implemented logic in ANALYZE to
detect functional dependency stats on the referenced table through FKs
and create join statistics implicitly for those cases.
pgsql-hackers by date: