Re: Is there value in having optimizer stats for joins/foreignkeys? - Mailing list pgsql-hackers

From Corey Huinker
Subject Re: Is there value in having optimizer stats for joins/foreignkeys?
Date
Msg-id CADkLM=fRCC3m3DJPPgYEHOJy-+932Q4RMm3Vqz7BQ_JPFASPWg@mail.gmail.com
Whole thread Raw
In response to Re: Is there value in having optimizer stats for joins/foreignkeys?  (Alexandra Wang <alexandra.wang.oss@gmail.com>)
List pgsql-hackers
I have indeed started by implementing MCV statistics for joins,
because I have not found a case for joins that would benefit only from
ndistinct or functional dependency stats that MCV stats wouldn't help.

That was a big question I had, and I agree that we should only add statistics as uses for them become apparent.
 

In my POC patch, I've made the following catalog changes:
- Add stxotherrel (oid) and stxjoinkeys (int2vector) fields to pg_statistic_ext
- Use the existing stxkeys (int2vector) to store the stats object attributes of stxotherrel
- Create pg_statistic_ext_otherrel_index on (stxrelid, stxotherrel)
- Add stxdjoinmcv (pg_join_mcv_list) to pg_statistic_ext_data

I like all these changes. Maybe "outer" rel rather than "other" rel, but it really doesn't matter this early on.
 

To use them, we can let the planner detect patterns like this:
/*
* JoinStatsMatch - Information about a detected join pattern
* Used internally to track what was matched in a join+filter pattern
*/
typedef struct JoinStatsMatch
{
Oid target_rel; /* table OID of the estimation target */
AttrNumber targetrel_joinkey; /* target_rel's join column */
Oid other_rel; /* table OID of the filter source */
AttrNumber otherrel_joinkey; /* other_rel's join column */
List *filter_attnums; /* list of AttrNumbers for filter columns in other_rel */
List *filter_values; /* list of Datum constant values being filtered */
Oid collation; /* collation for comparisons */

/* Additional info to avoid duplicate work */
List *join_rinfos; /* list of join clause RestrictInfos */
RestrictInfo *filter_rinfo; /* the filter clause RestrictInfo */
} JoinStatsMatch;
and add the detection logic in clauselist_selectivity_ext() and get_foreign_key_join_selectivity(). 

Statistics collection indeed needs the most thinking. For the
purpose of a POC, I added MCV join stats collection as part of ANALYZE
of one table (stxrel in pg_statistic_ext). I can do this because MCV
join stats are somewhat asymmetric. It allows me to have a target
table (referencing table for foreign key join) to ANALYZE, and we can
use the already collected MCVs of the joinkey column on the target
table to query the rows in the other table. This greatly mitigates
performance impact compared to actually joining two tables. However,
if we are to support more complex joins or other types of join stats
such as ndistinct or functional dependency, I found it hard to define
who's the target table (referencing table) and who's the other table
(referenced table) outside of the foreign key join scenario. So I
think for those more complex cases eventually we may as well 
perform the joins and collect the join stats separately. Alvaro
Herrera suggested offline that we could have a dedicated autovacuum
command option for collecting the join statistics.

I agree, we have to perform the joins, as the rows collected are inherently dependent and skewed, and yes, it's going to be a maintenance overhead hit.

Initially I thought this could be mitigated somewhat by retaining rowsamples for each table, but those row samples would be independent of the joins, and the values we need are inherently dependent.
 
I have experimented with two ways to define the join statistics:

1. Use CREATE STATISTICS:

CREATE STATISTICS [ [ IF NOT EXISTS ] statistics_name ] [ ( mcv ) ] ON { table_name1.column_name1 }, { table_name1.column_name2 } [, ...] FROM table_name1 JOIN table_name2 ON table_name1.column_name3 = table_name2.column_name4

We'll need to support aliases because there could be a self-join :(
 

Examples:
-- Create join MCV statistics on a single filter column (keyword)
CREATE STATISTICS movie_keyword_keyword_join_stats (mcv)
ON k.keyword 
FROM movie_keyword mk JOIN keyword k ON (mk.keyword_id = k.id); 
ANALYZE movie_keyword;

-- Create join MCV statistics on multiple filter columns (keyword + phonetic_code):
CREATE STATISTICS movie_keyword_keyword_multicols_join_stats (mcv)
ON k.keyword, k.phonetic_code
FROM movie_keyword mk JOIN keyword k ON (mk.keyword_id = k.id);
ANALYZE movie_keyword;

This is where the existing CREATE STATISTICS syntax does not serve our purposes well. We definitely want MCV stats for both of those k.* columns with the skew of values that join to move_keyword on that defined foreign key, but we'd end up getting the _combinations_ of keyword, phonetic_code, which we don't necessarily care about.

We might want an alternate syntax

CREATE STATISTICS movie_keyword_keyword_multicols_join_stats (mcv)
ON keyword, phonetic_code
[FROM movie_keyword]
USING movie_keyword_fk;

In this case, the FROM clause is redundant and therefore optional, the columns listed must exist on the confrelid and the object is keyed to the conrelid. Having said that, I think people don't really use constraint names and so the join syntax will likely be used more often.

2. Auto join stats creation for Foreign Key + Functional Dependency stats

Initially, I did not implement the CREATE TABLE STATISTICS command to
create the join stats. Instead, I’ve implemented logic in ANALYZE to
detect functional dependency stats on the referenced table through FKs
and create join statistics implicitly for those cases.

I'm not excited about this, and others have expressed concern that it would lead to an explosion of mediocre statistics objects.

pgsql-hackers by date:

Previous
From: Xuneng Zhou
Date:
Subject: Re: BUG: Cascading standby fails to reconnect after falling back to archive recovery
Next
From: Imran Zaheer
Date:
Subject: [WIP] Pipelined Recovery