Re: Is there value in having optimizer stats for joins/foreignkeys? - Mailing list pgsql-hackers
| From | Corey Huinker |
|---|---|
| Subject | Re: Is there value in having optimizer stats for joins/foreignkeys? |
| Date | |
| Msg-id | CADkLM=frPPF_T6ym=MNU2gsWBUQ0-8XYkL4ySmORRB9R3QP6gQ@mail.gmail.com Whole thread Raw |
| In response to | Re: Is there value in having optimizer stats for joins/foreignkeys? (Tomas Vondra <tomas@vondra.me>) |
| List | pgsql-hackers |
Do I understand correctly you propose to collect such stats for every
foreign key? I recall something like that was proposed in the past, and
the argument against was that for many joins it'd be a waste because the
estimates are good enough. And for OLTP systems that's probably true.
Not every foreign key, they'd be declared like CREATE STATISTICS, but would be anchored to the constraint, not to the table.
But I always assumed we'd have a way to explicitly enable such stats for
certain joins only, and the extended stats were designed to make that
possible.
That's the intention, but the stats stored don't quite "fit" in the buckets that extended stats create. The attribute statistics seem much better suited, as this isn't about combinations, there's only ever the one combination, but rather about what can be known about the attributes in the far table before doing the actual join.
FWIW I'm not entirely sure what stats you propose to collect exactly. I
mean, what does
... associated with t1 and weighted according to the frequency of
that row being referenced, which means that values of unreferenced
rows are filtered out entirely.
mean? Are you suggesting to "do the join" and build the regular stats as
if that was a regular table? I think that'd work, and it's mostly how I
envisioned to handle joins in extended stats, restricted to joins of two
relations.
Right. We'd do the join from t1 to t2 as described earlier, and then we'd judge the null_frac, mcv, etc for each column of t2 (as defined by the scope of the stats declaration) according to the join. More commonly referenced values would show up as more frequent, hence "weighted".
Just so I have an example to refer to later, say we have a table of colors:
CREATE TABLE color(id bigint primary key, color_name text unique, color_family text null)
and there's hundreds of colors in the table that are color_family='red' ('fire engine red', 'candy apple red', 'popular muppet red', etc). Some colors don't belong to any color_family.
And we have a table of toys:
CREATE TABLE toy(id bigint primary key, min_child_age integer, name text, color_id bigint REFERENCES color)
CREATE TABLE toy(id bigint primary key, min_child_age integer, name text, color_id bigint REFERENCES color)
And we declare a join stat on toy->color for the color_family attribute. We'd sample rows from the toy table, left join those to color, and then calculate the attribute stats of color_family as if it were a column in toys. Some toys might not have a color_id, and some color_ids might not belong to a color_family, so we'd want the null_frac to reflect those combined conditions. For the values that do join, and the colors that do belong to a family, we'd want to see regular MCV stats showing "red" as the most common color_family.
But those stats aren't really a correlation or a dependency, they're just plain old attribute stats.
I understand wanting to know the correlation between toys.min_child_age and colors.color_family, so that makes perfect sense for extended statistics, but color_family on its own just doesn't fit. Am I missing something?
Combining independent per-table samples does not work, unless the
samples are huge. There's a nice paper [1] on how to do index-based join
sampling efficiently.
Thanks, now I've got some light reading for the flight home.
I think adding joins to extended stats would not be all that hard
(famous last words, I know). For me the main challenge was figuring out
how to store the join definition in the catalog, I always procrastinated
and never gave that a serious try.
I envisioned keying the stats off the foreign key constraint id, or adding "starefrelid" (relation oid of the referencing table) to pg_statistic or a table roughly the same shape as pg_statistic.
FWIW I think we might start by actually using per-table extended stats
on the joined tables. Just like we combine the scalar MCVs on joined
columns, we could combine multicolumn MVCs.
pgsql-hackers by date: