Re: Is there value in having optimizer stats for joins/foreignkeys? - Mailing list pgsql-hackers
| From | Corey Huinker |
|---|---|
| Subject | Re: Is there value in having optimizer stats for joins/foreignkeys? |
| Date | |
| Msg-id | CADkLM=eAc5DCYUh4fc7eToMuPdNA7B_9fthXDsR8QVNZyuo+Dw@mail.gmail.com Whole thread Raw |
| In response to | Re: Is there value in having optimizer stats for joins/foreignkeys? (Alexandra Wang <alexandra.wang.oss@gmail.com>) |
| List | pgsql-hackers |
Current Limitations
-------------------
This is a proof of concept. Known limitations include:
I really like this proof of concept.
1. The current catalog design is not ideal. It is asymmetric (a
"primary" and an "other" table), which is natural for FK-like joins,
but less intuitive for other joins
I think the asymmetry comes with the territory, and we will be creating the join statistics that prove useful. If that means that we create one object ON b.c1, b.c2 FROM a JOIN b... and another ON a.c3, a.c4 FROM b JOIN a... then so be it.
.
2. Stats collection piggybacks on ANALYZE of the primary table and
uses its single-column MCV for the join key. This can be inaccurate
when the MCV values on the "primary" side don't cover the important
values on the other side, or when the filter column isn't fully
dependent on the join key. A more accurate approach would execute the
actual join during collection, which could also decouple join stats
collection from single-table ANALYZE.
Unfortunately, I think we will have to join the remote table_b to the row sample on table_a to get accurate join statistics, and the best time to do that when we already have the row sample from table_a. We can further batch up statistics objects that happen to join table_a to table_b by the same join criteria to avoid rescans.
Will what you have work when we want to do an MCV on a mix of local and remote columns, or will that require more work?
3. Currently limited to: equality join clauses, equality and IN filter
clauses, simple Var stats objects (no expressions), inner joins only,
and two-way joins only. Some of these are easier to extend; others may
be harder or unnecessary (like n-way joins).
I suspect that n-way joins would have very limited utility and could be adequately covered with multiple join-stats objects.
4. Patch 0002 (auto-creation from FK constraints) should probably be
gated behind a GUC. I'm not strongly attached to this patch, but kept
it because FK joins seem like a natural and common use case.
I think this is like indexing, just because you can make all possible columns indexed doesn't mean you should. Tooling will emerge to determine what join stats objects are worth their weight, and create only those objects.
If there's interest, I'm happy to continue iterating on the design.
In particular, I'd welcome feedback on:
- whether this is a direction worth pursuing,
Yes. Very much so.
- the catalog design,
I had somehow gotten the impression that you were going to take the extended statistics format, but store individual columns in a modified pg_statistic. That's working for now, but I wonder if that will still be the case when we start to try this for columns that are arrays, ranges, multiranges, tsvectors, etc. pg_statistic has the infrastructure to handle those, but there may be good reason to keep pg_statistic focused on local attributes and instead just keep adding new kinds to extended statistic and let it be the grab-bag it was perhaps always meant to be.
In my own musings on how to implement this (which you have far exceeded with this proof-of-concept), I had wondered how to stats for k.keyword and k.phonetic_code individually from the definition of movie_keywords2_multi_stats, but looking at what you've done I think we're better of with defining each statistic object very narrowly, and that means we define one object per remote column and one object per interesting combination of columns, then so be it. So long as we can calculate them all from the same join of the two tables, we'll avoid the nasty overhead.
- and scope (what kinds of joins / predicates are worth supporting).
between-ish clauses ( x>= y AND x < z), etc is what immediately comes to mind.
element_mcv for array types might be next, but that's well down the road.
pgsql-hackers by date: