As our queries are generated I decided to create a peephole optimizer kind of thing to scan the generated SQL AST to find multiple conditions on the same table reference. I can then use our metadata to see if these references are expected to be correlated. This creates about 20 statistics sets, including the one you have indicated. This at least makes the problematic query have a stable and very fast plan (so far). I had hoped for some more improvement with other queries but that has not yet been evident ;)
On Mon, Dec 19, 2022 at 09:10:27PM +0100, Frits Jalvingh wrote: > @justin > > I tried the create statistics variant and that definitely improves the > estimate, and with that one of the "bad" cases (the one with the 82 minute > plan) now creates a good plan using only a few seconds. > That is a worthwhile path to follow. A bit hard to do, because those > conditions can be anything, but I can probably calculate the ones used per > customer and create those correlation statistics from that... It is > definitely better than tweaking the "poor man's query hints" enable_xxxx > 8-/ which is really not helping with plan stability either. > > That will be a lot of work, but I'll let you know the results ;)
Yeah, if the conditions are arbitrary, then it's going to be more difficult. Hopefully you don't have too many columns. :)
I suggest enabling autoexplain and monitoring for queries which were slow, and retroactively adding statistics to those columns which are most-commonly queried, and which have correlations (which the planner doesn't otherwise know about).
You won't want to have more than a handful of columns in a stats object (since it requires factorial(N) complexity), but you can have multiple stats objects with different combinations of columns (and, in v14, expressions). You can also set a lower stats target to make the cost a bit lower.
You could try to check which columns are correlated, either by running: | SELECT COUNT(1),col1,col2 FROM tbl GROUP BY 2,3 ORDER BY 1; for different combinations of columns.
Or by creating a tentative/experimental stats object on a handful of columns at a time for which you have an intuition about their correlation, and then checking the calculated dependencies FROM pg_stats_ext. You may need to to something clever to use that for arbitrarily columns. Maybe this is a start. | SELECT dep.value::float, tablename, attnames, dep.key, exprs FROM (SELECT (json_each_text(dependencies::text::json)).* AS dep, * FROM pg_stats_ext)dep WHERE dependencies IS NOT NULL ORDER BY 1 DESC ; -- AND regexp_count(key, ',') < 2