Hello,
Here is a patch to improve rowcount estimates for
`UNNEST(some_array_column)`. Today we hard code this to 10, but we
have statistics about array size, so it's easy to use them.
I've seen plans where this would make a difference. If the array has
only 1 or 2 elements, then overestimating the rowcount by 10 leads to
unnecessary seqscans downstream. I can see how an underestimate would
cause issues too.
This patch builds on a391ff3c3d41 which allowed set-returning
functions like UNNEST to include a support function to estimate their
result count. (There is a nice writeup at
https://www.cybertec-postgresql.com/en/optimizer-support-functions/)
But that patch only changes UNNEST if it has a Const or ArrayExpr
argument.
The statistic I'm using is the last value in the DECHIST array, which
is the average number of distinct elements in the array. Using the
plain (non-distinct) element count would be more accurate, but we
don't have that, and using distinct elements is still better than a
hardcoded 10.
The real change is in estimate_array_length, which has several callers
besides array_unnest_support, but I think this change should give more
accurate estimates for all of them.
There is a comment that estimate_array_length must agree with
scalararraysel. I don't think this commit introduces any
discrepancies. The most relevant case there is `scalar = ANY/ALL
(array)`, which also consults DECHIST (and/or MCELEM).
I wasn't sure where to put a test. I finally settled on arrays.sql
since (1) that has other UNNEST tests (2) array_unnest_support is in
util/adt/arrayfuncs.c (3) I couldn't find a place devoted to
rowcount/selectivity estimates. I'm happy to move it if someone has a
better idea!
Based on 712dc2338b23.
Yours,
--
Paul ~{:-)
pj@illuminatedcomputing.com