Re: Implement missing join selectivity estimation for range types - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: Implement missing join selectivity estimation for range types
Msg-id: 8afecd87-d1e5-241c-5e3e-75e1c62c279b@enterprisedb.com
In response to: Implement missing join selectivity estimation for range types (Mahmoud Sakr <mahmoud.sakr@ulb.be>)
List: pgsql-hackers
Hello Mahmoud,

Thanks for the patch and sorry for not taking a look earlier.

On 6/30/22 16:31, Mahmoud Sakr wrote:
> Hi,
> Given a query:
>     SELECT * FROM t1, t2 WHERE t1.r << t2.r
> where t1.r, t2.r are of range type,
> currently PostgreSQL will estimate a constant selectivity for the <<
> predicate, which is equal to 0.005, not utilizing the statistics that
> the optimizer collects for range attributes.
>
> We have worked out a theory for inequality join selectivity estimation
> (http://arxiv.org/abs/2206.07396), and implemented it for range types
> in this patch.

Interesting. Are there any particular differences compared to how we
estimate for example range clauses on regular columns?

> The algorithm in this patch re-uses the currently collected statistics
> for range types, which is the bounds histogram. It works fairly
> accurately for the operations <<, >>, &&, &<, &>, <=, >=, with an
> estimation error of about 0.5%.

Right. I think 0.5% is roughly expected for the default statistics
target, which creates 100 histogram bins, each representing ~1% of the
values. Which on average means ~0.5% error.

> The patch also implements selectivity estimation for the operations
> @>, <@ (contains and is contained in), but their accuracy is not
> stable, since the bounds histograms assume independence between the
> range bounds. A point to discuss is whether or not to keep these last
> two operations.

That's a good question. I think the independence assumption is rather
foolish in this case, so I wonder if we could "stabilize" this by making
some different - less optimistic - assumption. Essentially, we have
estimates for the lower/upper boundaries:

  P1 = P(lower(var1) <= lower(var2))
  P2 = P(upper(var2) <= upper(var1))

and independence means we take (P1*P2). But maybe we should be very
pessimistic and use e.g. Min(P1,P2)? Or maybe something in between?

Another option is to use the length histogram, right?
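For illustration, the gap between these combination rules is easy to see with a toy calculation. This is just a sketch in plain Python (not patch code, and the probability values are made up); Min(P1,P2) is an upper bound on the joint probability, so the true containment selectivity always lies between the independence product and the Min:

```python
import math

# P1 = P(lower(var1) <= lower(var2)), P2 = P(upper(var2) <= upper(var1)).
# Three ways to combine them into a containment selectivity estimate.

def independence(p1, p2):
    # current assumption: the two bound comparisons are independent
    return p1 * p2

def pessimistic(p1, p2):
    # P(A and B) <= min(P(A), P(B)) always holds, so this never underestimates
    return min(p1, p2)

def geometric_mean(p1, p2):
    # one possible "something in between" the two extremes
    return math.sqrt(p1 * p2)

for p1, p2 in [(0.5, 0.5), (0.9, 0.1), (0.2, 0.8)]:
    print(p1, p2,
          independence(p1, p2),
          pessimistic(p1, p2),
          round(geometric_mean(p1, p2), 3))
```

When P1 and P2 are both moderate the estimates stay close, but for skewed pairs like (0.9, 0.1) the independence product (0.09) and the Min (0.1) happen to agree while the real answer could be anywhere in between or above, which is where the instability comes from.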
I mean, we know what the average length is, and it should be possible
to use that to calculate how "far" ranges in a histogram can overlap.

> The patch also includes the selectivity estimation for multirange
> types, treating a multirange as a single range which is its bounding
> box.

OK. But ideally we'd cross-check elements of the two multiranges, no?

> The same algorithm in this patch is applicable to inequality joins of
> scalar types. We, however, don't implement it for scalars, since more
> work is needed to make use of the other statistics available for
> scalars, such as the MCV. This is left as future work.

So if the column(s) contain a couple of very common (multi)ranges that
make it into an MCV, we'll ignore those? That's a bit unfortunate,
because those MCV elements are potentially the main contributors to
selectivity.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
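To make the bounding-box concern concrete, here is a toy sketch (plain Python with made-up ranges, not the patch's code): two multiranges whose bounding boxes overlap may have no elements that actually overlap, so the bounding-box approximation can only overestimate && selectivity:

```python
# Toy illustration: bounding-box vs element-wise overlap for multiranges.
# A range is a half-open (lower, upper) pair; a multirange is a list of them.

def ranges_overlap(a, b):
    return a[0] < b[1] and b[0] < a[1]

def bounding_box(mr):
    # collapse a multirange into the single range spanning all its elements
    return (min(lo for lo, _ in mr), max(hi for _, hi in mr))

def mr_overlap_exact(mr1, mr2):
    # cross-check every element pair, the "ideal" answer
    return any(ranges_overlap(r1, r2) for r1 in mr1 for r2 in mr2)

mr1 = [(0, 10), (90, 100)]   # gap between 10 and 90
mr2 = [(40, 60)]             # sits entirely inside mr1's gap

print(ranges_overlap(bounding_box(mr1), bounding_box(mr2)))  # True: boxes overlap
print(mr_overlap_exact(mr1, mr2))                            # False: no element overlaps
```

How much this matters in practice presumably depends on how "gappy" typical multiranges in the column are; for multiranges with few, tightly packed elements the bounding box is a decent proxy.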