Re: Implement missing join selectivity estimation for range types - Mailing list pgsql-hackers
From | Mahmoud Sakr |
---|---|
Subject | Re: Implement missing join selectivity estimation for range types |
Date | |
Msg-id | CAB4o4asvPN=NT7KvS9zVQjZbdsiRW5t8aQctEkW7mxc4hbBxVQ@mail.gmail.com Whole thread Raw |
In response to | Re: Implement missing join selectivity estimation for range types (Tomas Vondra <tomas.vondra@enterprisedb.com>) |
Responses |
Re: Implement missing join selectivity estimation for range types
|
List | pgsql-hackers |
Hi Tomas, > I finally had time to properly read the paper today - the general > approach mostly matches how I imagined the estimation would work for > inequalities, but it's definitely nice to see the algorithm properly > formalized and analyzed. Awesome, thanks for this interest! > What seems a bit strange to me is that the patch only deals with range > types, leaving the scalar cases unchanged. I understand why (not having > a MCV simplifies it a lot), but I'd bet joins on range types are waaaay > less common than inequality joins on scalar types. I don't even remember > seeing inequality join on a range column, TBH. > > That doesn't mean the patch is wrong, of course. But I'd expect users to > be surprised we handle range types better than "old" scalar types (which > range types build on, in some sense). > > Did you have any plans to work on improving estimates for the scalar > case too? Or did you do the patch needed for the paper, and have no > plans to continue working on this? I fully agree. Scalars are way more important. I join you in the call for Diogo and Zhicheng to continue this implementation, specially after the interest you show towards this patch. The current patch was a course project (taught by me and Maxime), which was specific to range types. But indeed the solution generalizes well to scalars. Hopefully after the current exam session (Feb), there will be time to continue the implementation. Nevertheless, it makes sense to do two separate patches: this one, and the scalars. The code of the two patches is located in different files, and the estimation algorithms have slight differences. > I'm also wondering about not having MCV for ranges. I was a bit > surprised we don't build MCV in compute_range_stats(), and perhaps we > should start building those - if there are common ranges, this might > significantly improve some of the estimates (just like for scalar > columns). Which would mean the estimates for range types are just as > complex as for scalars. Of course, we don't do that now. Good question. Our intuition is that MCV will not be useful for ranges. Maxime has done an experiment and confirmed this intuition. Here is his experiment and explanation: Create a table with 126000 int4ranges. All ranges have their middle between 0 and 1000 and a length between 90 and 110. The ranges are created in the following way: - 10 different ranges, each duplicated 1000 times - 20 different ranges, each duplicated 500 times - 40 different ranges, each duplicated 100 times - 200 different ranges, each duplicated 10 times - 100000 different ranges, not duplicated Two such tables (t1 and t2) were created in the same way but with different data. Then he ran the following query: EXPLAIN ANALYZE SELECT count(*) FROM t, t2 WHERE t && t2 The results (using our patch) where the following: Plan rows: 2991415662 Actual rows: 2981335423 So the estimation accuracy is still fairly good for such data with a lot of repeated values, even without having MCV statistics. The only error that can happen in our algorithms comes from the last bin in which we assume uniform distribution of values. Duplicate values (end bound, not ranges) might make this assumption be wrong, which would create inaccurate estimations. However, this is still only incorrect for the last bin and all the others are correct. MCV's are mainly useful for equality, which is not an operation we cover, and probably not an important predicate for ranges. What do you think? Best regards, Mahmoud
pgsql-hackers by date: