From: Tomas Vondra
Subject: Re: WIP: multivariate statistics / proof of concept
Msg-id: 6a0b84cf-24e3-bf28-c4ac-b3ef33c3b55f@2ndquadrant.com
In response to: Re: WIP: multivariate statistics / proof of concept (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: WIP: multivariate statistics / proof of concept
List: pgsql-hackers
On 11/21/2016 11:10 PM, Robert Haas wrote:
> [ reviving an old multivariate statistics thread ]
>
> On Thu, Nov 13, 2014 at 6:31 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On 12 October 2014 23:00, Tomas Vondra <tv@fuzzy.cz> wrote:
>>
>>> It however seems to be working sufficiently well at this point, enough
>>> to get some useful feedback. So here we go.
>>
>> This looks interesting and useful.
>>
>> What I'd like to check before a detailed review is that this has
>> sufficient applicability to be useful.
>>
>> My understanding is that Q9 and Q18 of TPC-H have poor plans as a
>> result of multi-column stats errors.
>>
>> Could you look at those queries and confirm that this patch can
>> produce better plans for them?
>
> Tomas, did you ever do any testing in this area? One of my
> colleagues, Rafia Sabih, recently did some testing of TPC-H queries @
> 20 GB. Q18 actually doesn't complete at all right now because of an
> issue with the new simplehash implementation. I reported it to Andres
> and he tracked it down, but hasn't posted the patch yet - see
> http://archives.postgresql.org/message-id/20161115192802.jfbec5s6ougxwicp@alap3.anarazel.de
>
> Of the remaining queries, the slowest are Q9 and Q20, and both of them
> have serious estimation errors. On Q9, things go wrong here:
>
>   ->  Merge Join  (cost=5225092.04..6595105.57 rows=154 width=47)
>                   (actual time=103592.821..149335.010 rows=6503988 loops=1)
>         Merge Cond: (partsupp.ps_partkey = lineitem.l_partkey)
>         Join Filter: (lineitem.l_suppkey = partsupp.ps_suppkey)
>         Rows Removed by Join Filter: 19511964
>         ->  Index Scan using
> [snip]
>
>         Rows Removed by Filter: 756627
>
> The estimate for the index scan on partsupp is essentially perfect,
> and the lineitem-part join is off by about 3x. However, the merge
> join is off by about 4000x, which is real bad.
>

The patch only deals with statistics on base relations at this point - no
joins. It's meant to be extended in that direction, and the syntax already
allows for it, but for now that's all (there's an illustrative sketch of
what the patch does cover below, after the plans).

That being said, this particular estimate should be improved in 9.6, once
you create a foreign key between the two tables. In fact, that patch was
exactly about Q9. This is how the join estimate looks on scale 1 without
the FK between the two tables:

                              QUERY PLAN
-----------------------------------------------------------------------
 Merge Join  (cost=19.19..700980.12 rows=2404 width=261)
   Merge Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND
                (lineitem.l_suppkey = partsupp.ps_suppkey))
   ->  Index Scan using idx_lineitem_part_supp on lineitem
         (cost=0.43..605856.84 rows=6001117 width=117)
   ->  Index Scan using partsupp_pkey on partsupp
         (cost=0.42..61141.76 rows=800000 width=144)
(4 rows)

and with the foreign key:

                              QUERY PLAN
-----------------------------------------------------------------------
 Merge Join  (cost=19.19..700980.12 rows=6001117 width=261)
             (actual rows=6001215 loops=1)
   Merge Cond: ((lineitem.l_partkey = partsupp.ps_partkey) AND
                (lineitem.l_suppkey = partsupp.ps_suppkey))
   ->  Index Scan using idx_lineitem_part_supp on lineitem
         (cost=0.43..605856.84 rows=6001117 width=117)
         (actual rows=6001215 loops=1)
   ->  Index Scan using partsupp_pkey on partsupp
         (cost=0.42..61141.76 rows=800000 width=144)
         (actual rows=6001672 loops=1)
 Planning time: 3.840 ms
 Execution time: 21987.913 ms
(6 rows)
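For reference, the foreign key in question is the standard one defined by
the TPC-H spec between lineitem and partsupp (dbgen does not create it by
default, and the constraint name here is made up):

    -- standard TPC-H FK; the constraint name is arbitrary
    ALTER TABLE lineitem
        ADD CONSTRAINT lineitem_partsupp_fkey
        FOREIGN KEY (l_partkey, l_suppkey)
        REFERENCES partsupp (ps_partkey, ps_suppkey);

With the constraint in place, the planner knows each lineitem row matches
exactly one partsupp row, so the join estimate becomes roughly the size of
lineitem - which is where the corrected rows=6001117 comes from.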
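And to make clear what the patch itself covers: the statistics are defined
on columns of a single relation. Just to illustrate the general shape -
this is not necessarily the exact grammar, see the patch for that -
something like:

    -- illustrative only; see the patch for the exact syntax and
    -- the supported statistics types
    CREATE STATISTICS lineitem_keys ON l_partkey, l_suppkey FROM lineitem;

That helps with estimating clauses on correlated columns of one table, but
nothing feeds into join estimates yet.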
> On Q20, things go wrong here:
>
> [snip]
>
> The estimate for the GroupAggregate feeding one side of the merge join
> is quite accurate. The estimate for the part-partsupp join on the
> other side is off by 8x. Then things get much worse: the estimate for
> the merge join is off by 400x.
>

Well, most of the estimation error comes from the join, but sadly the
aggregate makes using the foreign keys impossible - at least in the
current version. I don't know whether that can be improved somehow.

> I'm not really sure whether the multivariate statistics stuff will fix
> this kind of case or not, but if it did it would be awesome.
>

Join statistics are something I'd like to add eventually, but I don't see
how that could happen in the first version of the patch. Also, the patch
received no reviews this CF, and making it even larger is unlikely to make
it more attractive.

regards

-- 
Tomas Vondra                      http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services