Re: Planner mis-estimation using nested loops followup - Mailing list pgsql-performance
From | Stephen Denne |
---|---|
Subject | Re: Planner mis-estimation using nested loops followup |
Date | |
Msg-id | F0238EBA67824444BC1CB4700960CB4804EAC54F@dmpeints002.isotach.com Whole thread Raw |
In response to | Re: Planner mis-estimation using nested loops followup ("Scott Marlowe" <scott.marlowe@gmail.com>) |
List | pgsql-performance |
Scott Marlowe wrote > On Tue, Mar 18, 2008 at 9:58 AM, Chris Kratz > <chris.kratz@vistashare.com> wrote: > > Y, turning nested loops off in specific cases has increased > performance > > greatly. It didn't fix the planner mis-estimation, just > the plan it chose. > > It's certainly not a panacea, but it's something we now try > early on when > > trying to speed up a query that matches these characteristics. > > I have to admit I've had one or two reporting queries in the past that > turning off nested_loop was the only reasonable fix due to > misestimation. I'd tried changing the stats targets etc and nothing > really worked reliably to prevent the nested_loop from showing up in > the wrong places. One cause of planner mis-estimation I've seen quite frequently is when there are a number of predicates on the data thatfilter the results in roughly the same manner. PostgreSQL, not knowing that the filters are highly correlated, multipliesthe "fraction of selected rows" together. Making up an example using pseudo-code, if this is one of the subqueries: select * from orders where order_date is recent and order_fulfilled is false Used in an application where the unfulfilled orders are the recent ones. If postgresql estimates that 1% of the orders are recent, and 1% are unfulfilled, then it will assume that 0.01% are bothrecent and unfulfilled. If in reality it's more like 0.9%, and your actual row count will be 90 times your estimate. The only kind of simple behind-the-scenes fix for these situations that I know of is to add more indexes (such as a partialindex on order_date where order_fulfilled is false), which slows down all your updates, and only works for the simplestsituations. A general fix would need to calculate, store, and lookup a huge amount of correlation data. Probably equal to the squareof the number of rows in pg_stats, though this could possibly be generated as needed. Perhaps if the analyze command was extended to be able to take a command line like: ANALYZE CARTESIAN CORRELATION orders(order_date,order_fulfilled); which stores the fraction for each combination of most frequent value, and domain buckets from order_date and order_fulfilled. The difficulty is whether the planner can quickly and easily determine whether appropriate correlation data exists for thequery plan it is estimating. Regards, Stephen Denne. Disclaimer: At the Datamail Group we value team commitment, respect, achievement, customer focus, and courage. This email with any attachmentsis confidential and may be subject to legal privilege. If it is not intended for you please advise by reply immediately,destroy it and do not copy, disclose or use it in any way. __________________________________________________________________ This email has been scanned by the DMZGlobal Business Quality Electronic Messaging Suite. Please see http://www.dmzglobal.com/dmzmessaging.htm for details. __________________________________________________________________
pgsql-performance by date: