Re: Planner mis-estimation using nested loops followup - Mailing list pgsql-performance

From Stephen Denne
Subject Re: Planner mis-estimation using nested loops followup
Date
Msg-id F0238EBA67824444BC1CB4700960CB4804EAC54F@dmpeints002.isotach.com
In response to Re: Planner mis-estimation using nested loops followup  ("Scott Marlowe" <scott.marlowe@gmail.com>)
List pgsql-performance
Scott Marlowe wrote:
> On Tue, Mar 18, 2008 at 9:58 AM, Chris Kratz
> <chris.kratz@vistashare.com> wrote:
> > Y, turning nested loops off in specific cases has increased
> performance
> > greatly.  It didn't fix the planner mis-estimation, just
> the plan it chose.
> > It's certainly not a panacea, but it's something we now try
> early on when
> > trying to speed up a query that matches these characteristics.
>
> I have to admit I've had one or two reporting queries in the past that
> turning off nested_loop was the only reasonable fix due to
> misestimation.  I'd tried changing the stats targets etc and nothing
> really worked reliably to prevent the nested_loop from showing up in
> the wrong places.

One cause of planner mis-estimation I've seen quite frequently is when there are a number of predicates on the data that filter the results in roughly the same manner. PostgreSQL, not knowing that the filters are highly correlated, multiplies the "fraction of selected rows" together.

Making up an example using pseudo-code, if this is one of the subqueries:

select * from orders where
order_date is recent
and
order_fulfilled is false

Suppose it is used in an application where the unfulfilled orders are exactly the recent ones.

If PostgreSQL estimates that 1% of the orders are recent, and 1% are unfulfilled, then it will assume that 0.01% are both recent and unfulfilled. If in reality the figure is more like 0.9%, then your actual row count will be 90 times the estimate.
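Spelled out with the numbers above (a one-million-row orders table is assumed purely for illustration):

```sql
-- Independence assumption: 1% * 1% = 0.01% of rows expected.
SELECT 1000000 * 0.01 * 0.01 AS estimated_rows,  -- 100 rows estimated
       1000000 * 0.009       AS actual_rows;     -- 9000 rows, 90x the estimate
```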

The only kind of simple behind-the-scenes fix for these situations that I know of is to add more indexes (such as a partial index on order_date where order_fulfilled is false), which slows down all your updates, and only works for the simplest situations.

A general fix would need to calculate, store, and look up a huge amount of correlation data, probably equal to the square of the number of rows in pg_stats, though this could possibly be generated as needed.

Perhaps the ANALYZE command could be extended to accept a command like:
ANALYZE CARTESIAN CORRELATION orders(order_date, order_fulfilled);
which would store the fraction for each combination of most frequent values and domain buckets from order_date and order_fulfilled.
The difficulty is whether the planner can quickly and easily determine whether appropriate correlation data exists for the query plan it is estimating.
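Lacking such a feature, the joint fractions can at least be measured by hand. This is a sketch of the kind of data the proposed command would gather, bucketing order_date by month purely as an illustration:

```sql
-- Observed joint fraction for each (month, fulfilled) combination:
-- roughly the correlation data the planner would need to look up.
SELECT date_trunc('month', order_date) AS order_month,
       order_fulfilled,
       count(*) / sum(count(*)) OVER () AS fraction
FROM orders
GROUP BY 1, 2
ORDER BY 1, 2;
```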

Regards,
Stephen Denne.

Disclaimer:
At the Datamail Group we value team commitment, respect, achievement, customer focus, and courage. This email with any attachments is confidential and may be subject to legal privilege.  If it is not intended for you please advise by reply immediately, destroy it and do not copy, disclose or use it in any way.


