Thread: Planner mis-estimation using nested loops followup

Planner mis-estimation using nested loops followup

From
"Chris Kratz"
Date:
A number of weeks ago, I had posted a request for help regarding join estimates in pg 8.2.6.  In moderately complex to very complex ad hoc queries in our system, we were consistently having the system massively underestimate the number of rows coming out of a join at a low level, making these queries very slow and inefficient.  At times the mis-estimation was 1000:1, i.e. when it should have been 2000 returned rows from a join, the planner assumed 1 or 2 rows.  Raising the stats targets on the join columns up to the max made little difference (yes, we analyzed the tables in question after each change).  Since the planner sees only one row coming out of the low-level join, it uses nested loops all the way up the chain when it would be more efficient to use another join type.  In our informal testing, we found that by disabling nested loops and forcing other join types, we could get fantastic speedups.  The queries that seem to benefit most from this have a lot of sub-queries being built up into a final query set as well as a fair number of joins in the sub-queries.  Since these are user-created and then generated via our tools, they can be quite messy at times.

After doing this testing, we have since added some functionality to our ad hoc reporting tool that lets us tune individual queries by turning individual join types on and off at runtime.  As we hear of slow reports, we've been individually turning off nested loops on those reports.  Almost always, this has increased the performance of the reports, sometimes in a completely amazing fashion (many, many minutes down to seconds at times).  It of course doesn't help everything, and turning off nested loops globally causes an overall slowdown in other parts of the system.

As this has gone on over the last couple of weeks, it feels like we either have a misconfiguration on the server, or we are tickling a mis-estimation bug in the planner.  I'm hoping it's the former.  The db server has 8GB of memory and a RAID1 (WAL) / RAID10 (data) disk layout; the OS is Linux 2.6.9 and the db is 8.2.6.  The db is UTF-8, if that has any bearing, and autovacuum and the bgwriter are on.

Nondefault settings of interest from postgresql.conf

shared_buffers = 1024MB                 # min 128kB or max_connections*16kB
work_mem = 256MB                                # min 64kB
maintenance_work_mem = 256MB            # min 1MB
random_page_cost = 1.75                 # same scale as above
effective_cache_size = 4096MB
default_statistics_target = 100         # range 1-1000

If nothing else, perhaps this will help somebody else who has run into the same problem.  If EXPLAIN ANALYZE of a query shows a large mis-estimation of rows returned on a join (estimate=1, actual=2k) causing the planner to choose nested loops instead of another join type, you might try running the query with nested loops turned off and see if that helps with performance.
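
For example, a minimal sketch of how we test this (the query itself is whatever slow report is under investigation):

-- Disable nested loop joins for this session only, check the plan,
-- then restore the planner default.
SET enable_nestloop = off;

-- ... run EXPLAIN ANALYZE on the slow report query here and compare ...

RESET enable_nestloop;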

Thanks,

-Chris

Re: Planner mis-estimation using nested loops followup

From
"Joshua D. Drake"
Date:

On Tue, 18 Mar 2008 11:35:08 -0400
"Chris Kratz" <chris.kratz@vistashare.com> wrote:

> Nondefault settings of interest from postgresql.conf
> 
> 
> shared_buffers = 1024MB                 # min 128kB or max_connections*16kB
> work_mem = 256MB                                # min 64kB
> maintenance_work_mem = 256MB            # min 1MB
> random_page_cost = 1.75                 # same scale as above
> effective_cache_size = 4096MB
> default_statistics_target = 100         # range 1-1000
> 
> 
> If nothing else, perhaps this will help somebody else who has run
> into the same problem.  If explain analyze of a query shows a large
> mis-estimation of rows returned on a join (estimate=1, actual=2k)
> causing the planner to choose nested loops instead of another join
> type, you might try running the query with nested loops set to off
> and see if that helps w/ performance.

Did you try that? Did it work?

Joshua D. Drake


--
The PostgreSQL Company since 1997: http://www.commandprompt.com/ 
PostgreSQL Community Conference: http://www.postgresqlconference.org/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
      PostgreSQL political pundit | Mocker of Dolphins


Re: Planner mis-estimation using nested loops followup

From
"Chris Kratz"
Date:
Yes, turning nested loops off in specific cases has increased performance greatly.  It didn't fix the planner mis-estimation, just the plan it chose.  It's certainly not a panacea, but it's something we now try early on when trying to speed up a query that matches these characteristics.
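
When a report runs in its own transaction, the override can be scoped with SET LOCAL so it never leaks into other queries on the same connection (a sketch of the approach, not our actual tool code):

BEGIN;
-- Applies only until COMMIT/ROLLBACK; other reports on this connection
-- keep the default planner settings.
SET LOCAL enable_nestloop = off;

-- ... run the single slow report query here ...

COMMIT;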

-Chris

On 3/18/08, Joshua D. Drake <jd@commandprompt.com> wrote:
> Did you try that? Did it work?
>
> Joshua D. Drake

Re: Planner mis-estimation using nested loops followup

From
Matthew
Date:
On Tue, 18 Mar 2008, Chris Kratz wrote:
> In moderately complex to very complex ad hoc queries in our system, we
> were consistently having the system massively underestimate the number
> of rows coming out of join at a low level making these queries very slow
> and inefficient.

I have long thought that perhaps Postgres should be a little more cautious
about its estimates, and assume the worst case scenario sometimes, rather
than blindly following the estimates from the statistics. The problem is
that Postgres uses the statistics to generate best estimates of the cost.
However, it does not take into account the consequences of being wrong. If
it were more clever, it might decide to use an algorithm that is non-optimal
according to the best estimate, if the "optimal" algorithm could blow up to
1000 times the work when the estimates are off by a bit.

Such cleverness would be very cool, but (I understand) a lot of work. It
would hopefully solve this problem.

Matthew

--
<Taking apron off> And now you can say honestly that you have been to a
lecture where you watched paint dry.
         - Computer Graphics Lecturer

Re: Planner mis-estimation using nested loops followup

From
"Scott Marlowe"
Date:
On Tue, Mar 18, 2008 at 9:58 AM, Chris Kratz <chris.kratz@vistashare.com> wrote:
> Yes, turning nested loops off in specific cases has increased performance
> greatly.  It didn't fix the planner mis-estimation, just the plan it chose.
> It's certainly not a panacea, but it's something we now try early on when
> trying to speed up a query that matches these characteristics.

I have to admit I've had one or two reporting queries in the past for
which turning off nested_loop was the only reasonable fix due to
misestimation.  I'd tried changing the stats targets etc. and nothing
really worked reliably to prevent the nested_loop from showing up in
the wrong places.

Re: Planner mis-estimation using nested loops followup

From
"Stephen Denne"
Date:
Scott Marlowe wrote
> On Tue, Mar 18, 2008 at 9:58 AM, Chris Kratz
> <chris.kratz@vistashare.com> wrote:
> > Yes, turning nested loops off in specific cases has increased
> > performance greatly.  It didn't fix the planner mis-estimation, just
> > the plan it chose.  It's certainly not a panacea, but it's something
> > we now try early on when trying to speed up a query that matches
> > these characteristics.
>
> I have to admit I've had one or two reporting queries in the past for
> which turning off nested_loop was the only reasonable fix due to
> misestimation.  I'd tried changing the stats targets etc. and nothing
> really worked reliably to prevent the nested_loop from showing up in
> the wrong places.

One cause of planner mis-estimation I've seen quite frequently is when there are a number of predicates on the data
that filter the results in roughly the same manner. PostgreSQL, not knowing that the filters are highly correlated,
multiplies the "fraction of selected rows" together.

Making up an example using pseudo-code, if this is one of the subqueries:

select * from orders where
order_date is recent
and
order_fulfilled is false

This might be used in an application where the unfulfilled orders are the recent ones.

If PostgreSQL estimates that 1% of the orders are recent, and 1% are unfulfilled, then it will assume that 0.01% are
both recent and unfulfilled. If in reality it's more like 0.9%, your actual row count will be 90 times the
estimate.
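
As a rough way to see how far off the independence assumption is, the three fractions can be measured directly (a sketch only; the "recent" and "unfulfilled" predicates below are illustrative stand-ins):

-- Individual and combined selectivities.  The planner effectively assumes
-- combined_fraction ~ recent_fraction * unfulfilled_fraction
-- (0.01 * 0.01 = 0.0001), while the measured value here might be ~0.009.
SELECT avg((order_date > current_date - interval '30 days')::int) AS recent_fraction,
       avg((order_fulfilled = false)::int)                        AS unfulfilled_fraction,
       avg((order_date > current_date - interval '30 days'
            AND order_fulfilled = false)::int)                    AS combined_fraction
FROM orders;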

The only kind of simple behind-the-scenes fix for these situations that I know of is to add more indexes (such as a
partial index on order_date where order_fulfilled is false), which slows down all your updates, and only works for
the simplest situations.
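
For the example above, such a partial index might look like this (hypothetical names, assuming the orders table from the pseudo-code):

-- Index only the unfulfilled orders by date, so the common
-- "recent and unfulfilled" lookup can use a small, selective index.
CREATE INDEX orders_unfulfilled_order_date_idx
    ON orders (order_date)
    WHERE order_fulfilled = false;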

A general fix would need to calculate, store, and look up a huge amount of correlation data, probably equal to the
square of the number of rows in pg_stats, though this could possibly be generated as needed.

Perhaps the analyze command could be extended to accept a command like:
ANALYZE CARTESIAN CORRELATION orders(order_date, order_fulfilled);
which would store the fraction for each combination of most-frequent values and domain buckets from order_date and
order_fulfilled.
The difficulty is whether the planner can quickly and easily determine whether appropriate correlation data exists
for the query plan it is estimating.

Regards,
Stephen Denne.


Re: Planner mis-estimation using nested loops followup

From
KC ESL
Date:
At 00:24 08/03/19, Matthew wrote:
>On Tue, 18 Mar 2008, Chris Kratz wrote:
>>In moderately complex to very complex ad hoc queries in our system,
>>we were consistently having the system massively underestimate the
>>number of rows coming out of join at a low level making these
>>queries very slow and inefficient.
>
>I have long thought that perhaps Postgres should be a little more
>cautious about its estimates, and assume the worst case scenario
>sometimes, rather than blindly following the estimates from the
>statistics. The problem is that Postgres uses the statistics to
>generate best estimates of the cost. However, it does not take into
>account the consequences of being wrong. If it was more clever, then
>it may be able to decide to use a non-optimal algorithm according to
>the best estimate, if the optimal algorithm has the possibility of
>blowing up to 1000 times the work if the estimates are off by a bit.
>
>Such cleverness would be very cool, but (I understand) a lot of
>work. It would hopefully solve this problem.
>
>Matthew

Just a crazy thought. If Postgres could check its own estimates or
set some limits while executing the query and, if it found that the
estimates were way off, fall back to a less optimal plan immediately
or the next time, that would be cool.

KC