Re: An Idea for planner hints - Mailing list pgsql-hackers

From Mark Dilger
Subject Re: An Idea for planner hints
Date
Msg-id 44EC7752.20807@markdilger.com
Whole thread Raw
In response to Re: An Idea for planner hints  ("Jim C. Nasby" <jnasby@pervasive.com>)
Responses Re: An Idea for planner hints
List pgsql-hackers
Jim C. Nasby wrote:
> On Tue, Aug 22, 2006 at 11:56:17AM -0700, Mark Dilger wrote:
>> I proposed something like this quite a bit up-thread.  I was hoping we 
>> could have a mode in which the system would run the second, third, fourth, 
>> ... best plans rather than just the best looking one, and then determine 
>> from actual runtime statistics which was best.  (The proposal also included 
>> the ability to output the best plan and read that in at a later time in 
>> lieu of a SQL query, but that part of it can be ignored if you like.)  The 
>> posting didn't generate much response, so I'm not sure what people thought 
>> of it.  The only major problem I see is getting the planner to keep track 
>> of alternate plans.  I don't know the internals of it very well, but I 
>> think the genetic query optimizer doesn't have a concept of "runner-up #1", 
>> "runner-up #2", etc., which it would need to have.
> 
> I think the biggest issue is that you'd have to account for varying load
> on the box. If we assume that the database is the only thing running on
> the box, we might be able to do that by looking at things like how much
> IO traffic we generated (though of course OS caching will screw with
> that).
> 
> Actually, that's another issue... any plans run after the first one will
> show up as being artificially fast, since there will be a lot of extra
> cached data.

Yes, caching issues prevent you from using wall-clock time.  We could instrument 
the code to count the number of rows vs. the number predicted for each internal 
join, from which new cost estimates could be generated.

Perhaps you can check my reasoning for me:  I'm imagining a query which computes 
AxBxCxD, where A, B, C, and D are actual tables.  I'm also imagining that the 
planner always chooses AxB first, then joins on C, then joins on D.  (It does so 
because the single-table statistics suggest this as the best course of action.)  It might be that AxD is a really small
metatable,much smaller than would be 
 
estimated from the statistics for A independent of the statistics for D, but AxB 
is pretty much what you would expect given the independent statistics for A and 
B.  So we need some way for the system to stumble upon that fact.  If we only 
ever calculate cross-join statistics for plans that the system chooses, we will 
only discover that AxB is about the size we expected it to be.  So, if the 
actual size of AxB is nearly equal to the estimated size of AxB, the system will 
continue to choose the same plan in future queries, totally ignorant of the 
advantages of doing AxD first.

That last paragraph is my reasoning for suggesting that the system have a mode 
in which it runs the "runner-up #1", "runner-up #2", etc sorts of plans.  Such a 
mode could force it down alternate paths where it might pick up interesting 
statistics that it wouldn't find otherwise.

This idea could be changed somewhat.  Rather than running the other plans, we 
could just extract from them which alternate joins they include, and consider 
also calculating those join statistics.

mark


pgsql-hackers by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: [PATCHES] COPY view
Next
From: Zoltan Boszormenyi
Date:
Subject: Re: [PATCHES] COPY view