Re: SeqScan costs - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: SeqScan costs
Date
Msg-id 1218622010.5343.283.camel@ebony.2ndQuadrant
Whole thread Raw
In response to Re: SeqScan costs  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Tue, 2008-08-12 at 23:22 -0400, Tom Lane wrote:
> Gregory Stark <stark@enterprisedb.com> writes:
> >> On Tue, 2008-08-12 at 15:46 -0400, Tom Lane wrote:
> >>> This is only going to matter for a table of 1 block (or at least very
> >>> few blocks), and for such a table it's highly likely that it's in RAM
> >>> anyway.  So I'm unconvinced that the proposed change represents a
> >>> better model of reality.
> 
> > I think the first block of a sequential scan is clearly a random access. If
> > that doesn't represent reality well then perhaps we need to tackle both
> > problems together.
> 
> The point I was trying to make (evidently not too well) is that fooling
> around with fundamental aspects of the cost models is not something that
> should be done without any evidence.  We've spent ten years getting the
> system to behave reasonably well with the current models, and it's quite
> possible that changing them to be "more accurate" according to a
> five-minute analysis is going to make things markedly worse overall.
> 
> I'm not necessarily opposed to making this change --- it does sound
> kinda plausible --- but I want to see some hard evidence that it does
> more good than harm before we put it in.

psql -f seq.sql -v numblocks=2 -v pkval=Anything -v filler=Varies

When numblocks=2 I consistently see that an index scan is actually
faster than a seqscan, yet the planner chooses a seqscan in all cases. 

This is true for any value of pkval and values of filler up to 4-500
bytes. We already take into account the length of rows because we
estimate the CPU costs per row not per block. That is not what I wish to
change.

This same situation occurs for all small tables. What I conclude is that
the "disk cost" swamps the CPU costs and so we end up with a seq scan
when we really want an index scan.

There are two ways of looking at this
* we work out a complex scheme for knowing when to remove disk costs
* we realise that the "disk cost" is actually the same on the *first*
block whether we are in memory or on disk. 

If we take the second way, then we have a small but crucial correction
factor that produces better plans in most cases on small tables. Doing
it this way allows us to not worry about the cacheing, but just have a
scheme that balances the access costs better so that although they are
still present in the total cost the final plan choice is less dependent
upon the disk cost and more dependent upon the CPU costs.

This analysis is the result of experience, then measurement, not theory.
I've been looking for an easy and justifiable way to nudge the cost
factors so that they work better for small tables.
 run_cost += random_page_cost + seq_page_cost * (baserel->pages - 1);


> > People lower random_page_cost because we're not doing a good job estimating
> > how much of a table is in cache.
> 
> Agreed, the elephant in the room is that we lack enough data to model
> caching effects with any degree of realism.

I'm specifically talking about a proposal that works whether or not the
first block of the table is in cache, because I see a problem with small
table access.

I'm not suggesting that we model cacheing effects (though we may choose
to later). If you did, you might need to consider cross-statement
effects such as the likelihood that a UPDATE .. WHERE CURRENT OF CURSOR
is more likely to find the block in cache, or other effects such as
certain MFVs might actually be more likely to be in cache than non-MFVs
and so index scans against them are actually more preferable than it
might appear.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



pgsql-hackers by date:

Previous
From: Markus Wanner
Date:
Subject: Re: Transaction-controlled robustness for replication
Next
From: "Stephen R. van den Berg"
Date:
Subject: Re: Replay attack of query cancel