Re: Erroneous cost estimation for nested loop join - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: Erroneous cost estimation for nested loop join
Date:
Msg-id: CA+TgmoYkNQXRcRZekNK=qVQkgfp5zD+1_7vYa7fDASCx8hiqzA@mail.gmail.com
In response to: Re: Erroneous cost estimation for nested loop join (Jeff Janes <jeff.janes@gmail.com>)
Responses: Re: Erroneous cost estimation for nested loop join
List: pgsql-hackers
On Mon, Nov 16, 2015 at 6:50 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Mon, Nov 9, 2015 at 6:37 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> kawamichi@tkl.iis.u-tokyo.ac.jp writes:
>>>       - cost parameter calibration: random_page_cost = 92.89
>>
>> TBH, you lost me there already.  I know of no hardware on which that would
>> be a sane depiction of reality, so I think you've probably overfitted the
>> model to some particular case it was already inaccurate on.
>
> I can easily get a ratio of random to sequential of 50, and my RAID is
> nothing special.  I don't see why a high-end RAID couldn't justify 100
> or more, as long as they know the cache-hit rate is very low. (The default
> value of 4 seems to bake in the notion that 90% of even random IO is
> going to be satisfied from the cache, which might be a good estimate
> if you have frequently used smallish lookup tables that always get
> joined to the RAM-busters, but some people aren't going to have that
> type of database query as their main load.)

I agree.  What I've been thinking about is:

- If we're sequentially scanning a small table, let's say less than 1/4
of shared_buffers, which is the point where synchronized scans kick
in, then assume the data is coming from shared_buffers.
- If we're scanning a medium-sized table, let's say less than
effective_cache_size, then assume the data is coming from the OS
cache.  Maybe this is the same cost as the previous case, or maybe
it's slightly more.
- Otherwise, assume that the first effective_cache_size pages are
coming from cache and the rest has to be read from disk.  This is
perhaps unrealistic, but we don't want the cost curve to be
discontinuous.  (A rough sketch of this scheme follows below.)
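
To make that a bit more concrete, here is a very rough sketch.  Nothing
below exists in costsize.c today: the function name, the *_pages
parameters, and cached_page_cost are all invented for illustration.

    /*
     * Hypothetical sketch of a size-based per-page cost for a
     * sequential scan.  All names are placeholders, not planner code.
     */
    static double
    estimated_seq_page_cost(double rel_pages,
                            double shared_buffers_pages,
                            double effective_cache_size_pages,
                            double cached_page_cost,
                            double seq_page_cost)
    {
        /* Small table: assume the data comes from shared_buffers. */
        if (rel_pages < shared_buffers_pages / 4)
            return cached_page_cost;

        /* Medium-sized table: assume the data comes from the OS cache. */
        if (rel_pages < effective_cache_size_pages)
            return cached_page_cost;    /* or perhaps slightly more */

        /*
         * Large table: assume the first effective_cache_size pages are
         * cached and the rest must be read from disk, blended so the
         * cost curve stays continuous as rel_pages grows.
         */
        {
            double      cached_fraction = effective_cache_size_pages / rel_pages;

            return cached_fraction * cached_page_cost +
                   (1.0 - cached_fraction) * seq_page_cost;
        }
    }

The point of the blending in the last branch is that the estimate is a
continuous function of rel_pages, so a relation growing past one of the
thresholds never causes a sudden jump in estimated cost.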

The effect of this would be that sequentially scanning a small table
would look cheaper per page than sequentially scanning a large one,
which is a good idea, because it's likely to be true.  Similar
adaptations could be made for index scans, index-only scans, bitmap
index scans, and bitmap heap scans.  The default value of
random_page_cost would go up, but there would be a new
cached_page_cost GUC that would apply in some cases where
seq_page_cost and/or random_page_cost apply today.
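
Purely as a sketch of what that might look like at the GUC level
(cached_page_cost does not exist today, and the defaults below are
placeholders rather than proposed values):

    /*
     * Hypothetical declarations only.  The "mostly cached" assumption
     * that is currently baked into random_page_cost's default of 4.0
     * would move into cached_page_cost, letting random_page_cost
     * reflect what a random disk read actually costs relative to a
     * sequential one.
     */
    double      cached_page_cost = 0.1;     /* page found in cache */
    double      seq_page_cost = 1.0;        /* sequential read from disk */
    double      random_page_cost = 40.0;    /* random read from disk */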

Previous attempts to improve the modeling of cache effects have
faltered because we don't know what will be cached at execution time,
and even if we did, it can change very quickly, and we don't want plan
instability.  But it seems to me that guessing based on the size of
the relation is pretty reasonable: essentially, we'd be trying to
refine a model which says that every page is equally likely to be
cached, and that shouldn't be too high a bar to clear.
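
To put illustrative numbers on that: if effective_cache_size works out
to 500,000 pages, a 200,000-page relation would be costed as if fully
cached, while a 1,000,000-page relation would be costed as if half its
pages were cached and half had to come from disk.  Either way the guess
depends only on the relation's size, which we already know at plan time
and which changes slowly, so it shouldn't reintroduce the plan
instability problem.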

A problem with this sort of thing, of course, is that it's really hard
to test a proposed change broadly enough to be certain how it will
play out in the real world.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


