Re: reducing random_page_cost from 4 to 2 to force index scan - Mailing list pgsql-performance
| From | Cédric Villemain |
|---|---|
| Subject | Re: reducing random_page_cost from 4 to 2 to force index scan |
| Date | |
| Msg-id | BANLkTikiQmj1z_VtdcGxw6_v-tdnPuwSug@mail.gmail.com |
| In response to | Re: reducing random_page_cost from 4 to 2 to force index scan (Jim Nasby <jim@nasby.net>) |
| List | pgsql-performance |
2011/5/19 Jim Nasby <jim@nasby.net>:
> On May 19, 2011, at 9:53 AM, Robert Haas wrote:
>> On Wed, May 18, 2011 at 11:00 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>>> Jim Nasby wrote:
>>>> I think the challenge there would be how to define the scope of the
>>>> hot-spot. Is it the last X pages? Last X serial values? Something like
>>>> correlation?
>>>>
>>>> Hmm... it would be interesting if we had average relation access times for
>>>> each stats bucket on a per-column basis; that would give the planner a
>>>> better idea of how much IO overhead there would be for a given WHERE clause.
>>>
>>> You've already given one reasonable first answer to your question here. If
>>> you defined a usage counter for each histogram bucket, and incremented that
>>> each time something from it was touched, that could lead to a very rough way
>>> to determine access distribution. Compute a ratio of the counts in those
>>> buckets, then have an estimate of the total cached percentage; multiplying
>>> the two will give you an idea how much of that specific bucket might be in
>>> memory. It's not perfect, and you need to incorporate some sort of aging
>>> method to it (probably weighted-average based), but the basic idea could
>>> work.
>>
>> Maybe I'm missing something here, but it seems like that would be
>> nightmarishly slow. Every time you read a tuple, you'd have to look
>> at every column of the tuple and determine which histogram bucket it
>> was in (or, presumably, which MCV it is, since those aren't included
>> in working out the histogram buckets). That seems like it would slow
>> down a sequential scan by at least 10x.
>
> You definitely couldn't do it real-time. But you might be able to copy the
> tuple somewhere and have a background process do the analysis.
>
> That said, it might be more productive to know what blocks are available in
> memory and use correlation to guesstimate whether a particular query will
> need hot or cold blocks.
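Greg's per-bucket usage-counter idea, quoted above, can be sketched very roughly as follows. This is only an illustration of the arithmetic (touch counts per histogram bucket, weighted-average aging, then scaling the relation-wide cached percentage by each bucket's share of recent accesses); all names are hypothetical and nothing like this exists in PostgreSQL:

```python
# Hypothetical sketch of a per-histogram-bucket usage counter with
# weighted-average aging, as suggested in the quoted discussion.
# Not PostgreSQL code; all names are illustrative.

class BucketCacheEstimator:
    def __init__(self, n_buckets, decay=0.9):
        self.counts = [0.0] * n_buckets  # touch counter per histogram bucket
        self.decay = decay               # aging factor (weighted-average style)

    def touch(self, bucket):
        # Increment the usage counter when a tuple from this bucket is read.
        self.counts[bucket] += 1.0

    def age(self):
        # Periodically decay old counts so the distribution tracks
        # recent access patterns rather than all-time history.
        self.counts = [c * self.decay for c in self.counts]

    def cached_fraction(self, bucket, total_cached_pct):
        # Estimate how much of this bucket might be in memory:
        # the relation-wide cached percentage, scaled by this bucket's
        # share of recent touches relative to a uniform distribution.
        total = sum(self.counts)
        if total == 0:
            return total_cached_pct  # no data yet: assume uniform caching
        share = self.counts[bucket] / total
        return min(1.0, share * len(self.counts) * total_cached_pct)
```

For example, if one bucket receives most of the touches, its estimated in-memory fraction rises well above the relation-wide average, while cold buckets fall toward zero. This says nothing about Robert's cost objection, of course: the per-tuple bucket lookup is the expensive part, which is why Jim suggests doing the classification in a background process.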
> Or perhaps we create a different structure that lets you track the
> distribution of each column linearly through the table; something more
> sophisticated than just using correlation... perhaps something like
> indicating which stats bucket was most prevalent in each block/range of
> blocks in a table. That information would allow you to estimate exactly
> what blocks in the table you're likely to need...

Those are very good ideas; I would keep them in mind for vacuum/checkpoint
tasks: if you are able to tell hot data from cold data, you can order it
within the segments of the relation. But making it work at the planner level
looks hard. I am not opposed to the idea, but I have no idea how to do it
right now.

> --
> Jim C. Nasby, Database Architect    jim@nasby.net
> 512.569.9461 (cell)                 http://jim.nasby.net

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Training and Support