Thread: reducing random_page_cost from 4 to 2 to force index scan
Hi,

I am using PostgreSQL 9.0. There is a salutations table with 44 rows, and a contacts table with more than a million rows. The contacts table has a nullable (only 0.002% null) salutation_id column, referencing salutations.id.

With this query:

    SELECT salutations.id, salutations.name, salutations.description,
           EXISTS (
               SELECT 1
               FROM contacts
               WHERE salutations.id = contacts.salutation_id
           ) AS in_use
    FROM salutations

I have to reduce random_page_cost from 4 to 2 to force index scan.

EXPLAIN ANALYSE output with random_page_cost = 4:

 Seq Scan on salutations  (cost=0.00..50.51 rows=44 width=229) (actual time=0.188..3844.037 rows=44 loops=1)
   SubPlan 1
     ->  Seq Scan on contacts  (cost=0.00..64578.41 rows=57906 width=0) (actual time=87.358..87.358 rows=1 loops=44)
           Filter: ($0 = salutation_id)
 Total runtime: 3844.113 ms

EXPLAIN ANALYSE output with random_page_cost = 4, enable_seqscan = 0:

 Seq Scan on salutations  (cost=10000000000.00..10000000095.42 rows=44 width=229) (actual time=0.053..0.542 rows=44 loops=1)
   SubPlan 1
     ->  Index Scan using ix_contacts_salutation_id on contacts  (cost=0.00..123682.07 rows=57906 width=0) (actual time=0.011..0.011 rows=1 loops=44)
           Index Cond: ($0 = salutation_id)
 Total runtime: 0.592 ms

EXPLAIN ANALYSE output with random_page_cost = 2:

 Seq Scan on salutations  (cost=0.00..48.87 rows=44 width=229) (actual time=0.053..0.541 rows=44 loops=1)
   SubPlan 1
     ->  Index Scan using ix_contacts_salutation_id on contacts  (cost=0.00..62423.45 rows=57906 width=0) (actual time=0.011..0.011 rows=1 loops=44)
           Index Cond: ($0 = salutation_id)
 Total runtime: 0.594 ms

So, index scan wins by a very small margin over sequential scan after the tuning. I am a bit puzzled because index scan is more than 3000 times faster in this case, but the estimated costs are about the same. Did I do something wrong?

Regards,
Yap
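For anyone wanting to reproduce the three plans above, all of the settings involved can be changed per transaction, so nothing needs to be touched in postgresql.conf. A sketch, reusing the table and column names from the message above:

    BEGIN;
    -- plan 1: defaults (random_page_cost = 4)
    EXPLAIN ANALYSE
    SELECT s.id, s.name, s.description,
           EXISTS (SELECT 1 FROM contacts c
                   WHERE c.salutation_id = s.id) AS in_use
    FROM salutations s;

    -- plan 2: random_page_cost = 4, enable_seqscan = 0
    SET LOCAL enable_seqscan = off;
    EXPLAIN ANALYSE
    SELECT s.id, s.name, s.description,
           EXISTS (SELECT 1 FROM contacts c
                   WHERE c.salutation_id = s.id) AS in_use
    FROM salutations s;

    -- plan 3: random_page_cost = 2
    SET LOCAL enable_seqscan = on;
    SET LOCAL random_page_cost = 2;
    EXPLAIN ANALYSE
    SELECT s.id, s.name, s.description,
           EXISTS (SELECT 1 FROM contacts c
                   WHERE c.salutation_id = s.id) AS in_use
    FROM salutations s;
    ROLLBACK;   -- the SET LOCAL changes disappear here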
Sok Ann Yap <sokann@gmail.com> wrote: > So, index scan wins by a very small margin over sequential scan > after the tuning. I am a bit puzzled because index scan is more > than 3000 times faster in this case, but the estimated costs are > about the same. Did I do something wrong? Tuning is generally needed to get best performance from PostgreSQL. Needing to reduce random_page_cost is not unusual in situations where a good portion of the active data is in cache (between shared_buffers and the OS cache). Please show us your overall configuration and give a description of the hardware (how many of what kind of cores, how much RAM, what sort of storage system). The configuration part can be obtained by running the query on this page and pasting the result into your next post: http://wiki.postgresql.org/wiki/Server_Configuration There are probably some other configuration adjustments you could do to ensure that good plans are chosen. -Kevin
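The query on that wiki page is roughly the following (quoted from memory, so treat it as an approximation and prefer the version on the wiki itself); it lists every setting that differs from the built-in defaults:

    SELECT name, current_setting(name)
    FROM pg_settings
    WHERE source NOT IN ('default', 'override');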
On Tue, Apr 26, 2011 at 4:37 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Sok Ann Yap <sokann@gmail.com> wrote: > >> So, index scan wins by a very small margin over sequential scan >> after the tuning. I am a bit puzzled because index scan is more >> than 3000 times faster in this case, but the estimated costs are >> about the same. Did I do something wrong? > > Tuning is generally needed to get best performance from PostgreSQL. > Needing to reduce random_page_cost is not unusual in situations > where a good portion of the active data is in cache (between > shared_buffers and the OS cache). Please show us your overall > configuration and give a description of the hardware (how many of > what kind of cores, how much RAM, what sort of storage system). The > configuration part can be obtained by running the query on this page > and pasting the result into your next post: > > http://wiki.postgresql.org/wiki/Server_Configuration > > There are probably some other configuration adjustments you could do > to ensure that good plans are chosen. The very first thing to check is effective_cache_size and to set it to a reasonable value. merlin
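effective_cache_size is only a planner estimate of how much data the OS cache plus shared_buffers can hold, so it is cheap to experiment with per session. A sketch; the 3GB figure is just an example, not a recommendation:

    SHOW effective_cache_size;
    SET effective_cache_size = '3GB';   -- affects only this session
    -- To make it permanent, set effective_cache_size in postgresql.conf
    -- (a common rule of thumb is 50-75% of RAM) and reload.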
On Wed, Apr 27, 2011 at 3:04 AM, Merlin Moncure <mmoncure@gmail.com> wrote: > The very first thing to check is effective_cache_size and to set it to > a reasonable value. > The problem there, I think, is that the planner is doing a full join, instead of a semi-join.
On Wed, Apr 27, 2011 at 9:22 AM, Claudio Freire <klaussfreire@gmail.com> wrote: > The problem there, I think, is that the planner is doing a full join, > instead of a semi-join. Or, rather, computing cost as if it was a full join. I'm not sure why.
On Wed, Apr 27, 2011 at 5:37 AM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Sok Ann Yap <sokann@gmail.com> wrote: > >> So, index scan wins by a very small margin over sequential scan >> after the tuning. I am a bit puzzled because index scan is more >> than 3000 times faster in this case, but the estimated costs are >> about the same. Did I do something wrong? > > Tuning is generally needed to get best performance from PostgreSQL. > Needing to reduce random_page_cost is not unusual in situations > where a good portion of the active data is in cache (between > shared_buffers and the OS cache). Please show us your overall > configuration and give a description of the hardware (how many of > what kind of cores, how much RAM, what sort of storage system). The > configuration part can be obtained by running the query on this page > and pasting the result into your next post: > > http://wiki.postgresql.org/wiki/Server_Configuration > > There are probably some other configuration adjustments you could do > to ensure that good plans are chosen. > > -Kevin

Here's the configuration (this is just a low end laptop):

             name             | current_setting
 -----------------------------+-----------------------------------------------
  version                     | PostgreSQL 9.0.4 on x86_64-pc-linux-gnu, compiled by GCC x86_64-pc-linux-gnu-gcc (Gentoo 4.5.2 p1.0, pie-0.4.5) 4.5.2, 64-bit
  checkpoint_segments         | 16
  default_statistics_target   | 10000
  effective_cache_size        | 512MB
  lc_collate                  | en_US.UTF-8
  lc_ctype                    | en_US.UTF-8
  listen_addresses            | *
  log_destination             | syslog
  log_min_duration_statement  | 0
  maintenance_work_mem        | 256MB
  max_connections             | 100
  max_stack_depth             | 2MB
  port                        | 5432
  random_page_cost            | 4
  server_encoding             | UTF8
  shared_buffers              | 256MB
  silent_mode                 | on
  TimeZone                    | Asia/Kuala_Lumpur
  wal_buffers                 | 1MB
  work_mem                    | 32MB
 (20 rows)

The thing is, the query I posted was fairly simple (I think), and PostgreSQL should be able to choose the 3000+ times faster index scan with the default random_page_cost of 4. If I need to reduce it to 2 when using a 5.4k rpm slow disk, what is random_page_cost = 4 good for?

(Sorry for the double message, I forgot to CC the list in the first one)
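Worth noting as an alternative to lowering random_page_cost globally: 9.0 also allows the page-cost settings to be overridden per tablespace, so the cheaper random-I/O assumption can be limited to storage that really is fast or well cached. A sketch (the tablespace name here is made up):

    -- only for relations stored in this (hypothetical) tablespace:
    ALTER TABLESPACE fast_disk SET (random_page_cost = 2, seq_page_cost = 1);

    -- or override it for a single session while testing:
    SET random_page_cost = 2;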
On Tue, Apr 26, 2011 at 9:04 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > On Tue, Apr 26, 2011 at 4:37 PM, Kevin Grittner > <Kevin.Grittner@wicourts.gov> wrote: >> Sok Ann Yap <sokann@gmail.com> wrote: >> >>> So, index scan wins by a very small margin over sequential scan >>> after the tuning. I am a bit puzzled because index scan is more >>> than 3000 times faster in this case, but the estimated costs are >>> about the same. Did I do something wrong? >> >> Tuning is generally needed to get best performance from PostgreSQL. >> Needing to reduce random_page_cost is not unusual in situations >> where a good portion of the active data is in cache (between >> shared_buffers and the OS cache). Please show us your overall >> configuration and give a description of the hardware (how many of >> what kind of cores, how much RAM, what sort of storage system). The >> configuration part can be obtained by running the query on this page >> and pasting the result into your next post: >> >> http://wiki.postgresql.org/wiki/Server_Configuration >> >> There are probably some other configuration adjustments you could do >> to ensure that good plans are chosen. > > The very first thing to check is effective_cache_size and to set it to > a reasonable value.

Actually, effective_cache_size has no impact on costing except when planning a nested loop with inner index scan. So, a query against a single table can never benefit from changing that setting. Kevin's suggestion of adjusting seq_page_cost and random_page_cost is the way to go.

We've talked in the past (and I still think it's a good idea, but haven't gotten around to doing anything about it) about adjusting the planner to attribute to each relation the percentage of its pages which we believe we'll find in cache. Although many complicated ideas for determining that percentage have been proposed, my favorite one is fairly simple: assume that small relations will be mostly or entirely cached, and that big ones won't be. Allow the administrator to override the result on a per-relation basis. It's difficult to imagine a situation where the planner should assume that a relation with only a handful of pages isn't going to be cached. Even if it isn't, as soon as someone begins accessing it, it will be.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
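The cached percentage Robert describes is not something the planner collects today, but the shared_buffers part of it can at least be inspected with the contrib module pg_buffercache. A sketch (it says nothing about the OS cache, and it assumes the default 8kB block size):

    -- rough view of how much of each relation currently sits in shared_buffers
    SELECT c.relname,
           count(*)                        AS buffered_pages,
           pg_relation_size(c.oid) / 8192  AS total_pages
    FROM pg_buffercache b
    JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
    WHERE b.reldatabase = (SELECT oid FROM pg_database
                           WHERE datname = current_database())
    GROUP BY c.relname, c.oid
    ORDER BY buffered_pages DESC
    LIMIT 20;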
Robert Haas <robertmhaas@gmail.com> wrote: > We've talked in the past (and I still think it's a good idea, but > haven't gotten around to doing anything about it) about adjusting > the planner to attribute to each relation the percentage of its > pages which we believe we'll find in cache. Although many > complicated ideas for determining that percentage have been > proposed, my favorite one is fairly simple: assume that small > relations will be mostly or entirely cached, and that big ones > won't be. Allow the administrator to override the result on a > per-relation basis. It's difficult to imagine a situation where > the planner should assume that a relation with only handful of > pages isn't going to be cached. Even if it isn't, as soon as > someone begins accessing it, it will be. Simple as the heuristic is, I bet it would be effective. While one can easily construct a synthetic case where it falls down, the ones I can think of aren't all that common, and you are suggesting an override mechanism. -Kevin
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Apr 26, 2011 at 9:04 PM, Merlin Moncure <mmoncure@gmail.com> wrote: >> The very first thing to check is effective_cache_size and to set it to >> a reasonable value. > Actually, effective_cache_size has no impact on costing except when > planning a nested loop with inner index scan. So, a query against a > single table can never benefit from changing that setting. That's flat out wrong. It does affect the cost estimate for plain indexscan (and bitmap indexscan) plans. regards, tom lane
On Fri, May 13, 2011 at 3:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Tue, Apr 26, 2011 at 9:04 PM, Merlin Moncure <mmoncure@gmail.com> wrote: >>> The very first thing to check is effective_cache_size and to set it to >>> a reasonable value. > >> Actually, effective_cache_size has no impact on costing except when >> planning a nested loop with inner index scan. So, a query against a >> single table can never benefit from changing that setting. > > That's flat out wrong. It does affect the cost estimate for plain > indexscan (and bitmap indexscan) plans.

<rereads code> OK, I agree. I obviously misinterpreted this code the last time I read it.

I guess maybe the reason why it didn't matter for the OP is that - if the size of the index in pages is smaller than the pro-rated fraction of effective_cache_size allowed to the index - then the exact value doesn't affect the answer.

I apparently need to study this code more.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
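The effect Tom describes can be seen from the estimated costs alone, since effective_cache_size feeds the index-scan fetch estimate; whether and how far the numbers move depends on the index and table sizes. A quick way to check for a given query (plain EXPLAIN, nothing is executed; the literal 42 is just an example value):

    SET effective_cache_size = '128MB';
    EXPLAIN SELECT * FROM contacts WHERE salutation_id = 42;

    SET effective_cache_size = '4GB';
    EXPLAIN SELECT * FROM contacts WHERE salutation_id = 42;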
> I guess maybe the reason why it didn't matter for the OP is that - if > the size of the index page in pages is smaller than the pro-rated > fraction of effective_cache_size allowed to the index - then the exact > value doesn't affect the answer. > > I apparently need to study this code more.

FWIW: random_page_cost is meant to be the ratio between the cost of looking up a single row as an index lookup, and the cost of looking up that same row as part of a larger sequential scan. For specific storage, that coefficient should be roughly the same regardless of the table size. So if your plan for optimization involves manipulating RPC for anything other than a change of storage, you're Doing It Wrong.

Instead, we should be fixing the formulas these are based on and leaving RPC alone.

For any data page, there are actually four costs associated with each tuple lookup, per:

    in-memory/seq    | on disk/seq
    -----------------+----------------
    in-memory/random | on disk/random

(yes, there's actually more for bitmapscan etc. but the example holds)

For any given tuple lookup, then, you can assign a cost based on where you think that tuple falls in that quadrant map. Since this is all probability-based, you'd be assigning a cost as a mixed % of in-memory and on-disk costs. Improvements in accuracy of this formula would come through improvements in accuracy in predicting if a particular data page will be in memory.

This is what the combination of random_page_cost and effective_cache_size ought to supply, but I don't think it does, quite.

-- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
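Spelling out that "mixed %" as a formula (my notation, not the planner's actual code): for an estimated probability p that a given page is already in memory,

    expected_page_cost = p * memory_page_cost + (1 - p) * disk_page_cost

with separate disk_page_cost values for sequential vs. random access, and p ideally derived from effective_cache_size plus per-relation cache statistics rather than guessed globally.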
2011/5/13 Josh Berkus <josh@agliodbs.com>: > >> I guess maybe the reason why it didn't matter for the OP is that - if >> the size of the index page in pages is smaller than the pro-rated >> fraction of effective_cache_size allowed to the index - then the exact >> value doesn't affect the answer. >> >> I apparently need to study this code more. > > FWIW: random_page_cost is meant to be the ratio between the cost of > looking up a single row as and index lookup, and the cost of looking up > that same row as part of a larger sequential scan. For specific > storage, that coefficient should be roughly the same regardless of the > table size. So if your plan for optimization involves manipulating RPC > for anything other than a change of storage, you're Doing It Wrong. > > Instead, we should be fixing the formulas these are based on and leaving > RPC alone. > > For any data page, there are actually four costs associated with each > tuple lookup, per: > > in-memory/seq | on disk/seq > ----------------+---------------- > in-memory/random| on disk/random

It is also missing some of the more theoretical costs, like sort_page/temp_page: those are based on a ratio of seq_page_cost and random_page_cost, or on plain seq_page_cost (when working out of work_mem). Memory access is accounted for with something like 0.1 in some places, AFAIR. (And random vs. sequential memory access is the same at the level of estimation we do.)

> > (yes, there's actually more for bitmapscan etc. but the example holds)

(If I read the sources correctly, this one uses a linear approach to weight the cost between random_page_cost and seq_page_cost for the heap page fetches, plus the Mackert and Lohman formula, if needed, in its best usage: predicting what should be in cache *because* of the current query execution, not because of the current status of the page cache.)

> > For any given tuple lookup, then, you can assign a cost based on where > you think that tuple falls in that quadrant map. Since this is all > probability-based, you'd be assigning a cost as a mixed % of in-memory > and on-disk costs. Improvements in accuracy of this formula would come > through improvements in accuracy in predicting if a particular data page > will be in memory. > > This is what the combination of random_page_cost and > effective_cache_size ought to supply, but I don't think it does, quite. > > -- > Josh Berkus > PostgreSQL Experts Inc. > http://pgexperts.com

-- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
On Fri, May 13, 2011 at 4:13 PM, Josh Berkus <josh@agliodbs.com> wrote: > Instead, we should be fixing the formulas these are based on and leaving > RPC alone. > > For any data page, there are actually four costs associated with each > tuple lookup, per:

All true. I suspect that in practice the difference between random and sequential memory page costs is small enough to be ignorable, although of course I might be wrong. I've never seen a database that was fully cached in memory where it was necessary to set random_page_cost>seq_page_cost to get good plans -- no doubt partly because even if the pages were consecutive on disk, there's no reason to suppose they would be so in memory, and we certainly wouldn't know one way or the other at planning time. But I agree we should add a cached_page_cost as part of all this.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, May 14, 2011 at 3:13 AM, Josh Berkus <josh@agliodbs.com> wrote: > This is what the combination of random_page_cost and > effective_cache_size ought to supply, but I don't think it does, quite. I think random_page_cost causes problems because I need to combine disk random access time, which I can measure, with a guesstimate of the disk cache hit rate. It would be lovely if these two variables were separate. It would be even lovelier if the disk cache hit rate could be probed at run time and didn't need setting at all, but I suspect that isn't possible on some platforms. -- Stuart Bishop <stuart@stuartbishop.net> http://www.stuartbishop.net/
Robert, > All true. I suspect that in practice the different between random and > sequential memory page costs is small enough to be ignorable, although > of course I might be wrong.

This hasn't been my experience, although I have not carefully measured it. In fact, there's good reason to suppose that, if you were selecting 50% or more of a table, sequential access would still be faster even for an entirely in-memory table.

As a parallel to our development, Redis used to store all data as linked lists, making every object lookup effectively a random lookup. They found that even with a database which is pinned in memory, creating a data page structure (they call it "ziplists") and supporting sequential scans was up to 10X faster for large lists.

So I would assume that there is still a coefficient difference between seeks and scans in memory until proven otherwise.

-- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Stuart, > I think random_page_cost causes problems because I need to combine > disk random access time, which I can measure, with a guesstimate of > the disk cache hit rate. See, that's wrong. Disk cache hit rate is what effective_cache_size (ECS) is for. Really, there's several factors which should be going into the planner's estimates to determine a probability of a table being cached: * ratio between total database size and ECS * ratio between table size and ECS * ratio between index size and ECS * whether the table is "hot" or not * whether the index is "hot" or not The last two statistics are critically important for good estimation, and they are not things we currently collect. By "hot" I mean: is this a relation which is accessed several times per minute/hour and is thus likely to be in the cache when we need it? Currently, we have no way of knowing that. Without "hot" statistics, we're left with guessing based on size, which results in bad plans for small tables in large databases which are accessed infrequently. Mind you, for large tables it would be even better to go beyond that and actually have some knowledge of which disk pages might be in cache. However, I think that's beyond feasibility for current software/OSes. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Sun, May 15, 2011 at 2:08 PM, Josh Berkus <josh@agliodbs.com> wrote: >> All true. I suspect that in practice the different between random and >> sequential memory page costs is small enough to be ignorable, although >> of course I might be wrong. > > This hasn't been my experience, although I have not carefully measured > it. In fact, there's good reason to suppose that, if you were selecting > 50% of more of a table, sequential access would still be faster even for > an entirely in-memory table. > > As a parallel to our development, Redis used to store all data as linked > lists, making every object lookup effectively a random lookup. They > found that even with a database which is pinned in memory, creating a > data page structure (they call it "ziplists") and supporting sequential > scans was up to 10X faster for large lists. > > So I would assume that there is still a coefficient difference between > seeks and scans in memory until proven otherwise. Well, anything's possible. But I wonder whether the effects you are describing might result from a reduction in the *number* of pages accessed rather than a change in the access pattern. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
2011/5/15 Josh Berkus <josh@agliodbs.com>: > Stuart, > >> I think random_page_cost causes problems because I need to combine >> disk random access time, which I can measure, with a guesstimate of >> the disk cache hit rate. > > See, that's wrong. Disk cache hit rate is what effective_cache_size > (ECS) is for. > > Really, there's several factors which should be going into the planner's > estimates to determine a probability of a table being cached: > > * ratio between total database size and ECS > * ratio between table size and ECS > * ratio between index size and ECS > * whether the table is "hot" or not > * whether the index is "hot" or not > > The last two statistics are critically important for good estimation, > and they are not things we currently collect. By "hot" I mean: is this > a relation which is accessed several times per minute/hour and is thus > likely to be in the cache when we need it? Currently, we have no way of > knowing that. > > Without "hot" statistics, we're left with guessing based on size, which > results in bad plans for small tables in large databases which are > accessed infrequently. > > Mind you, for large tables it would be even better to go beyond that and > actually have some knowledge of which

*which*? Do you mean 'area' of the tables?

> disk pages might be in cache. > However, I think that's beyond feasibility for current software/OSes.

Maybe not :) mincore is available in many OSes, and Windows has options to get those stats too.

> -- > Josh Berkus > PostgreSQL Experts Inc. > http://pgexperts.com

-- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
On 16/05/11 05:45, Cédric Villemain wrote: > 2011/5/15 Josh Berkus <josh@agliodbs.com>: >> disk pages might be in cache. >> However, I think that's beyond feasibility for current software/OSes. > > maybe not :) mincore is available in many OSes, and windows have > options to get those stats too. AFAIK, mincore() is only useful for mmap()ed files and for finding out if it's safe to access certain blocks of memory w/o risking triggering heavy swapping. It doesn't provide any visibility into the OS's block device / file system caches; you can't ask it "how much of this file is cached in RAM" or "is this range of blocks in this file cached in RAM". Even if you could, it's hard to see how an approach that relied on asking the OS through system calls about the cache state when planning every query could be fast enough to be viable. -- Craig Ringer
Craig Ringer wrote: > AFAIK, mincore() is only useful for mmap()ed files and for finding out > if it's safe to access certain blocks of memory w/o risking triggering > heavy swapping. > > It doesn't provide any visibility into the OS's block device / file > system caches; you can't ask it "how much of this file is cached in RAM" > or "is this range of blocks in this file cached in RAM".

You should try out pgfincore if you think this can't be done!

> Even if you could, it's hard to see how an approach that relied on > asking the OS through system calls about the cache state when planning > every query could be fast enough to be viable.

You can't do it in real-time. You don't necessarily want that even if it were possible; too many possibilities for nasty feedback loops where you always favor using some marginal index that happens to be in memory, and therefore never page in things that would be faster once they're read. The only reasonable implementation that avoids completely unstable plans is to scan this data periodically and save some statistics on it--the way ANALYZE does--and then have that turn into a planner input.

The related secondary idea of just making assumptions about small tables/indexes, too, may be a useful heuristic to layer on top of this. There's a pile of ideas here that all seem reasonable both in terms of modeling real-world behavior and as things that could be inserted into the optimizer. As usual, I suspect that work needs to be followed by a giant testing exercise though.

-- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
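For reference, pgfincore uses mincore() to report how much of a relation's underlying files the OS currently has in its page cache. Usage is roughly the following, if I remember its interface correctly (treat the call as an assumption and check the pgfincore documentation for the real signature):

    -- assumed interface: one row per relation segment, including a count
    -- of OS pages currently resident in the kernel page cache
    SELECT * FROM pgfincore('contacts');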
On 2011-05-16 03:18, Greg Smith wrote:
> You can't do it in real-time. You don't necessarily want that to
> even if it were possible; too many possibilities for nasty feedback
> loops where you always favor using some marginal index that happens
> to be in memory, and therefore never page in things that would be
> faster once they're read. The only reasonable implementation that
> avoids completely unstable plans is to scan this data periodically
> and save some statistics on it--the way ANALYZE does--and then have
> that turn into a planner input.
Would that be feasible? Have a process collect the data every now and then,
probably picking some conservative-average function, and feed
it into pg_stats for each index/relation?
To me it seems like a robust and fairly trivial way to get better numbers. The
fear is that the OS-cache is too much in flux to get any stable numbers out
of it.
--
Jesper
On 2011-05-16 06:41, Jesper Krogh wrote: > On 2011-05-16 03:18, Greg Smith wrote: >> You can't do it in real-time. You don't necessarily want that to >> even if it were possible; too many possibilities for nasty feedback >> loops where you always favor using some marginal index that happens >> to be in memory, and therefore never page in things that would be >> faster once they're read. The only reasonable implementation that >> avoids completely unstable plans is to scan this data periodically >> and save some statistics on it--the way ANALYZE does--and then have >> that turn into a planner input. > > Would that be feasible? Have process collecting the data every > now-and-then > probably picking some conservative-average function and feeding > it into pg_stats for each index/relation? > > To me it seems like a robust and fairly trivial way to to get better > numbers. The > fear is that the OS-cache is too much in flux to get any stable > numbers out > of it.

Ok, it may not work as well with indexes, since having 1% in cache may very well mean that 90% of all requested blocks are there.. for tables it should be more trivial.

-- Jesper
On Mon, May 16, 2011 at 12:49 AM, Jesper Krogh <jesper@krogh.cc> wrote: >> To me it seems like a robust and fairly trivial way to to get better >> numbers. The >> fear is that the OS-cache is too much in flux to get any stable numbers >> out >> of it. > > Ok, it may not work as well with index'es, since having 1% in cache may very > well mean that 90% of all requested blocks are there.. for tables in should > be more trivial. Tables can have hot spots, too. Consider a table that holds calendar reservations. Reservations can be inserted, updated, deleted. But typically, the most recent data will be what is most actively modified, and the older data will be relatively more (though not completely) static, and less frequently accessed. Such examples are common in many real-world applications. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, May 16, 2011 at 12:49 AM, Jesper Krogh <jesper@krogh.cc> wrote: >> Ok, it may not work as well with index'es, since having 1% in cache may very >> well mean that 90% of all requested blocks are there.. for tables in should >> be more trivial. > Tables can have hot spots, too. Consider a table that holds calendar > reservations. Reservations can be inserted, updated, deleted. But > typically, the most recent data will be what is most actively > modified, and the older data will be relatively more (though not > completely) static, and less frequently accessed. Such examples are > common in many real-world applications. Yes. I'm not convinced that measuring the fraction of a table or index that's in cache is really going to help us much. Historical cache hit rates might be useful, but only to the extent that the incoming query has a similar access pattern to those in the (recent?) past. It's not an easy problem. I almost wonder if we should not try to measure this at all, but instead let the DBA set a per-table or per-index number to use, analogous to the override we added recently for column n-distinct statistics ... regards, tom lane
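For comparison, the recently added n-distinct override Tom mentions looks like the first statement below; the second is NOT real syntax, it is only meant to illustrate what an analogous per-relation cache setting might look like:

    -- the existing per-column override (takes effect at the next ANALYZE):
    ALTER TABLE contacts ALTER COLUMN salutation_id SET (n_distinct = 44);

    -- a purely hypothetical per-relation analogue (does not exist):
    -- ALTER TABLE contacts SET (assumed_cached_fraction = 0.9);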
On Sun, May 15, 2011 at 9:49 PM, Jesper Krogh <jesper@krogh.cc> wrote: > > Ok, it may not work as well with index'es, since having 1% in cache may very > well mean that 90% of all requested blocks are there.. for tables in should > be more trivial. Why would the index have a meaningful hot-spot unless the underlying table had one as well? (Of course the root block will be a hot-spot, but certainly not 90% of all requests) Cheers, Jeff
Jeff Janes <jeff.janes@gmail.com> writes: > On Sun, May 15, 2011 at 9:49 PM, Jesper Krogh <jesper@krogh.cc> wrote: >> Ok, it may not work as well with index'es, since having 1% in cache may very >> well mean that 90% of all requested blocks are there.. for tables in should >> be more trivial. > Why would the index have a meaningful hot-spot unless the underlying > table had one as well? (Of course the root block will be a hot-spot, > but certainly not 90% of all requests) The accesses to an index are far more likely to be clustered than the accesses to the underlying table, because the index is organized in a way that's application-meaningful and the table not so much. Continuing the earlier example of a timestamp column, accesses might preferentially hit near the right end of the index while the underlying rows are all over the table. IOW, hot spots measured at the row level and hot spots measured at the page level could very easily be different between table and index. regards, tom lane
> The accesses to an index are far more likely to be clustered than the > accesses to the underlying table, because the index is organized in a > way that's application-meaningful and the table not so much.

So, to clarify, are you saying that if a query were actually requesting rows uniformly at random, then there would be no reason to suspect that index accesses would have hotspots? It seems like the index structure (i.e., the top node in b-trees) could also get in the way.

Best, Nathan
Nathan Boley <npboley@gmail.com> writes: >> The accesses to an index are far more likely to be clustered than the >> accesses to the underlying table, because the index is organized in a >> way that's application-meaningful and the table not so much. > So, to clarify, are you saying that if query were actually requesting > rows uniformly random, then there would be no reason to suspect that > index accesses would have hotspots? It seems like the index structure > ( ie, the top node in b-trees ) could also get in the way. The upper nodes would tend to stay in cache, yes, but we already assume that in the index access cost model, in a kind of indirect way: the model only considers leaf-page accesses in the first place ;-) regards, tom lane
On May 16, 2011, at 10:46 AM, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Mon, May 16, 2011 at 12:49 AM, Jesper Krogh <jesper@krogh.cc> wrote: >>> Ok, it may not work as well with index'es, since having 1% in cache may very >>> well mean that 90% of all requested blocks are there.. for tables in should >>> be more trivial. > >> Tables can have hot spots, too. Consider a table that holds calendar >> reservations. Reservations can be inserted, updated, deleted. But >> typically, the most recent data will be what is most actively >> modified, and the older data will be relatively more (though not >> completely) static, and less frequently accessed. Such examples are >> common in many real-world applications. > > Yes. I'm not convinced that measuring the fraction of a table or index > that's in cache is really going to help us much. Historical cache hit > rates might be useful, but only to the extent that the incoming query > has a similar access pattern to those in the (recent?) past. It's not > an easy problem. > > I almost wonder if we should not try to measure this at all, but instead > let the DBA set a per-table or per-index number to use, analogous to the > override we added recently for column n-distinct statistics ...

I think the challenge there would be how to define the scope of the hot-spot. Is it the last X pages? Last X serial values? Something like correlation?

Hmm... it would be interesting if we had average relation access times for each stats bucket on a per-column basis; that would give the planner a better idea of how much IO overhead there would be for a given WHERE clause.

-- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Jim Nasby wrote: > I think the challenge there would be how to define the scope of the hot-spot. Is it the last X pages? Last X serial values? Something like correlation? > > Hmm... it would be interesting if we had average relation access times for each stats bucket on a per-column basis; that would give the planner a better idea of how much IO overhead there would be for a given WHERE clause

You've already given one reasonable first answer to your question here. If you defined a usage counter for each histogram bucket, and incremented that each time something from it was touched, that could lead to a very rough way to determine access distribution. Compute a ratio of the counts in those buckets, then have an estimate of the total cached percentage; multiplying the two will give you an idea how much of that specific bucket might be in memory. It's not perfect, and you need to incorporate some sort of aging method to it (probably weighted average based), but the basic idea could work.

-- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On Wed, May 18, 2011 at 11:00 PM, Greg Smith <greg@2ndquadrant.com> wrote: > Jim Nasby wrote: >> I think the challenge there would be how to define the scope of the >> hot-spot. Is it the last X pages? Last X serial values? Something like >> correlation? >> >> Hmm... it would be interesting if we had average relation access times for >> each stats bucket on a per-column basis; that would give the planner a >> better idea of how much IO overhead there would be for a given WHERE clause > > You've already given one reasonable first answer to your question here. If > you defined a usage counter for each histogram bucket, and incremented that > each time something from it was touched, that could lead to a very rough way > to determine access distribution. Compute a ratio of the counts in those > buckets, then have an estimate of the total cached percentage; multiplying > the two will give you an idea how much of that specific bucket might be in > memory. It's not perfect, and you need to incorporate some sort of aging > method to it (probably weighted average based), but the basic idea could > work. Maybe I'm missing something here, but it seems like that would be nightmarishly slow. Every time you read a tuple, you'd have to look at every column of the tuple and determine which histogram bucket it was in (or, presumably, which MCV it is, since those aren't included in working out the histogram buckets). That seems like it would slow down a sequential scan by at least 10x. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On May 19, 2011, at 9:53 AM, Robert Haas wrote: > On Wed, May 18, 2011 at 11:00 PM, Greg Smith <greg@2ndquadrant.com> wrote: >> Jim Nasby wrote: >>> I think the challenge there would be how to define the scope of the >>> hot-spot. Is it the last X pages? Last X serial values? Something like >>> correlation? >>> >>> Hmm... it would be interesting if we had average relation access times for >>> each stats bucket on a per-column basis; that would give the planner a >>> better idea of how much IO overhead there would be for a given WHERE clause >> >> You've already given one reasonable first answer to your question here. If >> you defined a usage counter for each histogram bucket, and incremented that >> each time something from it was touched, that could lead to a very rough way >> to determine access distribution. Compute a ratio of the counts in those >> buckets, then have an estimate of the total cached percentage; multiplying >> the two will give you an idea how much of that specific bucket might be in >> memory. It's not perfect, and you need to incorporate some sort of aging >> method to it (probably weighted average based), but the basic idea could >> work. > > Maybe I'm missing something here, but it seems like that would be > nightmarishly slow. Every time you read a tuple, you'd have to look > at every column of the tuple and determine which histogram bucket it > was in (or, presumably, which MCV it is, since those aren't included > in working out the histogram buckets). That seems like it would slow > down a sequential scan by at least 10x.

You definitely couldn't do it real-time. But you might be able to copy the tuple somewhere and have a background process do the analysis.

That said, it might be more productive to know what blocks are available in memory and use correlation to guesstimate whether a particular query will need hot or cold blocks. Or perhaps we create a different structure that lets you track the distribution of each column linearly through the table; something more sophisticated than just using correlation.... perhaps something like indicating which stats bucket was most prevalent in each block/range of blocks in a table. That information would allow you to estimate exactly what blocks in the table you're likely to need...

-- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Thu, May 19, 2011 at 2:39 PM, Jim Nasby <jim@nasby.net> wrote: > On May 19, 2011, at 9:53 AM, Robert Haas wrote: >> On Wed, May 18, 2011 at 11:00 PM, Greg Smith <greg@2ndquadrant.com> wrote: >>> Jim Nasby wrote: >>>> I think the challenge there would be how to define the scope of the >>>> hot-spot. Is it the last X pages? Last X serial values? Something like >>>> correlation? >>>> >>>> Hmm... it would be interesting if we had average relation access times for >>>> each stats bucket on a per-column basis; that would give the planner a >>>> better idea of how much IO overhead there would be for a given WHERE clause >>> >>> You've already given one reasonable first answer to your question here. If >>> you defined a usage counter for each histogram bucket, and incremented that >>> each time something from it was touched, that could lead to a very rough way >>> to determine access distribution. Compute a ratio of the counts in those >>> buckets, then have an estimate of the total cached percentage; multiplying >>> the two will give you an idea how much of that specific bucket might be in >>> memory. It's not perfect, and you need to incorporate some sort of aging >>> method to it (probably weighted average based), but the basic idea could >>> work. >> >> Maybe I'm missing something here, but it seems like that would be >> nightmarishly slow. Every time you read a tuple, you'd have to look >> at every column of the tuple and determine which histogram bucket it >> was in (or, presumably, which MCV it is, since those aren't included >> in working out the histogram buckets). That seems like it would slow >> down a sequential scan by at least 10x. > > You definitely couldn't do it real-time. But you might be able to copy the tuple somewhere and have a background process do the analysis. > > That said, it might be more productive to know what blocks are available in memory and use correlation to guesstimate whether a particular query will need hot or cold blocks. Or perhaps we create a different structure that lets you track the distribution of each column linearly through the table; something more sophisticated than just using correlation.... perhaps something like indicating which stats bucket was most prevalent in each block/range of blocks in a table. That information would allow you to estimate exactly what blocks in the table you're likely to need...

Well, all of that stuff sounds impractically expensive to me... but I just work here.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
2011/5/19 Jim Nasby <jim@nasby.net>: > On May 19, 2011, at 9:53 AM, Robert Haas wrote: >> On Wed, May 18, 2011 at 11:00 PM, Greg Smith <greg@2ndquadrant.com> wrote: >>> Jim Nasby wrote: >>>> I think the challenge there would be how to define the scope of the >>>> hot-spot. Is it the last X pages? Last X serial values? Something like >>>> correlation? >>>> >>>> Hmm... it would be interesting if we had average relation access times for >>>> each stats bucket on a per-column basis; that would give the planner a >>>> better idea of how much IO overhead there would be for a given WHERE clause >>> >>> You've already given one reasonable first answer to your question here. If >>> you defined a usage counter for each histogram bucket, and incremented that >>> each time something from it was touched, that could lead to a very rough way >>> to determine access distribution. Compute a ratio of the counts in those >>> buckets, then have an estimate of the total cached percentage; multiplying >>> the two will give you an idea how much of that specific bucket might be in >>> memory. It's not perfect, and you need to incorporate some sort of aging >>> method to it (probably weighted average based), but the basic idea could >>> work. >> >> Maybe I'm missing something here, but it seems like that would be >> nightmarishly slow. Every time you read a tuple, you'd have to look >> at every column of the tuple and determine which histogram bucket it >> was in (or, presumably, which MCV it is, since those aren't included >> in working out the histogram buckets). That seems like it would slow >> down a sequential scan by at least 10x. > > You definitely couldn't do it real-time. But you might be able to copy the tuple somewhere and have a background process do the analysis. > > That said, it might be more productive to know what blocks are available in memory and use correlation to guesstimate whether a particular query will need hot or cold blocks. Or perhaps we create a different structure that lets you track the distribution of each column linearly through the table; something more sophisticated than just using correlation.... perhaps something like indicating which stats bucket was most prevalent in each block/range of blocks in a table. That information would allow you to estimate exactly what blocks in the table you're likely to need...

Those are very good ideas I would keep in mind for vacuum/checkpoint tasks: if you are able to know the hot and cold data, you can then order it within the segments of the relation. But making it work at the planner level looks hard. I am not opposed to the idea, but I have no idea how to do it right now.

> -- > Jim C. Nasby, Database Architect jim@nasby.net > 512.569.9461 (cell) http://jim.nasby.net

-- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
> Well, all of that stuff sounds impractically expensive to me... but I > just work here. I'll point out that the simple version, which just checks for hot tables and indexes, would improve estimates greatly and be a LOT less complicated than these proposals. Certainly having some form of block-based or range-based stats would be better, but it also sounds hard enough to maybe never get done. Having user-accessible "hot" stats would also be useful to DBAs. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Mon, May 23, 2011 at 3:08 PM, Josh Berkus <josh@agliodbs.com> wrote: > >> Well, all of that stuff sounds impractically expensive to me... but I >> just work here. > > I'll point out that the simple version, which just checks for hot tables > and indexes, would improve estimates greatly and be a LOT less > complicated than these proposals. I realize I'm sounding like a broken record here, but as far as I can tell there is absolutely zero evidence that that would be better. I'm sure you're in good company thinking so, but the list of things that could skew (or should I say, screw) the estimates is long and painful; and if those estimates are wrong, you'll end up with something that is both worse and less predictable than the status quo. First, I haven't seen a shred of hard evidence that the contents of the buffer cache or OS cache are stable enough to be relied upon, and we've repeatedly discussed workloads where that might not be true. Has anyone done a systematic study of this on a variety real production systems? If so, the results haven't been posted here, at least not that I can recall. Second, even if we were willing to accept that we could obtain relatively stable and accurate measurements of this data, who is to say that basing plans on it would actually result in an improvement in plan quality? That may seem obvious, but I don't think it is. The proposed method is a bit like trying to determine the altitude of a hot air balloon by throwing the ballast over the side and timing how long it takes to hit the ground. Executing plans that are based on the contents of the cache will change the contents of the cache, which will in turn change the plans. The idea that we can know, without any experimentation, how that's going to shake out, seems to me to be an exercise in unjustified optimism of the first order. Sorry to sound grumpy and pessimistic, but I really think we're letting our enthusiasm get way, way ahead of the evidence. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hello.

As for me, all this "hot" business looks rather uncertain and dynamic. Two things that I could directly use right now (and they are needed as a pair) are:

1) Per-table/index/database bufferpools (split shared buffers into parts, and allow specifying which index/table/database goes where)
2) Per-table/index cost settings

If I had this, I could allocate specific bufferpools for tables/indexes that MUST be hot in memory and set low costs for those specific tables.

P.S. A third thing that would be great to have, to accompany these two, is a "load on startup" flag to automatically populate the bufferpools with a fast sequential read, but this can be easily emulated with a statement.

Best regards, Vitalii Tymchyshyn
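The emulation mentioned at the end amounts to forcing a full read of whatever must be hot after a restart; it warms the OS cache and shared_buffers but cannot pin anything there. A sketch (hot_table is a placeholder name):

    SELECT count(*) FROM hot_table;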