Thread: reducing random_page_cost from 4 to 2 to force index scan
Hi,

I am using PostgreSQL 9.0. There is a salutations table with 44 rows, and a contacts table with more than a million rows. The contacts table has a nullable (only 0.002% null) salutation_id column, referencing salutations.id.

With this query:

    SELECT salutations.id, salutations.name, salutations.description,
           EXISTS (
               SELECT 1
               FROM contacts
               WHERE salutations.id = contacts.salutation_id
           ) AS in_use
    FROM salutations

I have to reduce random_page_cost from 4 to 2 to force index scan.

EXPLAIN ANALYSE output with random_page_cost = 4:

 Seq Scan on salutations  (cost=0.00..50.51 rows=44 width=229) (actual time=0.188..3844.037 rows=44 loops=1)
   SubPlan 1
     ->  Seq Scan on contacts  (cost=0.00..64578.41 rows=57906 width=0) (actual time=87.358..87.358 rows=1 loops=44)
           Filter: ($0 = salutation_id)
 Total runtime: 3844.113 ms

EXPLAIN ANALYSE output with random_page_cost = 4, enable_seqscan = 0:

 Seq Scan on salutations  (cost=10000000000.00..10000000095.42 rows=44 width=229) (actual time=0.053..0.542 rows=44 loops=1)
   SubPlan 1
     ->  Index Scan using ix_contacts_salutation_id on contacts  (cost=0.00..123682.07 rows=57906 width=0) (actual time=0.011..0.011 rows=1 loops=44)
           Index Cond: ($0 = salutation_id)
 Total runtime: 0.592 ms

EXPLAIN ANALYSE output with random_page_cost = 2:

 Seq Scan on salutations  (cost=0.00..48.87 rows=44 width=229) (actual time=0.053..0.541 rows=44 loops=1)
   SubPlan 1
     ->  Index Scan using ix_contacts_salutation_id on contacts  (cost=0.00..62423.45 rows=57906 width=0) (actual time=0.011..0.011 rows=1 loops=44)
           Index Cond: ($0 = salutation_id)
 Total runtime: 0.594 ms

So, index scan wins by a very small margin over sequential scan after the tuning. I am a bit puzzled because index scan is more than 3000 times faster in this case, but the estimated costs are about the same. Did I do something wrong?

Regards,
Yap
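For anyone wanting to reproduce the three plans above, all of the settings involved can be changed per transaction, so nothing needs to be touched in postgresql.conf. A sketch, reusing the table and column names from the message above:

    BEGIN;
    -- plan 1: defaults (random_page_cost = 4)
    EXPLAIN ANALYSE
    SELECT s.id, s.name, s.description,
           EXISTS (SELECT 1 FROM contacts c
                   WHERE c.salutation_id = s.id) AS in_use
    FROM salutations s;

    -- plan 2: random_page_cost = 4, enable_seqscan = 0
    SET LOCAL enable_seqscan = off;
    EXPLAIN ANALYSE
    SELECT s.id, s.name, s.description,
           EXISTS (SELECT 1 FROM contacts c
                   WHERE c.salutation_id = s.id) AS in_use
    FROM salutations s;

    -- plan 3: random_page_cost = 2
    SET LOCAL enable_seqscan = on;
    SET LOCAL random_page_cost = 2;
    EXPLAIN ANALYSE
    SELECT s.id, s.name, s.description,
           EXISTS (SELECT 1 FROM contacts c
                   WHERE c.salutation_id = s.id) AS in_use
    FROM salutations s;
    ROLLBACK;   -- the SET LOCAL changes disappear here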
Sok Ann Yap <sokann@gmail.com> wrote: > So, index scan wins by a very small margin over sequential scan > after the tuning. I am a bit puzzled because index scan is more > than 3000 times faster in this case, but the estimated costs are > about the same. Did I do something wrong? Tuning is generally needed to get best performance from PostgreSQL. Needing to reduce random_page_cost is not unusual in situations where a good portion of the active data is in cache (between shared_buffers and the OS cache). Please show us your overall configuration and give a description of the hardware (how many of what kind of cores, how much RAM, what sort of storage system). The configuration part can be obtained by running the query on this page and pasting the result into your next post: http://wiki.postgresql.org/wiki/Server_Configuration There are probably some other configuration adjustments you could do to ensure that good plans are chosen. -Kevin
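The query on that wiki page is roughly the following (quoted from memory, so treat it as an approximation and prefer the version on the wiki itself); it lists every setting that differs from the built-in defaults:

    SELECT name, current_setting(name)
    FROM pg_settings
    WHERE source NOT IN ('default', 'override');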
On Tue, Apr 26, 2011 at 4:37 PM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Sok Ann Yap <sokann@gmail.com> wrote: > >> So, index scan wins by a very small margin over sequential scan >> after the tuning. I am a bit puzzled because index scan is more >> than 3000 times faster in this case, but the estimated costs are >> about the same. Did I do something wrong? > > Tuning is generally needed to get best performance from PostgreSQL. > Needing to reduce random_page_cost is not unusual in situations > where a good portion of the active data is in cache (between > shared_buffers and the OS cache). Please show us your overall > configuration and give a description of the hardware (how many of > what kind of cores, how much RAM, what sort of storage system). The > configuration part can be obtained by running the query on this page > and pasting the result into your next post: > > http://wiki.postgresql.org/wiki/Server_Configuration > > There are probably some other configuration adjustments you could do > to ensure that good plans are chosen. The very first thing to check is effective_cache_size and to set it to a reasonable value. merlin
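effective_cache_size is only a planner estimate of how much data the OS cache plus shared_buffers can hold, so it is cheap to experiment with per session. A sketch; the 3GB figure is just an example, not a recommendation:

    SHOW effective_cache_size;
    SET effective_cache_size = '3GB';   -- affects only this session
    -- To make it permanent, set effective_cache_size in postgresql.conf
    -- (a common rule of thumb is 50-75% of RAM) and reload.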
On Wed, Apr 27, 2011 at 3:04 AM, Merlin Moncure <mmoncure@gmail.com> wrote: > The very first thing to check is effective_cache_size and to set it to > a reasonable value. > The problem there, I think, is that the planner is doing a full join, instead of a semi-join.
On Wed, Apr 27, 2011 at 9:22 AM, Claudio Freire <klaussfreire@gmail.com> wrote: > The problem there, I think, is that the planner is doing a full join, > instead of a semi-join. Or, rather, computing cost as if it was a full join. I'm not sure why.
On Wed, Apr 27, 2011 at 5:37 AM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote: > Sok Ann Yap <sokann@gmail.com> wrote: > >> So, index scan wins by a very small margin over sequential scan >> after the tuning. I am a bit puzzled because index scan is more >> than 3000 times faster in this case, but the estimated costs are >> about the same. Did I do something wrong? > > Tuning is generally needed to get best performance from PostgreSQL. > Needing to reduce random_page_cost is not unusual in situations > where a good portion of the active data is in cache (between > shared_buffers and the OS cache). Please show us your overall > configuration and give a description of the hardware (how many of > what kind of cores, how much RAM, what sort of storage system). The > configuration part can be obtained by running the query on this page > and pasting the result into your next post: > > http://wiki.postgresql.org/wiki/Server_Configuration > > There are probably some other configuration adjustments you could do > to ensure that good plans are chosen. > > -Kevin

Here's the configuration (this is just a low end laptop):

             name             | current_setting
 -----------------------------+-----------------------------------------------
  version                     | PostgreSQL 9.0.4 on x86_64-pc-linux-gnu, compiled by GCC x86_64-pc-linux-gnu-gcc (Gentoo 4.5.2 p1.0, pie-0.4.5) 4.5.2, 64-bit
  checkpoint_segments         | 16
  default_statistics_target   | 10000
  effective_cache_size        | 512MB
  lc_collate                  | en_US.UTF-8
  lc_ctype                    | en_US.UTF-8
  listen_addresses            | *
  log_destination             | syslog
  log_min_duration_statement  | 0
  maintenance_work_mem        | 256MB
  max_connections             | 100
  max_stack_depth             | 2MB
  port                        | 5432
  random_page_cost            | 4
  server_encoding             | UTF8
  shared_buffers              | 256MB
  silent_mode                 | on
  TimeZone                    | Asia/Kuala_Lumpur
  wal_buffers                 | 1MB
  work_mem                    | 32MB
 (20 rows)

The thing is, the query I posted was fairly simple (I think), and PostgreSQL should be able to choose the 3000+ times faster index scan with the default random_page_cost of 4. If I need to reduce it to 2 when using a 5.4k rpm slow disk, what is random_page_cost = 4 good for?

(Sorry for the double message, I forgot to CC the list in the first one)
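Worth noting as an alternative to lowering random_page_cost globally: 9.0 also allows the page-cost settings to be overridden per tablespace, so the cheaper random-I/O assumption can be limited to storage that really is fast or well cached. A sketch (the tablespace name here is made up):

    -- only for relations stored in this (hypothetical) tablespace:
    ALTER TABLESPACE fast_disk SET (random_page_cost = 2, seq_page_cost = 1);

    -- or override it for a single session while testing:
    SET random_page_cost = 2;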
On Tue, Apr 26, 2011 at 9:04 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > On Tue, Apr 26, 2011 at 4:37 PM, Kevin Grittner > <Kevin.Grittner@wicourts.gov> wrote: >> Sok Ann Yap <sokann@gmail.com> wrote: >> >>> So, index scan wins by a very small margin over sequential scan >>> after the tuning. I am a bit puzzled because index scan is more >>> than 3000 times faster in this case, but the estimated costs are >>> about the same. Did I do something wrong? >> >> Tuning is generally needed to get best performance from PostgreSQL. >> Needing to reduce random_page_cost is not unusual in situations >> where a good portion of the active data is in cache (between >> shared_buffers and the OS cache). Please show us your overall >> configuration and give a description of the hardware (how many of >> what kind of cores, how much RAM, what sort of storage system). The >> configuration part can be obtained by running the query on this page >> and pasting the result into your next post: >> >> http://wiki.postgresql.org/wiki/Server_Configuration >> >> There are probably some other configuration adjustments you could do >> to ensure that good plans are chosen. > > The very first thing to check is effective_cache_size and to set it to > a reasonable value.

Actually, effective_cache_size has no impact on costing except when planning a nested loop with inner index scan. So, a query against a single table can never benefit from changing that setting. Kevin's suggestion of adjusting seq_page_cost and random_page_cost is the way to go.

We've talked in the past (and I still think it's a good idea, but haven't gotten around to doing anything about it) about adjusting the planner to attribute to each relation the percentage of its pages which we believe we'll find in cache. Although many complicated ideas for determining that percentage have been proposed, my favorite one is fairly simple: assume that small relations will be mostly or entirely cached, and that big ones won't be. Allow the administrator to override the result on a per-relation basis. It's difficult to imagine a situation where the planner should assume that a relation with only a handful of pages isn't going to be cached. Even if it isn't, as soon as someone begins accessing it, it will be.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
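The cached percentage Robert describes is not something the planner collects today, but the shared_buffers part of it can at least be inspected with the contrib module pg_buffercache. A sketch (it says nothing about the OS cache, and it assumes the default 8kB block size):

    -- rough view of how much of each relation currently sits in shared_buffers
    SELECT c.relname,
           count(*)                        AS buffered_pages,
           pg_relation_size(c.oid) / 8192  AS total_pages
    FROM pg_buffercache b
    JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
    WHERE b.reldatabase = (SELECT oid FROM pg_database
                           WHERE datname = current_database())
    GROUP BY c.relname, c.oid
    ORDER BY buffered_pages DESC
    LIMIT 20;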
Robert Haas <robertmhaas@gmail.com> wrote: > We've talked in the past (and I still think it's a good idea, but > haven't gotten around to doing anything about it) about adjusting > the planner to attribute to each relation the percentage of its > pages which we believe we'll find in cache. Although many > complicated ideas for determining that percentage have been > proposed, my favorite one is fairly simple: assume that small > relations will be mostly or entirely cached, and that big ones > won't be. Allow the administrator to override the result on a > per-relation basis. It's difficult to imagine a situation where > the planner should assume that a relation with only handful of > pages isn't going to be cached. Even if it isn't, as soon as > someone begins accessing it, it will be. Simple as the heuristic is, I bet it would be effective. While one can easily construct a synthetic case where it falls down, the ones I can think of aren't all that common, and you are suggesting an override mechanism. -Kevin
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Apr 26, 2011 at 9:04 PM, Merlin Moncure <mmoncure@gmail.com> wrote: >> The very first thing to check is effective_cache_size and to set it to >> a reasonable value. > Actually, effective_cache_size has no impact on costing except when > planning a nested loop with inner index scan. So, a query against a > single table can never benefit from changing that setting. That's flat out wrong. It does affect the cost estimate for plain indexscan (and bitmap indexscan) plans. regards, tom lane
On Fri, May 13, 2011 at 3:20 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Tue, Apr 26, 2011 at 9:04 PM, Merlin Moncure <mmoncure@gmail.com> wrote: >>> The very first thing to check is effective_cache_size and to set it to >>> a reasonable value. > >> Actually, effective_cache_size has no impact on costing except when >> planning a nested loop with inner index scan. So, a query against a >> single table can never benefit from changing that setting. > > That's flat out wrong. It does affect the cost estimate for plain > indexscan (and bitmap indexscan) plans.

<rereads code> OK, I agree. I obviously misinterpreted this code the last time I read it.

I guess maybe the reason why it didn't matter for the OP is that - if the size of the index in pages is smaller than the pro-rated fraction of effective_cache_size allowed to the index - then the exact value doesn't affect the answer.

I apparently need to study this code more.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
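The effect Tom describes can be seen from the estimated costs alone, since effective_cache_size feeds the index-scan fetch estimate; whether and how far the numbers move depends on the index and table sizes. A quick way to check for a given query (plain EXPLAIN, nothing is executed; the literal 42 is just an example value):

    SET effective_cache_size = '128MB';
    EXPLAIN SELECT * FROM contacts WHERE salutation_id = 42;

    SET effective_cache_size = '4GB';
    EXPLAIN SELECT * FROM contacts WHERE salutation_id = 42;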
> I guess maybe the reason why it didn't matter for the OP is that - if > the size of the index page in pages is smaller than the pro-rated > fraction of effective_cache_size allowed to the index - then the exact > value doesn't affect the answer. > > I apparently need to study this code more.

FWIW: random_page_cost is meant to be the ratio between the cost of looking up a single row as an index lookup, and the cost of looking up that same row as part of a larger sequential scan. For specific storage, that coefficient should be roughly the same regardless of the table size. So if your plan for optimization involves manipulating RPC for anything other than a change of storage, you're Doing It Wrong.

Instead, we should be fixing the formulas these are based on and leaving RPC alone.

For any data page, there are actually four costs associated with each tuple lookup, per:

    in-memory/seq    | on disk/seq
    -----------------+----------------
    in-memory/random | on disk/random

(yes, there's actually more for bitmapscan etc. but the example holds)

For any given tuple lookup, then, you can assign a cost based on where you think that tuple falls in that quadrant map. Since this is all probability-based, you'd be assigning a cost as a mixed % of in-memory and on-disk costs. Improvements in accuracy of this formula would come through improvements in accuracy in predicting if a particular data page will be in memory.

This is what the combination of random_page_cost and effective_cache_size ought to supply, but I don't think it does, quite.

-- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
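Spelling out that "mixed %" as a formula (my notation, not the planner's actual code): for an estimated probability p that a given page is already in memory,

    expected_page_cost = p * memory_page_cost + (1 - p) * disk_page_cost

with separate disk_page_cost values for sequential vs. random access, and p ideally derived from effective_cache_size plus per-relation cache statistics rather than guessed globally.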
2011/5/13 Josh Berkus <josh@agliodbs.com>: > >> I guess maybe the reason why it didn't matter for the OP is that - if >> the size of the index page in pages is smaller than the pro-rated >> fraction of effective_cache_size allowed to the index - then the exact >> value doesn't affect the answer. >> >> I apparently need to study this code more. > > FWIW: random_page_cost is meant to be the ratio between the cost of > looking up a single row as and index lookup, and the cost of looking up > that same row as part of a larger sequential scan. For specific > storage, that coefficient should be roughly the same regardless of the > table size. So if your plan for optimization involves manipulating RPC > for anything other than a change of storage, you're Doing It Wrong. > > Instead, we should be fixing the formulas these are based on and leaving > RPC alone. > > For any data page, there are actually four costs associated with each > tuple lookup, per: > > in-memory/seq | on disk/seq > ----------------+---------------- > in-memory/random| on disk/random

It is also missing some of the more theoretical costs, like sort_page/temp_page: those are based on a ratio of seq_page_cost and random_page_cost, or on plain seq_page_cost (when working out of work_mem). Memory access is accounted for with something like 0.1 in some places, AFAIR. (And random vs. sequential memory access is the same at the level of estimation we do.)

> > (yes, there's actually more for bitmapscan etc. but the example holds)

(If I read the sources correctly, this one uses a linear approach to weight the cost between random_page_cost and seq_page_cost for the heap page fetches, plus the Mackert and Lohman formula, if needed, in its best usage: predicting what should be in cache *because* of the current query execution, not because of the current status of the page cache.)

> > For any given tuple lookup, then, you can assign a cost based on where > you think that tuple falls in that quadrant map. Since this is all > probability-based, you'd be assigning a cost as a mixed % of in-memory > and on-disk costs. Improvements in accuracy of this formula would come > through improvements in accuracy in predicting if a particular data page > will be in memory. > > This is what the combination of random_page_cost and > effective_cache_size ought to supply, but I don't think it does, quite. > > -- > Josh Berkus > PostgreSQL Experts Inc. > http://pgexperts.com

-- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
On Fri, May 13, 2011 at 4:13 PM, Josh Berkus <josh@agliodbs.com> wrote: > Instead, we should be fixing the formulas these are based on and leaving > RPC alone. > > For any data page, there are actually four costs associated with each > tuple lookup, per:

All true. I suspect that in practice the difference between random and sequential memory page costs is small enough to be ignorable, although of course I might be wrong. I've never seen a database that was fully cached in memory where it was necessary to set random_page_cost>seq_page_cost to get good plans -- no doubt partly because even if the pages were consecutive on disk, there's no reason to suppose they would be so in memory, and we certainly wouldn't know one way or the other at planning time. But I agree we should add a cached_page_cost as part of all this.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, May 14, 2011 at 3:13 AM, Josh Berkus <josh@agliodbs.com> wrote: > This is what the combination of random_page_cost and > effective_cache_size ought to supply, but I don't think it does, quite. I think random_page_cost causes problems because I need to combine disk random access time, which I can measure, with a guesstimate of the disk cache hit rate. It would be lovely if these two variables were separate. It would be even lovelier if the disk cache hit rate could be probed at run time and didn't need setting at all, but I suspect that isn't possible on some platforms. -- Stuart Bishop <stuart@stuartbishop.net> http://www.stuartbishop.net/
Robert, > All true. I suspect that in practice the different between random and > sequential memory page costs is small enough to be ignorable, although > of course I might be wrong.

This hasn't been my experience, although I have not carefully measured it. In fact, there's good reason to suppose that, if you were selecting 50% or more of a table, sequential access would still be faster even for an entirely in-memory table.

As a parallel to our development, Redis used to store all data as linked lists, making every object lookup effectively a random lookup. They found that even with a database which is pinned in memory, creating a data page structure (they call it "ziplists") and supporting sequential scans was up to 10X faster for large lists.

So I would assume that there is still a coefficient difference between seeks and scans in memory until proven otherwise.

-- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
Stuart, > I think random_page_cost causes problems because I need to combine > disk random access time, which I can measure, with a guesstimate of > the disk cache hit rate. See, that's wrong. Disk cache hit rate is what effective_cache_size (ECS) is for. Really, there's several factors which should be going into the planner's estimates to determine a probability of a table being cached: * ratio between total database size and ECS * ratio between table size and ECS * ratio between index size and ECS * whether the table is "hot" or not * whether the index is "hot" or not The last two statistics are critically important for good estimation, and they are not things we currently collect. By "hot" I mean: is this a relation which is accessed several times per minute/hour and is thus likely to be in the cache when we need it? Currently, we have no way of knowing that. Without "hot" statistics, we're left with guessing based on size, which results in bad plans for small tables in large databases which are accessed infrequently. Mind you, for large tables it would be even better to go beyond that and actually have some knowledge of which disk pages might be in cache. However, I think that's beyond feasibility for current software/OSes. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Sun, May 15, 2011 at 2:08 PM, Josh Berkus <josh@agliodbs.com> wrote: >> All true. I suspect that in practice the different between random and >> sequential memory page costs is small enough to be ignorable, although >> of course I might be wrong. > > This hasn't been my experience, although I have not carefully measured > it. In fact, there's good reason to suppose that, if you were selecting > 50% of more of a table, sequential access would still be faster even for > an entirely in-memory table. > > As a parallel to our development, Redis used to store all data as linked > lists, making every object lookup effectively a random lookup. They > found that even with a database which is pinned in memory, creating a > data page structure (they call it "ziplists") and supporting sequential > scans was up to 10X faster for large lists. > > So I would assume that there is still a coefficient difference between > seeks and scans in memory until proven otherwise. Well, anything's possible. But I wonder whether the effects you are describing might result from a reduction in the *number* of pages accessed rather than a change in the access pattern. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
2011/5/15 Josh Berkus <josh@agliodbs.com>: > Stuart, > >> I think random_page_cost causes problems because I need to combine >> disk random access time, which I can measure, with a guesstimate of >> the disk cache hit rate. > > See, that's wrong. Disk cache hit rate is what effective_cache_size > (ECS) is for. > > Really, there's several factors which should be going into the planner's > estimates to determine a probability of a table being cached: > > * ratio between total database size and ECS > * ratio between table size and ECS > * ratio between index size and ECS > * whether the table is "hot" or not > * whether the index is "hot" or not > > The last two statistics are critically important for good estimation, > and they are not things we currently collect. By "hot" I mean: is this > a relation which is accessed several times per minute/hour and is thus > likely to be in the cache when we need it? Currently, we have no way of > knowing that. > > Without "hot" statistics, we're left with guessing based on size, which > results in bad plans for small tables in large databases which are > accessed infrequently. > > Mind you, for large tables it would be even better to go beyond that and > actually have some knowledge of which

*which*? Do you mean 'area' of the tables?

> disk pages might be in cache. > However, I think that's beyond feasibility for current software/OSes.

Maybe not :) mincore is available in many OSes, and Windows has options to get those stats too.

> -- > Josh Berkus > PostgreSQL Experts Inc. > http://pgexperts.com

-- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
On 16/05/11 05:45, Cédric Villemain wrote: > 2011/5/15 Josh Berkus <josh@agliodbs.com>: >> disk pages might be in cache. >> However, I think that's beyond feasibility for current software/OSes. > > maybe not :) mincore is available in many OSes, and windows have > options to get those stats too. AFAIK, mincore() is only useful for mmap()ed files and for finding out if it's safe to access certain blocks of memory w/o risking triggering heavy swapping. It doesn't provide any visibility into the OS's block device / file system caches; you can't ask it "how much of this file is cached in RAM" or "is this range of blocks in this file cached in RAM". Even if you could, it's hard to see how an approach that relied on asking the OS through system calls about the cache state when planning every query could be fast enough to be viable. -- Craig Ringer
Craig Ringer wrote: > AFAIK, mincore() is only useful for mmap()ed files and for finding out > if it's safe to access certain blocks of memory w/o risking triggering > heavy swapping. > > It doesn't provide any visibility into the OS's block device / file > system caches; you can't ask it "how much of this file is cached in RAM" > or "is this range of blocks in this file cached in RAM".

You should try out pgfincore if you think this can't be done!

> Even if you could, it's hard to see how an approach that relied on > asking the OS through system calls about the cache state when planning > every query could be fast enough to be viable.

You can't do it in real-time. You don't necessarily want that even if it were possible; too many possibilities for nasty feedback loops where you always favor using some marginal index that happens to be in memory, and therefore never page in things that would be faster once they're read. The only reasonable implementation that avoids completely unstable plans is to scan this data periodically and save some statistics on it--the way ANALYZE does--and then have that turn into a planner input.

The related secondary idea of just making assumptions about small tables/indexes, too, may be a useful heuristic to layer on top of this. There's a pile of ideas here that all seem reasonable both in terms of modeling real-world behavior and as things that could be inserted into the optimizer. As usual, I suspect that work needs to be followed by a giant testing exercise though.

-- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
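For reference, pgfincore uses mincore() to report how much of a relation's underlying files the OS currently has in its page cache. Usage is roughly the following, if I remember its interface correctly (treat the call as an assumption and check the pgfincore documentation for the real signature):

    -- assumed interface: one row per relation segment, including a count
    -- of OS pages currently resident in the kernel page cache
    SELECT * FROM pgfincore('contacts');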
On 2011-05-16 03:18, Greg Smith wrote:
> You can't do it in real-time. You don't necessarily want that to
> even if it were possible; too many possibilities for nasty feedback
> loops where you always favor using some marginal index that happens
> to be in memory, and therefore never page in things that would be
> faster once they're read. The only reasonable implementation that
> avoids completely unstable plans is to scan this data periodically
> and save some statistics on it--the way ANALYZE does--and then have
> that turn into a planner input.
Would that be feasible? Have a process collect the data every now and then,
probably picking some conservative-average function, and feed
it into pg_stats for each index/relation?
To me it seems like a robust and fairly trivial way to get better numbers. The
fear is that the OS-cache is too much in flux to get any stable numbers out
of it.
--
Jesper
On 2011-05-16 06:41, Jesper Krogh wrote: > On 2011-05-16 03:18, Greg Smith wrote: >> You can't do it in real-time. You don't necessarily want that to >> even if it were possible; too many possibilities for nasty feedback >> loops where you always favor using some marginal index that happens >> to be in memory, and therefore never page in things that would be >> faster once they're read. The only reasonable implementation that >> avoids completely unstable plans is to scan this data periodically >> and save some statistics on it--the way ANALYZE does--and then have >> that turn into a planner input. > > Would that be feasible? Have process collecting the data every > now-and-then > probably picking some conservative-average function and feeding > it into pg_stats for each index/relation? > > To me it seems like a robust and fairly trivial way to to get better > numbers. The > fear is that the OS-cache is too much in flux to get any stable > numbers out > of it.

Ok, it may not work as well with indexes, since having 1% in cache may very well mean that 90% of all requested blocks are there.. for tables it should be more trivial.

-- Jesper
On Mon, May 16, 2011 at 12:49 AM, Jesper Krogh <jesper@krogh.cc> wrote: >> To me it seems like a robust and fairly trivial way to to get better >> numbers. The >> fear is that the OS-cache is too much in flux to get any stable numbers >> out >> of it. > > Ok, it may not work as well with index'es, since having 1% in cache may very > well mean that 90% of all requested blocks are there.. for tables in should > be more trivial. Tables can have hot spots, too. Consider a table that holds calendar reservations. Reservations can be inserted, updated, deleted. But typically, the most recent data will be what is most actively modified, and the older data will be relatively more (though not completely) static, and less frequently accessed. Such examples are common in many real-world applications. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, May 16, 2011 at 12:49 AM, Jesper Krogh <jesper@krogh.cc> wrote: >> Ok, it may not work as well with index'es, since having 1% in cache may very >> well mean that 90% of all requested blocks are there.. for tables in should >> be more trivial. > Tables can have hot spots, too. Consider a table that holds calendar > reservations. Reservations can be inserted, updated, deleted. But > typically, the most recent data will be what is most actively > modified, and the older data will be relatively more (though not > completely) static, and less frequently accessed. Such examples are > common in many real-world applications. Yes. I'm not convinced that measuring the fraction of a table or index that's in cache is really going to help us much. Historical cache hit rates might be useful, but only to the extent that the incoming query has a similar access pattern to those in the (recent?) past. It's not an easy problem. I almost wonder if we should not try to measure this at all, but instead let the DBA set a per-table or per-index number to use, analogous to the override we added recently for column n-distinct statistics ... regards, tom lane
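For comparison, the recently added n-distinct override Tom mentions looks like the first statement below; the second is NOT real syntax, it is only meant to illustrate what an analogous per-relation cache setting might look like:

    -- the existing per-column override (takes effect at the next ANALYZE):
    ALTER TABLE contacts ALTER COLUMN salutation_id SET (n_distinct = 44);

    -- a purely hypothetical per-relation analogue (does not exist):
    -- ALTER TABLE contacts SET (assumed_cached_fraction = 0.9);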
On Sun, May 15, 2011 at 9:49 PM, Jesper Krogh <jesper@krogh.cc> wrote: > > Ok, it may not work as well with index'es, since having 1% in cache may very > well mean that 90% of all requested blocks are there.. for tables in should > be more trivial. Why would the index have a meaningful hot-spot unless the underlying table had one as well? (Of course the root block will be a hot-spot, but certainly not 90% of all requests) Cheers, Jeff
Jeff Janes <jeff.janes@gmail.com> writes: > On Sun, May 15, 2011 at 9:49 PM, Jesper Krogh <jesper@krogh.cc> wrote: >> Ok, it may not work as well with index'es, since having 1% in cache may very >> well mean that 90% of all requested blocks are there.. for tables in should >> be more trivial. > Why would the index have a meaningful hot-spot unless the underlying > table had one as well? (Of course the root block will be a hot-spot, > but certainly not 90% of all requests) The accesses to an index are far more likely to be clustered than the accesses to the underlying table, because the index is organized in a way that's application-meaningful and the table not so much. Continuing the earlier example of a timestamp column, accesses might preferentially hit near the right end of the index while the underlying rows are all over the table. IOW, hot spots measured at the row level and hot spots measured at the page level could very easily be different between table and index. regards, tom lane
> The accesses to an index are far more likely to be clustered than the > accesses to the underlying table, because the index is organized in a > way that's application-meaningful and the table not so much.

So, to clarify, are you saying that if a query were actually requesting rows uniformly at random, then there would be no reason to suspect that index accesses would have hotspots? It seems like the index structure (i.e., the top node in b-trees) could also get in the way.

Best, Nathan
Nathan Boley <npboley@gmail.com> writes: >> The accesses to an index are far more likely to be clustered than the >> accesses to the underlying table, because the index is organized in a >> way that's application-meaningful and the table not so much. > So, to clarify, are you saying that if query were actually requesting > rows uniformly random, then there would be no reason to suspect that > index accesses would have hotspots? It seems like the index structure > ( ie, the top node in b-trees ) could also get in the way. The upper nodes would tend to stay in cache, yes, but we already assume that in the index access cost model, in a kind of indirect way: the model only considers leaf-page accesses in the first place ;-) regards, tom lane
On May 16, 2011, at 10:46 AM, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Mon, May 16, 2011 at 12:49 AM, Jesper Krogh <jesper@krogh.cc> wrote: >>> Ok, it may not work as well with index'es, since having 1% in cache may very >>> well mean that 90% of all requested blocks are there.. for tables in should >>> be more trivial. > >> Tables can have hot spots, too. Consider a table that holds calendar >> reservations. Reservations can be inserted, updated, deleted. But >> typically, the most recent data will be what is most actively >> modified, and the older data will be relatively more (though not >> completely) static, and less frequently accessed. Such examples are >> common in many real-world applications. > > Yes. I'm not convinced that measuring the fraction of a table or index > that's in cache is really going to help us much. Historical cache hit > rates might be useful, but only to the extent that the incoming query > has a similar access pattern to those in the (recent?) past. It's not > an easy problem. > > I almost wonder if we should not try to measure this at all, but instead > let the DBA set a per-table or per-index number to use, analogous to the > override we added recently for column n-distinct statistics ...

I think the challenge there would be how to define the scope of the hot-spot. Is it the last X pages? Last X serial values? Something like correlation?

Hmm... it would be interesting if we had average relation access times for each stats bucket on a per-column basis; that would give the planner a better idea of how much IO overhead there would be for a given WHERE clause.

-- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
Jim Nasby wrote: > I think the challenge there would be how to define the scope of the hot-spot. Is it the last X pages? Last X serial values? Something like correlation? > > Hmm... it would be interesting if we had average relation access times for each stats bucket on a per-column basis; that would give the planner a better idea of how much IO overhead there would be for a given WHERE clause

You've already given one reasonable first answer to your question here. If you defined a usage counter for each histogram bucket, and incremented that each time something from it was touched, that could lead to a very rough way to determine access distribution. Compute a ratio of the counts in those buckets, then have an estimate of the total cached percentage; multiplying the two will give you an idea how much of that specific bucket might be in memory. It's not perfect, and you need to incorporate some sort of aging method to it (probably weighted average based), but the basic idea could work.

-- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On Wed, May 18, 2011 at 11:00 PM, Greg Smith <greg@2ndquadrant.com> wrote: > Jim Nasby wrote: >> I think the challenge there would be how to define the scope of the >> hot-spot. Is it the last X pages? Last X serial values? Something like >> correlation? >> >> Hmm... it would be interesting if we had average relation access times for >> each stats bucket on a per-column basis; that would give the planner a >> better idea of how much IO overhead there would be for a given WHERE clause > > You've already given one reasonable first answer to your question here. If > you defined a usage counter for each histogram bucket, and incremented that > each time something from it was touched, that could lead to a very rough way > to determine access distribution. Compute a ratio of the counts in those > buckets, then have an estimate of the total cached percentage; multiplying > the two will give you an idea how much of that specific bucket might be in > memory. It's not perfect, and you need to incorporate some sort of aging > method to it (probably weighted average based), but the basic idea could > work. Maybe I'm missing something here, but it seems like that would be nightmarishly slow. Every time you read a tuple, you'd have to look at every column of the tuple and determine which histogram bucket it was in (or, presumably, which MCV it is, since those aren't included in working out the histogram buckets). That seems like it would slow down a sequential scan by at least 10x. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On May 19, 2011, at 9:53 AM, Robert Haas wrote: > On Wed, May 18, 2011 at 11:00 PM, Greg Smith <greg@2ndquadrant.com> wrote: >> Jim Nasby wrote: >>> I think the challenge there would be how to define the scope of the >>> hot-spot. Is it the last X pages? Last X serial values? Something like >>> correlation? >>> >>> Hmm... it would be interesting if we had average relation access times for >>> each stats bucket on a per-column basis; that would give the planner a >>> better idea of how much IO overhead there would be for a given WHERE clause >> >> You've already given one reasonable first answer to your question here. If >> you defined a usage counter for each histogram bucket, and incremented that >> each time something from it was touched, that could lead to a very rough way >> to determine access distribution. Compute a ratio of the counts in those >> buckets, then have an estimate of the total cached percentage; multiplying >> the two will give you an idea how much of that specific bucket might be in >> memory. It's not perfect, and you need to incorporate some sort of aging >> method to it (probably weighted average based), but the basic idea could >> work. > > Maybe I'm missing something here, but it seems like that would be > nightmarishly slow. Every time you read a tuple, you'd have to look > at every column of the tuple and determine which histogram bucket it > was in (or, presumably, which MCV it is, since those aren't included > in working out the histogram buckets). That seems like it would slow > down a sequential scan by at least 10x.

You definitely couldn't do it real-time. But you might be able to copy the tuple somewhere and have a background process do the analysis.

That said, it might be more productive to know what blocks are available in memory and use correlation to guesstimate whether a particular query will need hot or cold blocks. Or perhaps we create a different structure that lets you track the distribution of each column linearly through the table; something more sophisticated than just using correlation.... perhaps something like indicating which stats bucket was most prevalent in each block/range of blocks in a table. That information would allow you to estimate exactly what blocks in the table you're likely to need...

-- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Thu, May 19, 2011 at 2:39 PM, Jim Nasby <jim@nasby.net> wrote: > On May 19, 2011, at 9:53 AM, Robert Haas wrote: >> On Wed, May 18, 2011 at 11:00 PM, Greg Smith <greg@2ndquadrant.com> wrote: >>> Jim Nasby wrote: >>>> I think the challenge there would be how to define the scope of the >>>> hot-spot. Is it the last X pages? Last X serial values? Something like >>>> correlation? >>>> >>>> Hmm... it would be interesting if we had average relation access times for >>>> each stats bucket on a per-column basis; that would give the planner a >>>> better idea of how much IO overhead there would be for a given WHERE clause >>> >>> You've already given one reasonable first answer to your question here. If >>> you defined a usage counter for each histogram bucket, and incremented that >>> each time something from it was touched, that could lead to a very rough way >>> to determine access distribution. Compute a ratio of the counts in those >>> buckets, then have an estimate of the total cached percentage; multiplying >>> the two will give you an idea how much of that specific bucket might be in >>> memory. It's not perfect, and you need to incorporate some sort of aging >>> method to it (probably weighted average based), but the basic idea could >>> work. >> >> Maybe I'm missing something here, but it seems like that would be >> nightmarishly slow. Every time you read a tuple, you'd have to look >> at every column of the tuple and determine which histogram bucket it >> was in (or, presumably, which MCV it is, since those aren't included >> in working out the histogram buckets). That seems like it would slow >> down a sequential scan by at least 10x. > > You definitely couldn't do it real-time. But you might be able to copy the tuple somewhere and have a background process do the analysis. > > That said, it might be more productive to know what blocks are available in memory and use correlation to guesstimate whether a particular query will need hot or cold blocks. Or perhaps we create a different structure that lets you track the distribution of each column linearly through the table; something more sophisticated than just using correlation.... perhaps something like indicating which stats bucket was most prevalent in each block/range of blocks in a table. That information would allow you to estimate exactly what blocks in the table you're likely to need...

Well, all of that stuff sounds impractically expensive to me... but I just work here.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
2011/5/19 Jim Nasby <jim@nasby.net>: > On May 19, 2011, at 9:53 AM, Robert Haas wrote: >> On Wed, May 18, 2011 at 11:00 PM, Greg Smith <greg@2ndquadrant.com> wrote: >>> Jim Nasby wrote: >>>> I think the challenge there would be how to define the scope of the >>>> hot-spot. Is it the last X pages? Last X serial values? Something like >>>> correlation? >>>> >>>> Hmm... it would be interesting if we had average relation access times for >>>> each stats bucket on a per-column basis; that would give the planner a >>>> better idea of how much IO overhead there would be for a given WHERE clause >>> >>> You've already given one reasonable first answer to your question here. If >>> you defined a usage counter for each histogram bucket, and incremented that >>> each time something from it was touched, that could lead to a very rough way >>> to determine access distribution. Compute a ratio of the counts in those >>> buckets, then have an estimate of the total cached percentage; multiplying >>> the two will give you an idea how much of that specific bucket might be in >>> memory. It's not perfect, and you need to incorporate some sort of aging >>> method to it (probably weighted average based), but the basic idea could >>> work. >> >> Maybe I'm missing something here, but it seems like that would be >> nightmarishly slow. Every time you read a tuple, you'd have to look >> at every column of the tuple and determine which histogram bucket it >> was in (or, presumably, which MCV it is, since those aren't included >> in working out the histogram buckets). That seems like it would slow >> down a sequential scan by at least 10x. > > You definitely couldn't do it real-time. But you might be able to copy the tuple somewhere and have a background process do the analysis. > > That said, it might be more productive to know what blocks are available in memory and use correlation to guesstimate whether a particular query will need hot or cold blocks. Or perhaps we create a different structure that lets you track the distribution of each column linearly through the table; something more sophisticated than just using correlation.... perhaps something like indicating which stats bucket was most prevalent in each block/range of blocks in a table. That information would allow you to estimate exactly what blocks in the table you're likely to need...

Those are very good ideas I would keep in mind for vacuum/checkpoint tasks: if you are able to know the hot and cold data, you can then order it within the segments of the relation. But making it work at the planner level looks hard. I am not opposed to the idea, but I have no idea how to do it right now.

> -- > Jim C. Nasby, Database Architect jim@nasby.net > 512.569.9461 (cell) http://jim.nasby.net

-- Cédric Villemain 2ndQuadrant http://2ndQuadrant.fr/ PostgreSQL : Expertise, Formation et Support
> Well, all of that stuff sounds impractically expensive to me... but I > just work here. I'll point out that the simple version, which just checks for hot tables and indexes, would improve estimates greatly and be a LOT less complicated than these proposals. Certainly having some form of block-based or range-based stats would be better, but it also sounds hard enough to maybe never get done. Having user-accessible "hot" stats would also be useful to DBAs. -- Josh Berkus PostgreSQL Experts Inc. http://pgexperts.com
On Mon, May 23, 2011 at 3:08 PM, Josh Berkus <josh@agliodbs.com> wrote: > >> Well, all of that stuff sounds impractically expensive to me... but I >> just work here. > > I'll point out that the simple version, which just checks for hot tables > and indexes, would improve estimates greatly and be a LOT less > complicated than these proposals. I realize I'm sounding like a broken record here, but as far as I can tell there is absolutely zero evidence that that would be better. I'm sure you're in good company thinking so, but the list of things that could skew (or should I say, screw) the estimates is long and painful; and if those estimates are wrong, you'll end up with something that is both worse and less predictable than the status quo. First, I haven't seen a shred of hard evidence that the contents of the buffer cache or OS cache are stable enough to be relied upon, and we've repeatedly discussed workloads where that might not be true. Has anyone done a systematic study of this on a variety real production systems? If so, the results haven't been posted here, at least not that I can recall. Second, even if we were willing to accept that we could obtain relatively stable and accurate measurements of this data, who is to say that basing plans on it would actually result in an improvement in plan quality? That may seem obvious, but I don't think it is. The proposed method is a bit like trying to determine the altitude of a hot air balloon by throwing the ballast over the side and timing how long it takes to hit the ground. Executing plans that are based on the contents of the cache will change the contents of the cache, which will in turn change the plans. The idea that we can know, without any experimentation, how that's going to shake out, seems to me to be an exercise in unjustified optimism of the first order. Sorry to sound grumpy and pessimistic, but I really think we're letting our enthusiasm get way, way ahead of the evidence. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hello.

As for me, all this "hot" business looks rather uncertain and dynamic. Two things that I could directly use right now (and they are needed as a pair) are:

1) Per-table/index/database bufferpools (split shared buffers into parts, and allow specifying which index/table/database goes where)
2) Per-table/index cost settings

If I had this, I could allocate specific bufferpools for tables/indexes that MUST be hot in memory and set low costs for those specific tables.

P.S. A third thing that would be great to have, to accompany these two, is a "load on startup" flag to automatically populate the bufferpools with a fast sequential read, but this can be easily emulated with a statement.

Best regards, Vitalii Tymchyshyn
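The emulation mentioned at the end amounts to forcing a full read of whatever must be hot after a restart; it warms the OS cache and shared_buffers but cannot pin anything there. A sketch (hot_table is a placeholder name):

    SELECT count(*) FROM hot_table;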