Thread: How does PG know if data is in memory?
Hi,
After reading lots of documentation, I still don't fully understand how PG knows whether some needed data is in memory or in secondary storage.
While choosing the best query plan, the optimizer must take this into account. Does PG consider this? If so, how does it know?
I presume it checks its shared buffers. But if the data is not in the shared buffers, it could still be in the OS cache, and the plan would not be optimal.
Can PG check the OS cache? If not, considering a dedicated DB server, is it advisable to raise the shared buffer size up to a maximum that does not affect the rest of the system? That way, the shared buffer check would have a higher probability of returning a correct answer.
When setting seq_page_cost and random_page_cost, do I have to consider the probability that data will be in memory? Or does seq_page_cost mean "sequential access on disk" and random_page_cost mean "random access on disk"?
I would appreciate it if someone could clear this up.
Thanks!
Fabrício dos Anjos Silva
LinkCom Soluções em T.I.
Fabrício dos Anjos Silva <fabricio.silva@linkcom.com.br> wrote:

> After reading lots of documentation, I still don't understand
> fully how PG knows if some needed data is in memory or in
> secondary storage.
>
> Does PG consider this?

No.

> When setting seq_page_cost and random_page_cost, do I have to
> consider the probability that data will be in memory?

Yes.

-Kevin
2010/9/29 Fabrício dos Anjos Silva <fabricio.silva@linkcom.com.br>
> When setting seq_page_cost and random_page_cost, do I have to
> consider the probability that data will be in memory? Or does
> seq_page_cost mean "sequential access on disk" and random_page_cost
> mean "random access on disk"?
The reason seq_page_cost and random_page_cost exist as parameters is so that you can inform the optimizer what the relative costs of those actions are, which is directly related to the expected size of the filesystem cache, ratio of total db size to available cache memory, and the performance of your disk i/o subsystems (and any other disk-related work the host may be doing). effective_cache_size allows you to tell postgres how big you believe all available cache memory is - shared_buffers and OS cache.
As to your question about increasing shared_buffers to be some significant proportion of available RAM - apparently, that is not a good idea. I've seen advice that said you shouldn't go above 8GB for shared_buffers and I've also seen 12GB suggested as an upper limit, too. On my host with 48GB of RAM, I didn't see much difference between 8GB and 12GB on a fairly wide variety of tests, so mine is set at 8GB with an effective_cache_size of 36GB.
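For anyone following along, an illustrative way to inspect and experiment with the settings mentioned above; the specific numbers are examples, not recommendations, and "mydb" is a hypothetical database name:

    -- Current planner cost settings and cache-size hint.
    SHOW seq_page_cost;          -- default 1.0
    SHOW random_page_cost;       -- default 4.0
    SHOW effective_cache_size;   -- should roughly equal shared_buffers + expected OS cache

    -- Illustrative experiment: if most of the working set is expected to be
    -- cached, the random/sequential gap is often narrowed.
    SET random_page_cost = 2.0;                        -- this session only
    ALTER DATABASE mydb SET random_page_cost = 2.0;    -- persist for database "mydb"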
Thank you all for the replies.
If PG does not know whether needed data is in memory, how does it estimate cost? There is a huge difference between access time in memory and in secondary storage. Not taking this into account results in almost "useless" estimates. I am not saying that PG does a poor job, but I've been using it for 4 years and from time to time I notice very poor estimates. After some testing 2 years ago, the only configuration I could manage to use was to tell PG to avoid Seq Scans and Index Scans. I know that in many situations these techniques are the best to choose, but when they are chosen where they are not suitable, I get very bad plans.
Recently, I faced poor performance again, but this time because we started to work with larger tables (10M rows). This encouraged me to study PG tuning again, trying to understand how the planner works and trying to get the best out of it. Unfortunately, it does not seem to be an easy task.
If someone could point me to good books about PG tuning, I would appreciate it. I found some yet-to-be-released books about PG 9. Any comments on them?
Thank you all.
Fabrício dos Anjos Silva
LinkCom Soluções em T.I.
On 1/10/2010 7:12 PM, Fabrício dos Anjos Silva wrote:
> Thank you all for the replies.
>
> If PG does not know whether needed data is in memory, how does it
> estimate cost? There is a huge difference between access time in
> memory and in secondary storage. Not taking this into account
> results in almost "useless" estimates.

It's generally configured with the conservative assumption that data will have to come from disk.

Note that the query planner's job isn't to figure out how long the query will take. It's to compare various possible query plans and decide which will be fastest.

There are certainly cases where knowing what's cached would help with this - for example: if an index is cached but the table data isn't, it's more likely to be worth using the index to reduce disk reads. But I don't know just how much difference it really makes.

Because the query often only wants a small subset of the data, and whole relations are rarely fully cached, it's not enough to know that "some of relation X is cached", it has to know if the cached parts are the parts that'll be required, or at least an approximation of that. It sounds horrendously complicated to keep track of to me, and in the end it won't make query execution any faster, it'll just potentially help the planner pick a better plan. I wonder if that'd be worth the extra CPU time spent managing the cache and cache content stats, and using those cache stats when planning? It'd be an interesting experiment, but the outcome is hardly obvious.

As you can see, I don't really agree that the planner's estimates are useless just because it's not very aware of the cache's current contents. It has a pretty good idea of the system's memory and how much of that can be used for cache, and knows how big various indexes and relations are. That seems to work pretty well.

If some kind of cache awareness was to be added, I'd be interested in seeing a "hotness" measure that tracked how heavily a given relation/index has been accessed and how much has been read from it recently. A sort of age-scaled blocks-per-second measure that includes both cached and uncached (disk) reads. This would let the planner know how likely parts of a given index/relation are to be cached in memory without imposing the cost of tracking the cache in detail. I'm still not sure it'd be all that useful, though...

> I am not saying that PG does a poor job, but I've been using it for
> 4 years and from time to time I notice very poor estimates.

Most of the issues reported here, at least, are statistics issues, rather than lack of knowledge about cache status. The planner thinks it'll find (say) 2 tuples matching a filter, and instead finds 100,000, so it chooses a much less efficient join type. That sort of thing is really independent of the cache state.

> Recently, I faced poor performance again, but this time because we
> started to work with larger tables (10M rows). This encouraged me to
> study PG tuning again, trying to understand how the planner works
> and trying to get the best of it. Unfortunately, it does not seem to
> be an easy task.

No argument there! Like any database there's a fair bit of black magic involved, and a whole lot of benchmarking. The key thing is to have appropriate statistics (usually high), get a reasonable random_page_cost and seq_page_cost, to set your effective_cache_size appropriately, and to set reasonable work_mem.

"Reasonable" is hard to work out for work_mem, because Pg's work_mem limit is per-sort (etc), not per-query or per-backend. I understand that making it per-query is way, way harder than it sounds at face value, though, so we must make do.

--
Craig Ringer
Tech-related writing at http://soapyfrogs.blogspot.com/
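As a side note for readers hitting the statistics problems Craig describes, comparing the planner's row estimate against reality is the usual first step; the table and column below are hypothetical, so adapt them to your own schema:

    -- Look at "rows=" (estimate) vs. "actual ... rows=" in the output; a large
    -- mismatch usually points at statistics, not cache awareness.
    EXPLAIN ANALYZE
    SELECT * FROM orders WHERE status = 'pending';

    -- If the estimate is badly off, raise the statistics target for that
    -- column, re-analyze, and compare the plans again:
    ALTER TABLE orders ALTER COLUMN status SET STATISTICS 1000;
    ANALYZE orders;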
Craig Ringer <craig@postnewspapers.com.au> wrote:

> It sounds horrendously complicated to keep track of to me, and in
> the end it won't make query execution any faster, it'll just
> potentially help the planner pick a better plan. I wonder if that'd
> be worth the extra CPU time spent managing the cache and cache
> content stats, and using those cache stats when planning? It'd be
> an interesting experiment, but the outcome is hardly obvious.

I agree with that, but I think there's an even more insidious issue here. Biasing plans heavily toward using what is already in cache could have a destabilizing effect on performance. Let's say that some query or maintenance skews the cache toward some plan which is much slower when cached than another plan would be if cached. Let's also postulate that this query runs very frequently. It will always settle for what's fastest *this* time, not what would make for fastest performance if consistently used. If it never chooses the plan which would run better if cached, the data used for that plan may never make it into cache, and you will limp along with the inferior plan forever.

If you set the overall level of caching you expect, the optimizer will tend to wind up with data cached to support the optimal plans for that level of caching for the frequently run queries.

-Kevin
> It sounds horrendously complicated to keep track of to me, and in
> the end it won't make query execution any faster, it'll just
> potentially help the planner pick a better plan. I wonder if that'd
> be worth the extra CPU time spent managing the cache and cache
> content stats, and using those cache stats when planning? It'd be
> an interesting experiment, but the outcome is hardly obvious.

Well, suppose you pick an index scan, the only way to know which index (and heap) pages you'll need is to actually do the index scan... which isn't really something you'd do when planning. So you scan,
Craig,
I agree with you. Not completely, but I do.
I'm just stuck in a situation where I can't figure out what values to use for the parameters. I can't even think of a way to test and discover them.
I followed Josh Berkus' GUC spreadsheet and some tutorials on the PG wiki, but how do I test whether my configuration is good or bad? I see in the PG log that some queries have bad plans, but what should I do in order to tell PG to make better decisions? I tried different values with no success.
I understand that parameters have no "work everywhere" values. Each database has its characteristics and each server has its HW specifications.
Is there any automated test tool? I can compile a list of real-world queries, and provide an exact copy of my db server just for testing. But how do I do it? Write a bunch of scripts? Is there any serious tool that tries different parameters, runs a load test, processes the results and generates reports?
Again, thanks all of you for the replies.
Cheers,
Fabrício dos Anjos Silva
LinkCom Soluções em T.I.
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:

> I agree with that, but I think there's an even more insidious issue
> here. Biasing plans heavily toward using what is already in cache
> could have a destabilizing effect on performance.

Not to mention the destabilizing effect on the plans themselves. Behavior like that would make EXPLAIN nearly useless, because the plan you get would vary from moment to moment even when "nothing is changing". It's fairly clear that people don't actually want that.

			regards, tom lane
2010/10/1 Fabrício dos Anjos Silva <fabricio.silva@linkcom.com.br>
> Craig,
>
> I agree with you. Not completely, but I do.
>
> I'm just stuck in a situation where I can't figure out what values
> to use for the parameters. I can't even think of a way to test and
> discover them.
> I followed Josh Berkus' GUC spreadsheet and some tutorials on the PG
> wiki, but how do I test whether my configuration is good or bad? I
> see in the PG log that some queries have bad plans, but what should
> I do in order to tell PG to make better decisions? I tried different
> values with no success.

You can set different values for most configuration params on individual db connections. You can test different values for individual slow-running queries. Rather than disabling whole features in the entire database - which may well make lots of other queries run less quickly - you can, at the very least, just disable those features before running the queries that are known to be slow and for which you could not find global values which worked well. Disable sequence plans just before running query x, or boost work_mem to a very high value just for query y.

It is also possible that you've simply outstripped your hardware's capability. We had a database with a number of tables containing tens of millions of rows and queries which frequently required aggregating over whole tables. Moving from 8GB of RAM to 48GB of RAM (so that a large chunk of the db fits in memory) and from 6 spindles to 12, and then just modifying the global config to suit the new hardware, gave us a huge performance boost that we could never have gotten on the old hardware, no matter how much tuning of individual queries we did. I was actually able to drop all of the custom config tweaks that we had on individual queries, though I'm sure I'll eventually wind up adding some back - queries that aggregate over large tables really benefit from a lot of work_mem - more than I want to configure globally.
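A minimal sketch of the per-query approach described above (the query and the values are hypothetical, not recommendations); SET LOCAL changes revert automatically at the end of the transaction, so the rest of the workload is unaffected:

    BEGIN;
    SET LOCAL work_mem = '512MB';        -- plenty of sort/hash memory for this query only
    SET LOCAL enable_seqscan = off;      -- discourage (not forbid) sequential scans
    EXPLAIN ANALYZE
    SELECT customer_id, sum(amount)
    FROM   orders
    GROUP  BY customer_id;
    COMMIT;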
On 3/10/2010 7:39 AM, Richard Troy wrote:
> I can't speak for modern "OpenVMS", but "back in the day", VMS had a
> very effective memory management strategy which, in effect, made it
> as if all memory was a cache for disk. It did this by means of a
> mechanism by which to identify all potentially reachable disk space.
> When disk was read in, an entry would be made mapping the memory to
> the disk space from which it came - and if it was later updated, the
> mapping entry was marked "dirty." Whenever disk access was
> contemplated, a check was made to see if it was already in memory
> and if so, it'd provide access to the in-memory copy instead of
> doing the read again. (This also permitted, under some
> circumstances, to reduce write activity as well.)

That's how Linux's memory management works, too, at least if I understand you correctly. Pretty much every modern OS does it. Pg is reliant on the operating system's disk cache, and has some minimal knowledge of it (see effective_cache_size). I don't know how shared_buffers management works, but certainly at the OS cache level that's what already happens.

--
Craig Ringer
Tech-related writing at http://soapyfrogs.blogspot.com/
Samuel Gendler wrote:
> As to your question about increasing shared_buffers to be some
> significant proportion of available RAM - apparently, that is not a
> good idea. I've seen advice that said you shouldn't go above 8GB for
> shared_buffers and I've also seen 12GB suggested as an upper limit,
> too.

The publicly discussed tests done at Sun suggested 10GB was the effective upper limit on Solaris before performance started dropping instead of increasing on some of their internal benchmarks. And I've heard privately from two people who have done similar experiments on Linux and found closer to 8GB to be the point where performance started to drop. I'm hoping to get some hardware capable of providing some more public results in this area, and some improvements if we can get better data about what causes this drop in efficiency.

Given that some write-heavy workloads start to suffer considerable checkpoint issues when shared_buffers is set to a really high value, there's at least two reasons to be conservative here. The big win is going from the tiny default to hundreds of megabytes. Performance keeps going up for many people into the low gigabytes range, but the odds of hitting a downside increase too. Since PostgreSQL uses the OS cache, too, I see some systems with a whole lot of RAM where the 512MB - 1GB range still ends up being optimal, just in terms of balancing the improvements you get from things being in the cache vs. the downsides of heavy checkpoint writes.

--
Greg Smith, 2ndQuadrant US    greg@2ndQuadrant.com    Baltimore, MD
PostgreSQL Training, Services and Support    www.2ndQuadrant.us
Author, "PostgreSQL 9.0 High Performance"
Pre-ordering at: https://www.packtpub.com/postgresql-9-0-high-performance/book
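For anyone wanting to check where a given server stands, a small illustrative snippet; the figures in the comments are simply the ones discussed in this thread, not universal recommendations:

    -- Current values (shared_buffers can only be changed in postgresql.conf
    -- plus a server restart; effective_cache_size is merely a planner hint).
    SHOW shared_buffers;
    SHOW effective_cache_size;

    -- Example postgresql.conf lines for a dedicated box with ~48GB of RAM,
    -- mirroring the numbers mentioned above:
    --   shared_buffers       = 8GB
    --   effective_cache_size = 36GB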
Fabrício dos Anjos Silva wrote:
> After reading lots of documentation, I still don't understand fully
> how PG knows if some needed data is in memory or in secondary
> storage. While choosing the best query plan, the optimizer must take
> this into account. Does PG consider this? If so, how does it know?

There are actually two different questions here, and I'm not sure both have been completely clarified for you by the discussion yet.

PostgreSQL has its own dedicated pool of memory, sized by shared_buffers. When you request a page, if it's already in memory, you get pointed toward that copy and no physical I/O happens. If it's not, PostgreSQL asks the OS to read the page. It's possible that will return a page that's in the OS cache; the database currently has no idea when this does or doesn't happen though. The hit % counters in the database only reflect shared_buffers hits, not OS ones. Some work to integrate the OS cache information into the database has been done; the current leading project in that area is pgfincore: http://pgfoundry.org/projects/pgfincore/

However, none of this information is considered at all by the query optimizer. It makes plans without any knowledge of what is or isn't in RAM right now, either the dedicated database memory or the OS cache. Only the ratios of the planner constants are really considered. You can set those on a query by query basis to provide subtle hints when you know something the planner doesn't, but you have to be very careful about doing that, as those plans tend to get obsolete eventually when you do that trick.

I had a brain-storming session on this subject with a few of the hackers in the community a while back that I haven't had a chance to do something with yet (it exists only as a pile of scribbled notes so far). There's a couple of ways to collect data on what's in the database and OS cache, and a couple of ways to then expose that data to the optimizer. But that needs to be done very carefully, almost certainly as only a manual process at first, because something that's producing cache feedback all of the time will cause plans to change all the time, too. Where I suspect this is going is that we may end up tracking various statistics over time, then periodically providing a way to export a mass of "typical % cached" data back to the optimizer for use in plan cost estimation purposes. But the idea of monitoring continuously and always planning based on the most recent data available has some stability issues, both from a "too many unpredictable plan changes" and a "bad short-term feedback loop" perspective, as mentioned by Tom and Kevin already.

--
Greg Smith, 2ndQuadrant US    greg@2ndQuadrant.com
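If it helps to see the hit counters Greg refers to, here is an illustrative query against the standard statistics views; note, as he says, that it reflects shared_buffers hits only, never OS cache hits:

    -- Rough per-table shared_buffers hit ratio for the busiest tables.
    SELECT relname,
           heap_blks_hit,
           heap_blks_read,
           round(100.0 * heap_blks_hit
                 / NULLIF(heap_blks_hit + heap_blks_read, 0), 1) AS shared_buffers_hit_pct
    FROM   pg_statio_user_tables
    ORDER  BY heap_blks_hit + heap_blks_read DESC
    LIMIT  10;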
Craig Ringer wrote:
> If some kind of cache awareness was to be added, I'd be interested
> in seeing a "hotness" measure that tracked how heavily a given
> relation/index has been accessed and how much has been read from it
> recently. A sort of age-scaled blocks-per-second measure that
> includes both cached and uncached (disk) reads. This would let the
> planner know how likely parts of a given index/relation are to be
> cached in memory without imposing the cost of tracking the cache in
> detail. I'm still not sure it'd be all that useful, though...

Yup, that's one of the design ideas scribbled in my notes, as is the idea of what someone dubbed a "heat map" that tracked which parts of the relation were actually the ones in RAM, the other issue you mentioned.

The problem facing a lot of development possibilities in this area is that we don't have any continuous benchmarking of complicated plans going on right now. So if something really innovative is done, there's really no automatic way to test the result and then see what types of plans it improves and what it makes worse. Until there's some better performance regression work like that around, development on the optimizer has to favor being very conservative.

--
Greg Smith, 2ndQuadrant US    greg@2ndQuadrant.com
Fabrício dos Anjos Silva wrote:
> Is there any automated test tool? I can compile a list of real-world
> queries, and provide an exact copy of my db server just for testing.
> But how do I do it? Write a bunch of scripts? Is there any serious
> tool that tries different parameters, runs a load test, processes
> the results and generates reports?

There's a list of tools for playing back a test workload at http://wiki.postgresql.org/wiki/Statement_Playback

I'm not aware of anyone beyond some academic research that has taken that idea and built something to test many database parameter combinations. They did that at http://www.cs.duke.edu/~shivnath/papers/ituned.pdf but I don't think that code went public; I asked about it at one point and never heard anything really useful back.

--
Greg Smith, 2ndQuadrant US    greg@2ndQuadrant.com
Fabrício dos Anjos Silva wrote:
> If someone could point me to good books about PG tuning, I would
> appreciate it. I found some yet-to-be-released books about PG 9.
> Any comments on them?

The largest treatment of the subject already in print I'm aware of is in the Korry and Susan Douglas "PostgreSQL" book from 2005, which has around 30 pages covering PostgreSQL 8.0. That material is very good and much of it still applies, but it's mostly theory without many examples, and there have been a lot of changes to the optimizer since then.

There's a good talk by Robert Haas on this subject too that's quite current; you can find his slides at http://sites.google.com/site/robertmhaas/presentations and a recording of one version of him giving it is at http://www.pgcon.org/2010/schedule/events/208.en.html

My "PostgreSQL 9.0 High Performance", due out later this month if things continue on schedule, has about 80 pages dedicated to indexing and query optimization (waiting for final typesetting to know the exact count). The main difference with what I do compared to every other treatment I've seen of this subject is that I suggest a sample small but not trivial data set, then show plans for real queries run against it. So you should be able to duplicate the examples, and then tinker with them on your own system to see their plans change as you adjust parameters to follow along. That really is the only way to gain expertise here. Starting with the troublesome queries from your live system will work just as well for that, once you get familiar with enough of the basics.

I'd suggest taking a look and a listen to Robert's talk for now; that will get you started in the right direction. Combine it with reading all of the documentation in the manual on this subject. That should keep you busy for a while, and by the time you're done you may find my book is available too.

--
Greg Smith, 2ndQuadrant US    greg@2ndQuadrant.com
2010/10/4 Greg Smith <greg@2ndquadrant.com>:
> Yup, that's one of the design ideas scribbled in my notes, as is the
> idea of what someone dubbed a "heat map" that tracked which parts of
> the relation were actually the ones in RAM, the other issue you
> mentioned.

* Tracking a specific block is not very easy because of readahead. You end up measuring whether a block was in memory at the moment you requested it physically, not at the moment the first seek/fread happened. It is still an interesting stat, IMHO; I wonder how it could add value to the planner.

* If the planner knows more about the OS cache, it can guess effective_cache_size on its own, which is probably already nice to have.

Extract from the postgres code:

    * We use an approximation proposed by Mackert and Lohman, "Index Scans
    * Using a Finite LRU Buffer: A Validated I/O Model", ACM Transactions
    * on Database Systems, Vol. 14, No. 3, September 1989, Pages 401-424.

The planner uses that in conjunction with effective_cache_size to guess whether it is interesting to scan the index. The question is whether this model is still valid given more precise knowledge of the OS page cache... and also whether it matches how different systems like Windows and Linux handle the page cache.

Hooks around cost estimation should help in writing a module to rethink that part of the planner and make it use statistics about the cache. I wonder whether adding such hooks to core would impact its performance? Anyway, doing that is probably the easiest and shortest way to test the behavior.

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
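For readers who want to play with it, here is a rough sketch of that Mackert-Lohman approximation as a standalone SQL function, transcribed from the piecewise formula in the costsize.c comments (T = pages in the table, N = tuples, s = selectivity, b = cache pages assumed available). This is an illustration, not the planner's actual code path:

    CREATE OR REPLACE FUNCTION ml_pages_fetched(
        t numeric,   -- $1: pages in the table
        n numeric,   -- $2: tuples in the table
        s numeric,   -- $3: selectivity (fraction of tuples fetched)
        b numeric)   -- $4: cache pages assumed available
    RETURNS numeric LANGUAGE sql IMMUTABLE AS $$
      SELECT CASE
               WHEN $1 <= $4 THEN
                 LEAST(2 * $1 * $2 * $3 / (2 * $1 + $2 * $3), $1)
               WHEN $2 * $3 <= 2 * $1 * $4 / (2 * $1 - $4) THEN
                 2 * $1 * $2 * $3 / (2 * $1 + $2 * $3)
               ELSE
                 $4 + ($2 * $3 - 2 * $1 * $4 / (2 * $1 - $4)) * ($1 - $4) / $1
             END
    $$;

    -- e.g. a 100k-page table, 10M rows, fetching 1% of them, 50k cache pages:
    SELECT ml_pages_fetched(100000, 10000000, 0.01, 50000);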
On 10/04/2010 04:22 AM, Greg Smith wrote:
> Where I suspect this is going is that we may end up tracking various
> statistics over time, then periodically providing a way to export a
> mass of "typical % cached" data back to the optimizer for use in
> plan cost estimation purposes. But the idea of monitoring
> continuously and always planning based on the most recent data
> available has some stability issues, both from a "too many
> unpredictable plan changes" and a "bad short-term feedback loop"
> perspective, as mentioned by Tom and Kevin already.

Why not monitor the distribution of response times, rather than "cached" vs. not?

That a) avoids the issue of discovering what was a cache hit, b) deals neatly with multilevel caching, and c) feeds directly into cost estimation.

Cheers,
   Jeremy
On Mon, Oct 4, 2010 at 6:47 PM, Jeremy Harris <jgh@wizmail.org> wrote:
> Why not monitor the distribution of response times, rather than
> "cached" vs. not?
>
> That a) avoids the issue of discovering what was a cache hit,
> b) deals neatly with multilevel caching, and c) feeds directly into
> cost estimation.

I was hot on doing better cache modeling a year or two ago, but the elephant in the room is that it's unclear that it solves any real-world problem. The OP is clearly having a problem, but there's not enough information in his post to say what is actually causing it, and it's probably not caching effects. We get occasional complaints of the form "the first time I run this query it's slow, and then after that it's fast" but, as Craig Ringer pointed out upthread, not too many. And even with respect to the complaints we do get, it's far from clear that the cure is any better than the disease. Taking caching effects into account could easily result in the first execution being slightly less slow and all of the subsequent executions being moderately slow. That would not be an improvement for most people. The reports that seem really painful to me are the ones where people with really big machines complain of needing HOURS for the cache to warm up, and having the system bogged down to a standstill until then. But changing the cost model isn't going to help them either.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
An approach that works can be found in DB2, and likely elsewhere.

The key is that tablespaces/tables/indexes/buffers are all attached through the bufferpool (the DB2 term). A tablespace/bufferpool match is defined. Then tables and indexes are assigned to the tablespace (and implicitly, the bufferpool). As a result, one can effectively pin data in memory. This is very useful, but not low hanging fruit to implement.

The introduction of rudimentary tablespaces is a first step. I assumed that the point was to get to a DB2-like structure at some point. Yes?

Robert
On Mon, Oct 11, 2010 at 11:11 PM, <gnuoytr@rcn.com> wrote:
> The key is that tablespaces/tables/indexes/buffers are all attached
> through the bufferpool (the DB2 term). [...] As a result, one can
> effectively pin data in memory.
>
> The introduction of rudimentary tablespaces is a first step. I
> assumed that the point was to get to a DB2-like structure at some
> point. Yes?

We already have tablespaces, and our data already is accessed through the buffer pool.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
The discussions I've seen indicated that, in use, tablespaces were at the database level, but, yes, the docs do say that a table can be assigned to a defined tablespace. What I still can't find is syntax which establishes buffers/caches/whatever and assigns them to tablespaces. Without that, I'm not sure what benefit there is to tablespaces, other than a sort of RAID-lite.

Robert

<gnuoytr@rcn.com> wrote:
> An approach that works can be found in DB2, and likely elsewhere.
>
> The key is that tablespaces/tables/indexes/buffers are all
> attached through the bufferpool (the DB2 term). A tablespace/
> bufferpool match is defined. Then tables and indexes are assigned
> to the tablespace (and implicitly, the bufferpool). As a result,
> one can effectively pin data in memory. This is very useful, but
> not low hanging fruit to implement.

This sounds similar to Sybase named caches. You can segment off portions of the memory for specific caches, break that up into space reserved for different I/O buffer sizes, and bind specific database objects (tables and indexes) to specific caches. On the few occasions where someone had failed to configure the named caches when setting up a machine, it was caught almost immediately after deployment because of end-user complaints about poor performance. This was so critical to performance for us when we were using Sybase that one of my first reactions on finding it missing in PostgreSQL was distress over the inability to tune as I had.

When I posted to the list about it, the response was that LRU eviction was superior to any tuning any human would do. I didn't and don't believe that, but have found it's close enough in the PostgreSQL environment to be *way* down my list of performance issues. In fact, when looking at the marginal benefits it would generate in PostgreSQL when done right, versus the number of people who would shoot themselves in the foot with it, even I have come around to feeling it's probably not a good idea.

FWIW, the four main reasons for using it were:

(1) Heavily used data could be kept fully cached in RAM and not driven out by transient activity.

(2) You could flag a cache used for (1) above as using "relaxed LRU accounting" -- it saved a lot of time tracking repeated references, leaving more CPU for other purposes.

(3) Each named cache had its own separate set of locks, reducing contention.

(4) Large tables for which the heap was often scanned in its entirety or for a range on the clustered index could be put in a relatively small cache with large I/O buffers. This avoided blowing out the default cache space for situations which almost always required disk I/O anyway.

None of that is anything for amateurs to play with. You need to set up caches like that based on evidence from monitoring and do careful benchmarking of the results to actually achieve improvements over LRU logic.

> The introduction of rudimentary tablespaces is a first step. I
> assumed that the point was to get to a DB2-like structure at some
> point. Yes?

As far as I can tell, there is nobody with that intent.

-Kevin
Couldn't have said it better myself; covered all the bases. If PG wants to become an industrial strength database, worthy of replacing DB2/etc., then these are the sorts of knobs and switches it will need.

-- None of that is anything for amateurs to play with.

Not to jam a stick in anybody's eye, but shouldn't database pros not be amateurs? Or are most PG-ers coders who don't really want to design and tune a database?

Robert
<gnuoytr@rcn.com> wrote:

> -- None of that is anything for amateurs to play with.
>
> Not to jam a stick in anybody's eye, but shouldn't database pros
> not be amateurs?

While many PostgreSQL installations are managed by professional DBAs, or programmers or consultants with a deep enough grasp of the issues to tune a knob like that appropriately, PostgreSQL is also used in environments without such staff. In fact, there is pressure to make PostgreSQL easier to configure for exactly that reason. If we add more knobs which are this hard to tune correctly, we would risk inundation with complaints from people who tried to use it and made things worse.

-Kevin
Kevin Grittner wrote:
> ...Sybase named caches...segment off portions of the memory for
> specific caches... bind specific database objects (tables and
> indexes) to specific caches. ...
>
> When I posted to the list about it, the response was that LRU
> eviction was superior to any tuning any human would do. I didn't
> and don't believe that....
>
> FWIW, the four main reasons for using it were:
> (1) Heavily used data could be kept fully cached in RAM...

Lightly-used-but-important data seems like another use case.

LRU's probably far better than me at optimizing for the total throughput and/or average response time. But if there's a requirement:
   "Even though this query's very rare, it should respond ASAP,
    even at the expense of the throughput of the rest of the system."
it sounds like this kind of hand-tuning might be useful.
On Tue, Oct 12, 2010 at 10:35 AM, Kevin Grittner <Kevin.Grittner@wicourts.gov> wrote:
> (1) Heavily used data could be kept fully cached in RAM and not
> driven out by transient activity.

We've attempted to address this problem by adding logic to prevent the buffer cache from being trashed by vacuums, bulk loads, and sequential scans. It would be interesting to know if anyone has examples of that logic falling over or proving inadequate.

> (2) You could flag a cache used for (1) above as using "relaxed LRU
> accounting" -- it saved a lot of time tracking repeated references,
> leaving more CPU for other purposes.

We never do strict LRU accounting.

> (3) Each named cache had its own separate set of locks, reducing
> contention.

We have lock partitions, but as discussed recently on -hackers, they seem to start falling over around 26 cores. We probably need to improve that, but I'd rather do that by making the locking more efficient and by increasing the number of partitions rather than by allowing users to partition the buffer pool by hand.

> (4) Large tables for which the heap was often scanned in its
> entirety or for a range on the clustered index could be put in a
> relatively small cache with large I/O buffers. This avoided blowing
> out the default cache space for situations which almost always
> required disk I/O anyway.

I think, but am not quite sure, that my answer to point #1 is also relevant here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
2010/10/13 Ron Mayer <rm_pg@cheapcomplexdevices.com>:
> LRU's probably far better than me at optimizing for the total
> throughput and/or average response time. But if there's a
> requirement:
>    "Even though this query's very rare, it should respond ASAP,
>     even at the expense of the throughput of the rest of the system."
> it sounds like this kind of hand-tuning might be useful.

It is exactly one of the purposes of pgfincore:
http://villemain.org/projects/pgfincore#load_a_table_or_an_index_in_os_page_cache

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support
Kevin Grittner wrote:
> While many PostgreSQL installations are managed by professional
> DBAs, or programmers or consultants with a deep enough grasp of the
> issues to tune a knob like that appropriately, PostgreSQL is also
> used in environments without such staff. In fact, there is pressure
> to make PostgreSQL easier to configure for exactly that reason. If
> we add more knobs which are this hard to tune correctly, we would
> risk inundation with complaints from people who tried to use it and
> made things worse.

Agreed. Here is a blog entry that explains some of the tradeoffs of adding knobs:

	http://momjian.us/main/blogs/pgblog/2009.html#January_10_2009

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
Greg Smith writes:
> heard privately from two people who have done similar experiments on
> Linux and found closer to 8GB to be the point where performance
> started

So on a machine with 72GB is 8GB still the recommended value? Usually have only 10 to 20 connections.
2010/10/28 Francisco Reyes <lists@stringsutils.com>:
> So on a machine with 72GB is 8GB still the recommended value?

Yes, as a maximum, not a minimum. (Some applications will work better with less shared_buffers than others.)

> Usually have only 10 to 20 connections.

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support