Thread: First set of OSDL Shared Mem scalability results, some wierdness ...
Folks,

I'm hoping that some of you can shed some light on this.

I've been trying to peg the "sweet spot" for shared memory using OSDL's
equipment. With Jan's new ARC patch, I was expecting that the desired amount
of shared_buffers would be greatly increased. This has not turned out to be
the case.

The first test series was using OSDL's DBT2 (OLTP) test, with 150
"warehouses". All tests were run on a 4-way Pentium III 700MHz, 3.8GB RAM
system hooked up to a rather high-end storage device (14 spindles). Tests
were on PostgreSQL 8.0b3, Linux 2.6.7.

Here's a top-level summary:

shared_buffers   % RAM   NOTPM20*
1000             0.2%    1287
23000            5%      1507
46000            10%     1481
69000            15%     1382
92000            20%     1375
115000           25%     1380
138000           30%     1344

* = New Order Transactions Per Minute, last 20 Minutes
    Higher is better. The maximum possible is 1800.

As you can see, the "sweet spot" appears to be between 5% and 10% of RAM,
which is if anything *lower* than recommendations for 7.4!

This result is so surprising that I want people to take a look at it and
tell me if there's something wrong with the tests or some bottlenecking
factor that I've not seen.

In order above:
http://khack.osdl.org/stp/297959/
http://khack.osdl.org/stp/297960/
http://khack.osdl.org/stp/297961/
http://khack.osdl.org/stp/297962/
http://khack.osdl.org/stp/297963/
http://khack.osdl.org/stp/297964/
http://khack.osdl.org/stp/297965/

Please note that many of the graphs in these reports are broken. For one
thing, some aren't recorded (flat lines) and the CPU usage graph has
mislabeled lines.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
I have an idea that makes some assumptions about internals that I think are
correct.

When you have a huge number of buffers in a list that has to be traversed to
look for things in cache, e.g. 100k, you will generate an almost equivalent
number of cache line misses on the processor to jump through all those
buffers. As I understand it (and I haven't looked so I could be wrong), the
buffer cache is searched by traversing it sequentially. OTOH, it seems
reasonable to me that the OS disk cache may actually be using a tree
structure that would generate vastly fewer cache misses by comparison to
find a buffer.

This could mean a substantial linear search cost as a function of the number
of buffers, big enough to rise above the noise floor when you have hundreds
of thousands of buffers. Cache misses start to really add up when a code
path generates many, many thousands of them, and differences in the access
path between the buffer cache and disk cache would be reflected when you
have that many buffers. I've seen these types of unexpected performance
anomalies before that got traced back to code patterns and cache efficiency
and gotten integer factors improvements by making some seemingly irrelevant
code changes.

So I guess my question would be 1) are my assumptions about the internals
correct, and 2) if they are, is there a way to optimize searching the buffer
cache so that a search doesn't iterate over a really long buffer list that
is bottlenecked on cache line replacement.

My random thought of the day,

j. andrew rogers
Josh Berkus <josh@agliodbs.com> writes:
> Here's a top-level summary:

> shared_buffers   % RAM   NOTPM20*
> 1000             0.2%    1287
> 23000            5%      1507
> 46000            10%     1481
> 69000            15%     1382
> 92000            20%     1375
> 115000           25%     1380
> 138000           30%     1344

> As you can see, the "sweet spot" appears to be between 5% and 10% of RAM,
> which is if anything *lower* than recommendations for 7.4!

This doesn't actually surprise me a lot. There are a number of aspects of
Postgres that will get slower the more buffers there are.

One thing that I hadn't focused on till just now, which is a new overhead in
8.0, is that StrategyDirtyBufferList() scans the *entire* buffer list *every
time it's called*, which is to say once per bgwriter loop. And to add insult
to injury, it's doing that with the BufMgrLock held (not that it's got any
choice).

We could alleviate this by changing the API between this function and
BufferSync, such that StrategyDirtyBufferList can stop as soon as it's found
all the buffers that are going to be written in this bgwriter cycle ... but
AFAICS that means abandoning the "bgwriter_percent" knob since you'd never
really know how many dirty pages there were altogether.

BTW, what is the actual size of the test database (disk footprint wise) and
how much of that do you think is heavily accessed during the run? It's
possible that the test conditions are such that adjusting shared_buffers
isn't going to mean anything anyway.

			regards, tom lane
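To make the API change Tom sketches above concrete, here is a minimal
illustration (hypothetical structure and function names, not the actual
bufmgr/freelist code) of a dirty-buffer collection that stops once it has
gathered the pages the bgwriter will write this cycle, rather than scanning
the whole buffer list; as Tom notes, stopping early means the total dirty-page
count is no longer known, which is what a "bgwriter_percent"-style knob needs.

    /* Hypothetical sketch (not PostgreSQL source): collect at most 'maxpages'
     * dirty buffer IDs and stop, instead of scanning the entire buffer list
     * on every bgwriter cycle. */
    #include <stdbool.h>

    typedef struct
    {
        int  buf_id;
        bool dirty;
    } BufHdr;

    static int
    collect_dirty_buffers(const BufHdr *bufs, int nbuffers,
                          int *dirty_ids, int maxpages)
    {
        int found = 0;

        /* Early exit: the loop no longer visits all nbuffers entries ... */
        for (int i = 0; i < nbuffers && found < maxpages; i++)
        {
            if (bufs[i].dirty)
                dirty_ids[found++] = bufs[i].buf_id;
        }

        /* ... which is exactly why the total number of dirty pages (needed
         * for a percentage-based knob) can no longer be reported. */
        return found;
    }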
"J. Andrew Rogers" <jrogers@neopolitan.com> writes: > As I understand it (and I haven't looked so I could be wrong), the > buffer cache is searched by traversing it sequentially. You really should look first. The main-line code paths use hashed lookups. There are some cases that do linear searches through the buffer headers or the CDB lists; in theory those are supposed to be non-performance-critical cases, though I am suspicious that some are not (see other response). In any case, those structures are considerably more compact than the buffers proper, and I doubt that cache misses per se are the killer factor. This does raise a question for Josh though, which is "where's the oprofile results?" If we do have major problems at the level of cache misses then oprofile would be able to prove it. regards, tom lane
On Fri, Oct 08, 2004 at 06:32:32PM -0400, Tom Lane wrote:
> This does raise a question for Josh though, which is "where's the
> oprofile results?" If we do have major problems at the level of cache
> misses then oprofile would be able to prove it.

Or cachegrind. I've found it to be really effective at pinpointing cache
misses in the past (one CPU-intensive routine was sped up by 30% just by
avoiding a memory clear). :-)

/* Steinar */
--
Homepage: http://www.sesse.net/
Tom,

> This does raise a question for Josh though, which is "where's the
> oprofile results?" If we do have major problems at the level of cache
> misses then oprofile would be able to prove it.

Missing, I'm afraid. OSDL has been having technical issues with STP all
week. Hopefully the next test run will have them.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
Tom,

> BTW, what is the actual size of the test database (disk footprint wise)
> and how much of that do you think is heavily accessed during the run?
> It's possible that the test conditions are such that adjusting
> shared_buffers isn't going to mean anything anyway.

The raw data is 32GB, but a lot of the activity is incremental, that is
inserts and updates to recent inserts. Still, according to Mark, most of the
data does get queried in the course of filling orders.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
From: Christopher Browne
Subject: Re: First set of OSDL Shared Mem scalability results, some wierdness ...
josh@agliodbs.com (Josh Berkus) wrote:
> I've been trying to peg the "sweet spot" for shared memory using
> OSDL's equipment. With Jan's new ARC patch, I was expecting that
> the desired amount of shared_buffers would be greatly increased. This
> has not turned out to be the case.

That doesn't surprise me.

My primary expectation would be that ARC would be able to make small buffers
much more effective alongside vacuums and seq scans than they used to be.
That does not establish anything about the value of increasing the size of
buffer caches...

> This result is so surprising that I want people to take a look at it
> and tell me if there's something wrong with the tests or some
> bottlenecking factor that I've not seen.

I'm aware of two conspicuous scenarios where ARC would be expected to
_substantially_ improve performance:

 1. When it allows a VACUUM not to throw useful data out of
    the shared cache in that VACUUM now only 'chews' on one
    page of the cache;

 2. When it allows a Seq Scan to not push useful data out of
    the shared cache, for much the same reason.

I don't imagine either scenario is prominent in the OSDL tests.

Increasing the number of cache buffers _is_ likely to lead to some
slowdowns:

- Data that passes through the cache also passes through kernel
  cache, so it's recorded twice, and read twice...

- The more cache pages there are, the more work is needed for
  PostgreSQL to manage them. That will notably happen anywhere
  that there is a need to scan the cache.

- If there are any inefficiencies in how the OS kernel manages shared
  memory, as their size scales, well, that will obviously cause a
  slowdown.
--
If this was helpful, <http://svcs.affero.net/rm.php?r=cbbrowne> rate me
http://www.ntlug.org/~cbbrowne/internet.html
"One World. One Web. One Program." -- MICROS~1 hype
"Ein Volk, ein Reich, ein Fuehrer" -- Nazi hype
(One people, one country, one leader)
Christopher Browne wrote:
> Increasing the number of cache buffers _is_ likely to lead to some
> slowdowns:
>
> - Data that passes through the cache also passes through kernel
>   cache, so it's recorded twice, and read twice...

Even worse, memory that's used for the PG cache is memory that's not
available to the kernel's page cache. Even if the overall memory usage in
the system isn't enough to cause some paging to disk, most modern kernels
will adjust the page/disk cache size dynamically to fit the memory demands
of the system, which in this case means it'll be smaller if running programs
need more memory for their own use.

This is why I sometimes wonder whether or not it would be a win to use
mmap() to access the data and index files -- doing so under a truly modern
OS would surely at the very least save a buffer copy (from the page/disk
cache to program memory) because the OS could instead directly map the
buffer cache pages into the program's memory space.

Since PG often has to have multiple files open at the same time, and in a
production database many of those files will be rather large, PG would have
to limit the size of the mmap()ed region on 32-bit platforms, which means
that things like the order of mmap() operations to access various parts of
the file can become just as important in the mmap()ed case as it is in the
read()/write() case (if not more so!). I would imagine that the use of
mmap() on a 64-bit platform would be a much, much larger win because PG
would most likely be able to mmap() entire files and let the OS work out how
to order disk reads and writes.

The biggest problem as I see it is that (I think) mmap() would have to be
made to cooperate with malloc() for virtual address space. I suspect issues
like this have already been worked out by others, however...

--
Kevin Brown
kevin@sysexperts.com
Christopher Browne wrote:
> josh@agliodbs.com (Josh Berkus) wrote:
>
>> This result is so surprising that I want people to take a look at it
>> and tell me if there's something wrong with the tests or some
>> bottlenecking factor that I've not seen.
>
> I'm aware of two conspicuous scenarios where ARC would be expected to
> _substantially_ improve performance:
>
>  1. When it allows a VACUUM not to throw useful data out of
>     the shared cache in that VACUUM now only 'chews' on one
>     page of the cache;

Right. Josh, I assume you didn't run these tests with pg_autovacuum running,
which might be interesting.

Also, how do these numbers compare to 7.4? They may not be what you
expected, but they might still be an improvement.

Matthew
Kevin Brown <kevin@sysexperts.com> writes:
> This is why I sometimes wonder whether or not it would be a win to use
> mmap() to access the data and index files --

mmap() is Right Out because it does not afford us sufficient control over
when changes to the in-memory data will propagate to disk. The
address-space-management problems you describe are also a nasty headache,
but that one is the showstopper.

			regards, tom lane
Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> > This is why I sometimes wonder whether or not it would be a win to use
> > mmap() to access the data and index files --
>
> mmap() is Right Out because it does not afford us sufficient control
> over when changes to the in-memory data will propagate to disk. The
> address-space-management problems you describe are also a nasty
> headache, but that one is the showstopper.

Huh? Surely fsync() or fdatasync() of the file descriptor associated with
the mmap()ed region at the appropriate times would accomplish much of this?

I'm particularly confused since PG's entire approach to disk I/O is
predicated on the notion that the OS, and not PG, is the best arbiter of
when data hits the disk. Otherwise it would be using raw partitions for the
highest-speed data store, yes?

Also, there isn't any particular requirement to use mmap() for everything --
you can use traditional open/write/close calls for the WAL and mmap() for
the data/index files (but it wouldn't surprise me if this would require some
extensive code changes).

That said, if it's typical for many changes to be made to a page internally
before PG needs to commit that page to disk, then your argument makes sense,
and that's especially true if we simply cannot have the page written to disk
in a partially-modified state (something I can easily see being an issue for
the WAL -- would the same hold true of the index/data files?).

--
Kevin Brown
kevin@sysexperts.com
I wrote:
> That said, if it's typical for many changes to be made to a page
> internally before PG needs to commit that page to disk, then your
> argument makes sense, and that's especially true if we simply cannot
> have the page written to disk in a partially-modified state (something
> I can easily see being an issue for the WAL -- would the same hold
> true of the index/data files?).

Also, even if multiple changes would be made to the page, with the page
being valid for a disk write only after all such changes are made, the use
of mmap() (in conjunction with an internal buffer that would then be copied
to the mmap()ed memory space at the appropriate time) would potentially save
a system call over the use of write() (even if write() were used to write
out multiple pages). However, there is so much lower-hanging fruit than this
that an mmap() implementation almost certainly isn't worth pursuing for this
alone.

So: it seems to me that mmap() is worth pursuing only if most internal
buffers tend to be written to only once, or if it's acceptable for a
partially modified data/index page to be written to disk (which I suppose
could be true for data/index pages in the face of a rock-solid WAL).

--
Kevin Brown
kevin@sysexperts.com
Kevin Brown <kevin@sysexperts.com> writes:
> Tom Lane wrote:
>> mmap() is Right Out because it does not afford us sufficient control
>> over when changes to the in-memory data will propagate to disk.

> ... that's especially true if we simply cannot
> have the page written to disk in a partially-modified state (something
> I can easily see being an issue for the WAL -- would the same hold
> true of the index/data files?).

You're almost there. Remember the fundamental WAL rule: log entries must hit
disk before the data changes they describe. That means that we need not only
a way of forcing changes to disk (fsync) but a way of being sure that
changes have *not* gone to disk yet. In the existing implementation we get
that by just not issuing write() for a given page until we know that the
relevant WAL log entries are fsync'd down to disk. (BTW, this is what the
LSN field on every page is for: it tells the buffer manager the latest WAL
offset that has to be flushed before it can safely write the page.)

mmap provides msync which is comparable to fsync, but AFAICS it provides no
way to prevent an in-memory change from reaching disk too soon. This would
mean that WAL entries would have to be written *and flushed* before we could
make the data change at all, which would convert multiple updates of a
single page into a series of write-and-wait-for-WAL-fsync steps. Not good.
fsync'ing WAL once per transaction is bad enough, once per atomic action is
intolerable.

There is another reason for doing things this way. Consider a backend that
goes haywire and scribbles all over shared memory before crashing. When the
postmaster sees the abnormal child termination, it forcibly kills the other
active backends and discards shared memory altogether. This gives us fairly
good odds that the crash did not affect any data on disk. It's not perfect
of course, since another backend might have been in process of issuing a
write() when the disaster happens, but it's pretty good; and I think that
that isolation has a lot to do with PG's good reputation for not corrupting
data in crashes. If we had a large fraction of the address space mmap'd then
this sort of crash would be just about guaranteed to propagate corruption
into the on-disk files.

			regards, tom lane
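A compact sketch of the ordering rule Tom spells out (hypothetical names; the
real logic lives in the buffer manager and the WAL-flush code) makes the
point explicit: the buffer manager must be able to hold a dirty page back
until WAL has been flushed past that page's LSN, which is exactly the control
that mmap() does not give you.

    /* Hypothetical sketch of the WAL-before-data rule: a dirty page may only
     * be written once WAL is flushed at least up to the page's LSN. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t XLogRecPtr;

    extern XLogRecPtr flushed_wal_lsn;               /* how far WAL is on disk */
    extern void xlog_flush(XLogRecPtr upto);         /* fsync WAL through 'upto' */
    extern void write_page_to_datafile(int buf_id);  /* plain write(), no fsync */

    typedef struct
    {
        int        buf_id;
        XLogRecPtr page_lsn;   /* newest WAL record describing this page */
        bool       dirty;
    } BufferDesc;

    static void
    flush_buffer(BufferDesc *buf)
    {
        if (!buf->dirty)
            return;

        /* WAL rule: log entries must hit disk before the data change does. */
        if (buf->page_lsn > flushed_wal_lsn)
            xlog_flush(buf->page_lsn);

        /* Only now is it safe for the change to reach the data file. With
         * write() we control this moment; with an mmap'd data file the
         * kernel could have pushed the page out at any earlier time. */
        write_page_to_datafile(buf->buf_id);
        buf->dirty = false;
    }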
Josh Berkus wrote:
> Folks,
>
> I'm hoping that some of you can shed some light on this.
>
> I've been trying to peg the "sweet spot" for shared memory using OSDL's
> equipment. With Jan's new ARC patch, I was expecting that the desired
> amount of shared_buffers would be greatly increased. This has not turned
> out to be the case.
>
> The first test series was using OSDL's DBT2 (OLTP) test, with 150
> "warehouses". All tests were run on a 4-way Pentium III 700MHz, 3.8GB RAM
> system hooked up to a rather high-end storage device (14 spindles). Tests
> were on PostgreSQL 8.0b3, Linux 2.6.7.

I'd like to see these tests run using the CPU affinity capability, in order
to oblige a backend not to change CPU during its lifetime; this could
drastically increase the cache hit rate.

Regards
Gaetano Mendola
On Fri, 8 Oct 2004, Josh Berkus wrote:
> As you can see, the "sweet spot" appears to be between 5% and 10% of RAM,
> which is if anything *lower* than recommendations for 7.4!

What recommendation is that? To have shared buffers be about 10% of the RAM
sounds familiar to me. What was recommended for 7.4? In the past we used to
say that the worst value is 50%, since then the same things might be cached
both by pg and the OS disk cache.

Why do we expect the shared buffer size sweet spot to change because of the
new ARC stuff? And why would it make it better to have bigger shared mem?
Wouldn't it be the opposite: now that we don't invalidate as much of the
cache for vacuums and seq. scans, we can do as good caching as before but
with less shared buffers.

That said, testing and getting some numbers of good sizes for shared mem is
good.

--
/Dennis Björklund
On 10/8/2004 10:10 PM, Christopher Browne wrote:
> josh@agliodbs.com (Josh Berkus) wrote:
>> I've been trying to peg the "sweet spot" for shared memory using
>> OSDL's equipment. With Jan's new ARC patch, I was expecting that
>> the desired amount of shared_buffers would be greatly increased. This
>> has not turned out to be the case.
>
> That doesn't surprise me.

Neither does it surprise me.

> My primary expectation would be that ARC would be able to make small
> buffers much more effective alongside vacuums and seq scans than they
> used to be. That does not establish anything about the value of
> increasing the size of buffer caches...

The primary goal of ARC is to prevent total cache eviction caused by
sequential scans. Which means it is designed to avoid the catastrophic
impact of a pg_dump or other, similar access in parallel to the OLTP
traffic. It would be much more interesting to see how a pg_dump started
halfway into a 2-hour measurement interval affects the response times.

One also has to take a closer look at the data of the DBT2. What amount of
that 32GB is high-frequently accessed, and therefore a good thing to live in
the PG shared cache? A cache significantly larger than that doesn't make
sense to me, under any cache strategy.

Jan

>> This result is so surprising that I want people to take a look at it
>> and tell me if there's something wrong with the tests or some
>> bottlenecking factor that I've not seen.
>
> I'm aware of two conspicuous scenarios where ARC would be expected to
> _substantially_ improve performance:
>
>  1. When it allows a VACUUM not to throw useful data out of
>     the shared cache in that VACUUM now only 'chews' on one
>     page of the cache;
>
>  2. When it allows a Seq Scan to not push useful data out of
>     the shared cache, for much the same reason.
>
> I don't imagine either scenario is prominent in the OSDL tests.
>
> Increasing the number of cache buffers _is_ likely to lead to some
> slowdowns:
>
> - Data that passes through the cache also passes through kernel
>   cache, so it's recorded twice, and read twice...
>
> - The more cache pages there are, the more work is needed for
>   PostgreSQL to manage them. That will notably happen anywhere
>   that there is a need to scan the cache.
>
> - If there are any inefficiencies in how the OS kernel manages shared
>   memory, as their size scales, well, that will obviously cause a
>   slowdown.

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
On 10/9/2004 7:20 AM, Kevin Brown wrote:
> Christopher Browne wrote:
>> Increasing the number of cache buffers _is_ likely to lead to some
>> slowdowns:
>>
>> - Data that passes through the cache also passes through kernel
>>   cache, so it's recorded twice, and read twice...
>
> Even worse, memory that's used for the PG cache is memory that's not
> available to the kernel's page cache. Even if the overall memory

Which underlines my previous statement, that a PG shared cache much larger
than the high-frequently accessed data portion of the DB is
counterproductive. Double buffering (kernel disk buffer plus shared buffer)
only makes sense for data that would otherwise cause excessive memory copies
in and out of the shared buffer. After that, it only lowers the memory
available for disk buffers.

Jan

> usage in the system isn't enough to cause some paging to disk, most
> modern kernels will adjust the page/disk cache size dynamically to fit
> the memory demands of the system, which in this case means it'll be
> smaller if running programs need more memory for their own use.
>
> This is why I sometimes wonder whether or not it would be a win to use
> mmap() to access the data and index files -- doing so under a truly
> modern OS would surely at the very least save a buffer copy (from the
> page/disk cache to program memory) because the OS could instead
> directly map the buffer cache pages into the program's memory space.
>
> Since PG often has to have multiple files open at the same time, and
> in a production database many of those files will be rather large, PG
> would have to limit the size of the mmap()ed region on 32-bit
> platforms, which means that things like the order of mmap() operations
> to access various parts of the file can become just as important in
> the mmap()ed case as it is in the read()/write() case (if not more
> so!). I would imagine that the use of mmap() on a 64-bit platform
> would be a much, much larger win because PG would most likely be able
> to mmap() entire files and let the OS work out how to order disk reads
> and writes.
>
> The biggest problem as I see it is that (I think) mmap() would have to
> be made to cooperate with malloc() for virtual address space. I
> suspect issues like this have already been worked out by others,
> however...

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
Jan Wieck <JanWieck@Yahoo.com> writes:

> On 10/8/2004 10:10 PM, Christopher Browne wrote:
>
> > josh@agliodbs.com (Josh Berkus) wrote:
> >> I've been trying to peg the "sweet spot" for shared memory using
> >> OSDL's equipment. With Jan's new ARC patch, I was expecting that
> >> the desired amount of shared_buffers would be greatly increased. This
> >> has not turned out to be the case.
> > That doesn't surprise me.
>
> Neither does it surprise me.

There's been some speculation that having shared_buffers be about 50% of
your RAM is pessimal, as it guarantees the OS cache is merely doubling up on
all the buffers postgres is keeping. I wonder whether there's a second sweet
spot where the postgres cache is closer to the total amount of RAM.

That configuration would have disadvantages for servers running other jobs
besides postgres. And I was led to believe earlier that postgres starts each
backend with a fairly fresh slate as far as the ARC algorithm goes, so it
wouldn't work well for a postgres server that had lots of short to
moderate-life sessions. But if it were even close it could be interesting.

Reading the data with O_DIRECT and having a single global cache could be
interesting experiments. I know there are arguments against each of these,
but ...

I'm still pulling for an mmap approach to eliminate postgres's buffer cache
entirely in the long term, but it seems like slim odds now. But one way or
the other, having two layers of buffering seems like a waste.

--
greg
On 10/13/2004 11:52 PM, Greg Stark wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
>
>> On 10/8/2004 10:10 PM, Christopher Browne wrote:
>>
>> > josh@agliodbs.com (Josh Berkus) wrote:
>> >> I've been trying to peg the "sweet spot" for shared memory using
>> >> OSDL's equipment. With Jan's new ARC patch, I was expecting that
>> >> the desired amount of shared_buffers would be greatly increased. This
>> >> has not turned out to be the case.
>> > That doesn't surprise me.
>>
>> Neither does it surprise me.
>
> There's been some speculation that having shared_buffers be about 50% of
> your RAM is pessimal, as it guarantees the OS cache is merely doubling up
> on all the buffers postgres is keeping. I wonder whether there's a second
> sweet spot where the postgres cache is closer to the total amount of RAM.

Which would require that shared memory is not allowed to be swapped out
(and swapping it out is allowed in Linux by default, IIRC), so as not to
completely distort the entire test.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
Jan Wieck <JanWieck@Yahoo.com> writes:

> Which would require that shared memory is not allowed to be swapped out
> (and swapping it out is allowed in Linux by default, IIRC), so as not to
> completely distort the entire test.

Well, if it's getting swapped out then it's clearly not being used
effectively.

There are APIs to bar swapping out pages, and the tests could be run without
swap. I suggested it only as an experiment though; there are lots of details
between here and having it be a good configuration for production use.

--
greg
On 10/14/2004 12:22 AM, Greg Stark wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
>
>> Which would require that shared memory is not allowed to be swapped out
>> (and swapping it out is allowed in Linux by default, IIRC), so as not to
>> completely distort the entire test.
>
> Well, if it's getting swapped out then it's clearly not being used
> effectively.

Is it really that easy if 3 different cache algorithms (PG cache, kernel
buffers and swapping) are competing for the same chips?

Jan

> There are APIs to bar swapping out pages, and the tests could be run
> without swap. I suggested it only as an experiment though; there are lots
> of details between here and having it be a good configuration for
> production use.

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> > Tom Lane wrote:
> >> mmap() is Right Out because it does not afford us sufficient control
> >> over when changes to the in-memory data will propagate to disk.
>
> > ... that's especially true if we simply cannot
> > have the page written to disk in a partially-modified state (something
> > I can easily see being an issue for the WAL -- would the same hold
> > true of the index/data files?).
>
> You're almost there. Remember the fundamental WAL rule: log entries
> must hit disk before the data changes they describe. That means that we
> need not only a way of forcing changes to disk (fsync) but a way of
> being sure that changes have *not* gone to disk yet. In the existing
> implementation we get that by just not issuing write() for a given page
> until we know that the relevant WAL log entries are fsync'd down to
> disk. (BTW, this is what the LSN field on every page is for: it tells
> the buffer manager the latest WAL offset that has to be flushed before
> it can safely write the page.)
>
> mmap provides msync which is comparable to fsync, but AFAICS it
> provides no way to prevent an in-memory change from reaching disk too
> soon. This would mean that WAL entries would have to be written *and
> flushed* before we could make the data change at all, which would
> convert multiple updates of a single page into a series of write-and-
> wait-for-WAL-fsync steps. Not good. fsync'ing WAL once per transaction
> is bad enough, once per atomic action is intolerable.

Hmm... something just occurred to me about this.

Would a hybrid approach be possible? That is, use mmap() to handle reads,
and use write() to handle writes?

Any code that wishes to write to a page would have to recognize that it's
doing so and fetch a copy from the storage manager (or something), which
would look to see if the page already exists as a writeable buffer. If it
doesn't, it creates it by allocating the memory and then copying the page
from the mmap()ed area to the new buffer, and returning it. If it does, it
just returns a pointer to the buffer. There would obviously have to be some
bookkeeping involved: the storage manager would have to know how to map a
mmap()ed page back to a writeable buffer and vice-versa, so that once it
decides to write the buffer it can determine which page in the original file
the buffer corresponds to (so it can do the appropriate seek()).

In a write-heavy database, you'll end up with a lot of memory copy
operations, but with the scheme we currently use you get that anyway (it
just happens in kernel code instead of user code), so I don't see that as
much of a loss, if any. Where you win is in a read-heavy database: you end
up being able to read directly from the pages in the kernel's page cache and
thus save a memory copy from kernel space to user space, not to mention the
context switch that happens due to issuing the read().

Obviously you'd want to mmap() the file read-only in order to prevent the
issues you mention regarding an errant backend, and then reopen the file
read-write for the purpose of writing to it. In fact, you could decouple the
two: mmap() the file, then close the file -- the mmap()ed region will remain
mapped. Then, as long as the file remains mapped, you need to open the file
again only when you want to write to it.

--
Kevin Brown
kevin@sysexperts.com
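As an illustration of the hybrid Kevin proposes, here is a sketch with
hypothetical helper names (it deliberately ignores the mmap-vs-write
synchronization caveats Tom raises in the next message): the data file is
mapped read-only so reads come straight from the kernel page cache, while
writes still go through an ordinary file descriptor so their timing stays
under the program's control.

    /* Sketch of a hybrid scheme: mmap() a data file read-only for reads, but
     * perform all writes through pwrite(), so the WAL-before-data rule can
     * still be enforced. Hypothetical, not PostgreSQL code. */
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    typedef struct
    {
        int         fd;     /* kept open read-write for pwrite() */
        const char *map;    /* read-only mapping of the whole file */
        size_t      len;
    } DataFile;

    static int
    datafile_open(DataFile *df, const char *path, size_t len)
    {
        df->fd = open(path, O_RDWR);
        if (df->fd < 0)
            return -1;
        df->len = len;
        df->map = mmap(NULL, len, PROT_READ, MAP_SHARED, df->fd, 0);
        return df->map == MAP_FAILED ? -1 : 0;
    }

    /* Reads come straight from the kernel page cache: no private copy and
     * no read() system call. */
    static const char *
    datafile_read_page(DataFile *df, size_t blockno)
    {
        return df->map + blockno * BLCKSZ;
    }

    /* Writes push a private, fully modified page image back with pwrite(),
     * so the moment the change reaches the file stays under our control. */
    static ssize_t
    datafile_write_page(DataFile *df, size_t blockno, const char *page)
    {
        return pwrite(df->fd, page, BLCKSZ, (off_t) (blockno * BLCKSZ));
    }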
First off, I'd like to get involved with these tests - pressure of other
work only has prevented me. Here's my take on the results so far:

I think taking the ratio of the memory allocated to shared_buffers against
the total memory available on the server is completely fallacious. That is
why the results cannot be explained - IMHO the ratio has no real theoretical
basis.

The important ratio for me is the amount of shared_buffers against the total
size of the database in the benchmark test. Every database workload has a
differing percentage of the total database size that represents the "working
set", or the memory that can be beneficially cached. For the tests that
DBT-2 is performing, I say that there are only so many blocks that are worth
the trouble of caching. If you cache more than this, you are wasting your
time.

For me, these tests don't show that there is a "sweet spot" that you should
set your shared_buffers to, only that for that specific test, you have
located the correct size for shared_buffers. For me, it would be an
incorrect inference that this could then be interpreted as the percentage of
the available RAM where the "sweet spot" lies for all workloads.

The theoretical basis for my comments is this: DBT-2 is essentially a static
workload. That means, for a long test, we can work out with reasonable
certainty the probability that a block will be requested, for every single
block in the database. Given a particular size of cache, you can work out
what your overall cache hit ratio is and therefore what your speedup is
compared with retrieving every single block from disk (the no-cache
scenario). If you draw a graph of speedup (y) against cache size as a % of
total database size, the graph looks like an upside-down "L" - i.e. the
graph rises steeply as you give it more memory, then turns sharply at a
particular point, after which it flattens out. The "turning point" is the
"sweet spot" we all seek - the optimum amount of cache memory to allocate -
but this spot depends upon the workload and database size, not on available
RAM on the system under test.

Clearly, the presence of the OS disk cache complicates this. Since we have
two caches both allocated from the same pot of memory, it should be clear
that if we overallocate one cache beyond its optimum effectiveness, while
the second cache is still in its "more is better" stage, then we will get
reduced performance. That seems to be the case here. I wouldn't accept that
a fixed ratio between the two caches exists for ALL, or even the majority
of, workloads - though clearly broad-brush workloads such as "OLTP" and
"Data Warehousing" do have similar-ish requirements.

As an example, let's look at an application with two tables: SmallTab has
10,000 rows of 100 bytes each (so the table is ~1 MB) - one row per photo in
a photo gallery web site. LargeTab has large objects within it and has
10,000 photos, average size 10 MB (so the table is ~100GB). Assuming all
photos are requested randomly, you can see that an optimum cache size for
this workload is 1MB RAM, 100GB disk. Trying to up the cache doesn't have
much effect on the probability that a photo (from LargeTab) will be in
cache, unless you have a large % of 100GB of RAM, when you do start to make
gains. (Please don't be picky about indexes, catalog, block size etc.) That
clearly has absolutely nothing at all to do with the RAM of the system on
which it is running.

I think Jan has said this also in far fewer words, but I'll leave that to
Jan to agree/disagree...
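Expressed as a formula (notation added here, not Simon's): for a static
workload, only the cache hit ratio h(C) at cache size C matters, where

    T_{eff}(C) = h(C)\, t_{cache} + \bigl(1 - h(C)\bigr)\, t_{disk},
    \qquad \mathrm{speedup}(C) = \frac{T_{eff}(0)}{T_{eff}(C)}
                               = \frac{t_{disk}}{T_{eff}(C)}.

Because t_disk is orders of magnitude larger than t_cache, the speedup climbs
steeply while h(C) is still growing and flattens once the hot working set
fits, giving the upside-down-"L" curve described above; nothing in the
expression involves the total RAM of the system under test.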
I say this: ARC in 8.0 PostgreSQL allows us to sensibly allocate as large a
shared_buffers cache as is required by the database workload, and this
should not be constrained to a small percentage of server RAM.

Best Regards,

Simon Riggs

> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org
> [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Josh Berkus
> Sent: 08 October 2004 22:43
> To: pgsql-performance@postgresql.org
> Cc: testperf-general@pgfoundry.org
> Subject: [PERFORM] First set of OSDL Shared Mem scalability results,
> some wierdness ...
>
> Folks,
>
> I'm hoping that some of you can shed some light on this.
>
> I've been trying to peg the "sweet spot" for shared memory using OSDL's
> equipment. With Jan's new ARC patch, I was expecting that the desired
> amount of shared_buffers would be greatly increased. This has not
> turned out to be the case.
>
> The first test series was using OSDL's DBT2 (OLTP) test, with 150
> "warehouses". All tests were run on a 4-way Pentium III 700MHz, 3.8GB RAM
> system hooked up to a rather high-end storage device (14 spindles).
> Tests were on PostgreSQL 8.0b3, Linux 2.6.7.
>
> Here's a top-level summary:
>
> shared_buffers   % RAM   NOTPM20*
> 1000             0.2%    1287
> 23000            5%      1507
> 46000            10%     1481
> 69000            15%     1382
> 92000            20%     1375
> 115000           25%     1380
> 138000           30%     1344
>
> * = New Order Transactions Per Minute, last 20 Minutes
>     Higher is better. The maximum possible is 1800.
>
> As you can see, the "sweet spot" appears to be between 5% and 10% of RAM,
> which is if anything *lower* than recommendations for 7.4!
>
> This result is so surprising that I want people to take a look at
> it and tell me if there's something wrong with the tests or some
> bottlenecking factor that I've not seen.
>
> In order above:
> http://khack.osdl.org/stp/297959/
> http://khack.osdl.org/stp/297960/
> http://khack.osdl.org/stp/297961/
> http://khack.osdl.org/stp/297962/
> http://khack.osdl.org/stp/297963/
> http://khack.osdl.org/stp/297964/
> http://khack.osdl.org/stp/297965/
>
> Please note that many of the graphs in these reports are broken. For one
> thing, some aren't recorded (flat lines) and the CPU usage graph has
> mislabeled lines.
>
> --
> --Josh
>
> Josh Berkus
> Aglio Database Solutions
> San Francisco
>
> ---------------------------(end of broadcast)---------------------------
> TIP 8: explain analyze is your friend
Simon,

<lots of good stuff clipped>

> If you draw a graph of speedup (y) against cache size as a
> % of total database size, the graph looks like an upside-down "L" - i.e.
> the graph rises steeply as you give it more memory, then turns sharply at
> a particular point, after which it flattens out. The "turning point" is
> the "sweet spot" we all seek - the optimum amount of cache memory to
> allocate - but this spot depends upon the workload and database size, not
> on available RAM on the system under test.

Hmmm ... how do you explain, then, the "camel hump" nature of the real
performance? That is, when we allocated even a few MB more than the
"optimum" ~190MB, overall performance started to drop quickly. The result is
that allocating 2x optimum RAM is nearly as bad as allocating too little
(e.g. 8MB).

The only explanation I've heard of this so far is that there is a
significant loss of efficiency with larger caches. Or do you see the loss of
200MB out of 3500MB would actually affect the kernel cache that much?

Anyway, one test of your theory that I can run immediately is to run the
exact same workload on a bigger, faster server and see if the desired
quantity of shared_buffers is roughly the same. I'm hoping that you're wrong
-- not because I don't find your argument persuasive, but because if you're
right it leaves us without any reasonable ability to recommend shared_buffer
settings.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
On Thu, 2004-10-14 at 16:57 -0700, Josh Berkus wrote:
> Simon,
>
> <lots of good stuff clipped>
>
> > If you draw a graph of speedup (y) against cache size as a
> > % of total database size, the graph looks like an upside-down "L" - i.e.
> > the graph rises steeply as you give it more memory, then turns sharply
> > at a particular point, after which it flattens out. The "turning point"
> > is the "sweet spot" we all seek - the optimum amount of cache memory to
> > allocate - but this spot depends upon the workload and database size,
> > not on available RAM on the system under test.
>
> Hmmm ... how do you explain, then, the "camel hump" nature of the real
> performance? That is, when we allocated even a few MB more than the
> "optimum" ~190MB, overall performance started to drop quickly. The result
> is that allocating 2x optimum RAM is nearly as bad as allocating too
> little (e.g. 8MB).
>
> The only explanation I've heard of this so far is that there is a
> significant loss of efficiency with larger caches. Or do you see the loss
> of 200MB out of 3500MB would actually affect the kernel cache that much?

In a past life there seemed to be a sweet spot around the application's
working set. Performance went up until you got just a little larger than
the cache needed to hold the working set and then went down. Most of the
time a nice-looking hump. It seems to have to do with the additional pages
not increasing your hit ratio but increasing the amount of work to get a hit
in cache. This seemed to be independent of the actual database software
being used. (I observed this running Oracle, Informix, Sybase and Ingres.)

> Anyway, one test of your theory that I can run immediately is to run the
> exact same workload on a bigger, faster server and see if the desired
> quantity of shared_buffers is roughly the same. I'm hoping that you're
> wrong -- not because I don't find your argument persuasive, but because
> if you're right it leaves us without any reasonable ability to recommend
> shared_buffer settings.

--
Timothy D. Witham - Chief Technology Officer - wookie@osdl.org
Open Source Development Lab Inc - A non-profit corporation
12725 SW Millikan Way - Suite 400 - Beaverton OR, 97005
(503)-626-2455 x11 (office) (503)-702-2871 (cell)
(503)-626-2436 (fax)
From: Christopher Browne
Subject: Re: First set of OSDL Shared Mem scalability results, some wierdness ...
Quoth simon@2ndquadrant.com ("Simon Riggs"):
> I say this: ARC in 8.0 PostgreSQL allows us to sensibly allocate as
> large a shared_buffers cache as is required by the database
> workload, and this should not be constrained to a small percentage
> of server RAM.

I don't think that this particularly follows from "what ARC does."

"What ARC does" is to prevent certain conspicuous patterns of sequential
accesses from essentially trashing the contents of the cache.

If a particular benchmark does not include conspicuous vacuums or sequential
scans on large tables, then there is little reason to expect ARC to have a
noticeable impact on performance.

It _could_ be that this implies that ARC allows you to get some use out of a
larger shared cache, as it won't get blown away by vacuums and Seq Scans.
But it is _not_ obvious that this is a necessary truth.

_Other_ truths we know about are:

 a) If you increase the shared cache, that means more data that is
    represented in both the shared cache and the OS buffer cache,
    which seems rather a waste;

 b) The larger the shared cache, the more pages there are for the
    backend to rummage through before it looks to the filesystem,
    and therefore the more expensive cache misses get. Cache hits
    get more expensive, too. Searching through memory is not
    costless.
--
(format nil "~S@~S" "cbbrowne" "acm.org")
http://linuxfinances.info/info/linuxdistributions.html
"The X-Files are too optimistic. The truth is *not* out there..."
-- Anthony Ord <nws@rollingthunder.co.uk>
Kevin Brown <kevin@sysexperts.com> writes:
> Hmm... something just occurred to me about this.

> Would a hybrid approach be possible? That is, use mmap() to handle
> reads, and use write() to handle writes?

Nope. Have you read the specs regarding mmap-vs-stdio synchronization?
Basically it says that there are no guarantees whatsoever if you try this.
The SUS text is a bit weaselly ("the application must ensure correct
synchronization") but the HPUX mmap man page, among others, lays it on the
line:

     It is also unspecified whether write references to a memory region
     mapped with MAP_SHARED are visible to processes reading the file and
     whether writes to a file are visible to processes that have mapped the
     modified portion of that file, except for the effect of msync().

It might work on particular OSes but I think depending on such behavior
would be folly...

			regards, tom lane
From: "Simon Riggs"
Subject: Re: [Testperf-general] Re: First set of OSDL Shared Mem scalability results, some wierdness ...
> Timothy D. Witham
> On Thu, 2004-10-14 at 16:57 -0700, Josh Berkus wrote:
> > Simon,
> >
> > <lots of good stuff clipped>
> >
> > > If you draw a graph of speedup (y) against cache size as a
> > > % of total database size, the graph looks like an upside-down "L" -
> > > i.e. the graph rises steeply as you give it more memory, then turns
> > > sharply at a particular point, after which it flattens out. The
> > > "turning point" is the "sweet spot" we all seek - the optimum amount
> > > of cache memory to allocate - but this spot depends upon the workload
> > > and database size, not on available RAM on the system under test.
> >
> > Hmmm ... how do you explain, then, the "camel hump" nature of the real
> > performance? That is, when we allocated even a few MB more than the
> > "optimum" ~190MB, overall performance started to drop quickly. The
> > result is that allocating 2x optimum RAM is nearly as bad as allocating
> > too little (e.g. 8MB).

Two ways of explaining this:

1. Once you've hit the optimum size of shared_buffers, you may not yet have
hit the optimum size of the OS cache. If that is true, every extra block
given to shared_buffers is wasted, yet detracts from the beneficial effect
of the OS cache. I don't see how the small drop in size of the OS cache
could have the effect you have measured, so I suggest that this possible
explanation doesn't fit the results well.

2. There is some algorithmic effect within PostgreSQL that makes larger
shared_buffers much worse than smaller ones. Imagine that each extra block
we hold in cache has the positive benefit from caching, minus a postulated
negative drag effect. With that model we would get: once the optimal size of
the cache has been reached, the positive benefit tails off to almost zero
and we are just left with the situation that each new block added to
shared_buffers acts as a further drag on performance. That model would fit
the results, so we can begin to look at what the drag effect might be.

Speculating wildly because I don't know that portion of the code, this might
be:

CONJECTURE 1: the act of searching for a block in cache is an O(n)
operation, not an O(1) or O(log n) operation - so searching a larger cache
has an additional slowing effect on the application, via a buffer cache lock
that is held while the cache is searched - larger caches are locked for
longer than smaller caches, so this causes additional contention in the
system, which then slows down performance.

The effect might show up by examining the oprofile results for the test
cases. What we would be looking for is something that is being called more
frequently with larger shared_buffers - this could be anything... but my
guess is the oprofile results won't be similar and could lead us to a better
understanding.

> > The only explanation I've heard of this so far is that there is a
> > significant loss of efficiency with larger caches. Or do you see the
> > loss of 200MB out of 3500MB would actually affect the kernel cache
> > that much?
>
> In a past life there seemed to be a sweet spot around the application's
> working set. Performance went up until you got just a little larger than
> the cache needed to hold the working set and then went down. Most of the
> time a nice-looking hump. It seems to have to do with the additional
> pages not increasing your hit ratio but increasing the amount of work to
> get a hit in cache. This seemed to be independent of the actual database
> software being used. (I observed this running Oracle, Informix, Sybase
> and Ingres.)

Good, our experiences seem to be similar.

> > Anyway, one test of your theory that I can run immediately is to run
> > the exact same workload on a bigger, faster server and see if the
> > desired quantity of shared_buffers is roughly the same.

I agree that you could test this by running on a bigger or smaller server,
i.e. one with more or less RAM. Running on a faster/slower server at the
same time might alter the results and confuse the situation.

> > I'm hoping that you're wrong -- not because I don't find your argument
> > persuasive, but because if you're right it leaves us without any
> > reasonable ability to recommend shared_buffer settings.

For the record, what I think we need is dynamically resizable
shared_buffers, not a-priori knowledge of what you should set shared_buffers
to. I've been thinking about implementing a scheme that helps you decide how
big the shared_buffers SHOULD BE, by making the LRU list bigger than the
cache itself, so you'd be able to see whether there is beneficial effect in
increasing shared_buffers.

... remember that this applies to other databases too, and with those we
find that they have dynamically resizable memory.

Having said all that, there are still a great many other performance tests
to run so that we CAN recommend other settings, such as the optimizer cost
parameters, bg writer defaults etc.

Best Regards,

Simon Riggs
2nd Quadrant
Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> > Hmm... something just occurred to me about this.
>
> > Would a hybrid approach be possible? That is, use mmap() to handle
> > reads, and use write() to handle writes?
>
> Nope. Have you read the specs regarding mmap-vs-stdio synchronization?
> Basically it says that there are no guarantees whatsoever if you try
> this. The SUS text is a bit weaselly ("the application must ensure
> correct synchronization") but the HPUX mmap man page, among others,
> lays it on the line:
>
>     It is also unspecified whether write references to a memory region
>     mapped with MAP_SHARED are visible to processes reading the file and
>     whether writes to a file are visible to processes that have mapped the
>     modified portion of that file, except for the effect of msync().
>
> It might work on particular OSes but I think depending on such behavior
> would be folly...

Yeah, and at this point it can't be considered portable in any real way
because of this. Thanks for the perspective. I should have expected the
general specification to be quite broken in this regard, not to mention
certain implementations. :-)

Good thing there's a lot of lower-hanging fruit than this...

--
Kevin Brown
kevin@sysexperts.com
Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
>
>> Hmm... something just occurred to me about this.
>>
>> Would a hybrid approach be possible? That is, use mmap() to handle
>> reads, and use write() to handle writes?
>
> Nope. Have you read the specs regarding mmap-vs-stdio synchronization?
> Basically it says that there are no guarantees whatsoever if you try
> this. The SUS text is a bit weaselly ("the application must ensure
> correct synchronization") but the HPUX mmap man page, among others,
> lays it on the line:
>
>     It is also unspecified whether write references to a memory region
>     mapped with MAP_SHARED are visible to processes reading the file and
>     whether writes to a file are visible to processes that have mapped the
>     modified portion of that file, except for the effect of msync().
>
> It might work on particular OSes but I think depending on such behavior
> would be folly...

We have some anecdotal experience along these lines: there was a set of
kernel bugs in Solaris 2.6 or 7 related to this as well. We had several
kernel panics and it took a bit to chase down, but the basic feedback was
"oops, we're screwed". I've forgotten most of the details right now; the
basic problem was a file being read+written via mmap and read()/write() at
(essentially) the same time from the same pid. It would panic the system
quite reliably. I believe the bugs related to this have been resolved in
Solaris, but it was unpleasant to chase that problem down...

--
Alan
From: Tom Lane
Subject: Re: [Testperf-general] Re: First set of OSDL Shared Mem scalability results, some wierdness ...
"Simon Riggs" <simon@2ndquadrant.com> writes: > Speculating wildly because I don't know that portion of the code this might > be: > CONJECTURE 1: the act of searching for a block in cache is an O(n) > operation, not an O(1) or O(log n) operation I'm not sure how this meme got into circulation, but I've seen a couple of people recently either conjecturing or asserting that. Let me remind people of the actual facts: 1. We use a hashtable to keep track of which blocks are currently in shared buffers. Either a cache hit or a cache miss should be O(1), because the hashtable size is scaled proportionally to shared_buffers, and so the number of hash entries examined should remain constant. 2. There are some allegedly-not-performance-critical operations that do scan through all the buffers, and therefore are O(N) in shared_buffers. I just eyeballed all the latter, and came up with this list of O(N) operations and their call points: AtEOXact_Buffers transaction commit or abort UnlockBuffers transaction abort, backend exit StrategyDirtyBufferList background writer's idle loop FlushRelationBuffers VACUUM DROP TABLE, DROP INDEX TRUNCATE, CLUSTER, REINDEX ALTER TABLE SET TABLESPACE DropRelFileNodeBuffers TRUNCATE (only for ON COMMIT TRUNC temp tables) REINDEX (inplace case only) smgr_internal_unlink (ie, the tail end of DROP TABLE/INDEX) DropBuffers DROP DATABASE The fact that the first two are called during transaction commit/abort is mildly alarming. The constant factors are going to be very tiny though, because what these routines actually do is scan backend-local status arrays looking for locked buffers, which they're not going to find very many of. For instance AtEOXact_Buffers looks like int i; for (i = 0; i < NBuffers; i++) { if (PrivateRefCount[i] != 0) { // some code that should never be executed at all in the commit // case, and not that much in the abort case either } } I suppose with hundreds of thousands of shared buffers this might get to the point of being noticeable, but I've never seen it show up at all in profiling with more-normal buffer counts. Not sure if it's worth devising a more complex data structure to aid in finding locked buffers. (To some extent this code is intended to be belt-and-suspenders stuff for catching omissions elsewhere, and so a more complex data structure that could have its own bugs is not especially attractive.) The one that's bothering me at the moment is StrategyDirtyBufferList, which is a new overhead in 8.0. It wouldn't directly affect foreground query performance, but indirectly it would hurt by causing the bgwriter to suck more CPU cycles than one would like (and it holds the BufMgrLock while it's doing it, too :-(). One easy way you could see whether this is an issue in the OSDL test is to see what happens if you double all three bgwriter parameters (delay, percent, maxpages). This should result in about the same net I/O demand from the bgwriter, but StrategyDirtyBufferList will be executed half as often. I doubt that the other ones are issues. We could improve them by devising a way to quickly find all buffers for a given relation, but I am just about sure that complicating the buffer management to do so would be a net loss for normal workloads. > For the record, what I think we need is dynamically resizable > shared_buffers, not a-priori knowledge of what you should set > shared_buffers to. This isn't likely to happen because the SysV shared memory API isn't conducive to it. 
Absent some amazingly convincing demonstration that we have to have it, the effort of making it happen in a portable way isn't going to get spent. > I've been thinking about implementing a scheme that helps you decide how > big the shared_buffers SHOULD BE, by making the LRU list bigger than the > cache itself, so you'd be able to see whether there is beneficial effect in > increasing shared_buffers. ARC already keeps such a list --- couldn't you learn what you want to know from the existing data structure? It'd be fairly cool if we could put out warnings "you ought to increase shared_buffers" analogous to the existing facility for noting excessive checkpointing. regards, tom lane
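For reference, Tom's suggested experiment amounts to a postgresql.conf change
along these lines. The parameter names are the 8.0 GUCs named above; the
baseline numbers shown are assumed defaults and should be replaced with
whatever the OSDL runs actually used:

    # Double all three, keeping net bgwriter I/O roughly constant while
    # running StrategyDirtyBufferList half as often (illustrative values).
    bgwriter_delay    = 400     # was 200 (milliseconds between rounds)
    bgwriter_percent  = 2       # was 1
    bgwriter_maxpages = 200     # was 100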
Tom Lane wrote:
> > I've been thinking about implementing a scheme that helps you decide how
> > big the shared_buffers SHOULD BE, by making the LRU list bigger than the
> > cache itself, so you'd be able to see whether there is beneficial effect
> > in increasing shared_buffers.
>
> ARC already keeps such a list --- couldn't you learn what you want to
> know from the existing data structure? It'd be fairly cool if we could
> put out warnings "you ought to increase shared_buffers" analogous to the
> existing facility for noting excessive checkpointing.

Agreed. ARC already keeps a list of buffers it had to push out recently so
if it needs them again soon it knows its sizing of recent/frequent might be
off (I think). Anyway, such a log report would be super-cool, say if you
pushed out a buffer and needed it very soon, and the ARC buffers are already
at their maximum for that buffer pool.

--
Bruce Momjian                        | http://candle.pha.pa.us
pgman@candle.pha.pa.us               | (610) 359-1001
+ If your life is a hard drive,      | 13 Roberts Road
+ Christ can be your backup.         | Newtown Square, Pennsylvania 19073
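A rough sketch of the kind of check Bruce and Tom are describing (purely
hypothetical names; the real ARC directory bookkeeping is more involved)
would, on a cache miss, consult the list of recently pushed-out buffers and
emit a hint when a block comes back "very soon" while the cache is already
at its maximum:

    /* Hypothetical sketch: warn when a recently evicted block is requested
     * again soon, suggesting shared_buffers may be too small. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct
    {
        uint32_t rel_oid;
        uint32_t block_num;
        uint64_t evicted_at;    /* buffer-access counter at eviction time */
    } EvictedEntry;

    extern bool     ghost_list_lookup(uint32_t rel_oid, uint32_t block_num,
                                      EvictedEntry *entry);
    extern uint64_t access_counter;
    extern bool     cache_is_full;

    #define SOON_THRESHOLD 10000   /* "needed it very soon", in accesses */

    static void
    note_cache_miss(uint32_t rel_oid, uint32_t block_num)
    {
        EvictedEntry e;

        if (cache_is_full &&
            ghost_list_lookup(rel_oid, block_num, &e) &&
            access_counter - e.evicted_at < SOON_THRESHOLD)
        {
            fprintf(stderr,
                    "HINT: block (%u,%u) was evicted only %llu accesses ago; "
                    "consider increasing shared_buffers\n",
                    (unsigned) rel_oid, (unsigned) block_num,
                    (unsigned long long) (access_counter - e.evicted_at));
        }
    }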
From: "Simon Riggs"
Subject: Re: [Testperf-general] Re: First set of OSDL Shared Mem scalability results, some wierdness ...
> Bruce Momjian > Tom Lane wrote: > > > I've been thinking about implementing a scheme that helps you > decide how > > > big the shared_buffers SHOULD BE, by making the LRU list > bigger than the > > > cache itself, so you'd be able to see whether there is > beneficial effect in > > > increasing shared_buffers. > > > > ARC already keeps such a list --- couldn't you learn what you want to > > know from the existing data structure? It'd be fairly cool if we could > > put out warnings "you ought to increase shared_buffers" analogous to the > > existing facility for noting excessive checkpointing. First off, many thanks for taking the time to provide the real detail on the code. That gives us some much needed direction in interpreting the oprofile output. > > Agreed. ARC already keeps a list of buffers it had to push out recently > so if it needs them again soon it knows its sizing of recent/frequent > might be off (I think). Anyway, such a log report would be super-cool, > say if you pushed out a buffer and needed it very soon, and the ARC > buffers are already at their maximum for that buffer pool. > OK, I guess I hadn't realised we were half-way there. The "increase shared_buffers" warning would be useful, but it would be much cooler to have some guidance as to how big to set it, especially since this requires a restart of the server. What I had in mind was a way of keeping track of how the buffer cache hit ratio would look at various sizes of shared_buffers, for example 50%, 80%, 120%, 150%, 200% and 400% say. That way you'd stand a chance of plotting the curve and thereby assessing how much memory could be allocated. I've got a few ideas, but I need to check out the code first. I'll investigate both simple/complex options as an 8.1 feature. Best Regards, Simon Riggs
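For what it's worth, a very rough sketch of what Simon describes might look like the following -- a recency list longer than the real cache, used to count how many references *would* have been hits at several hypothetical shared_buffers sizes. It simulates a plain LRU list, as Simon's wording suggests (ARC's recent/frequent split is ignored), and every name in it is invented for illustration:

```c
/*
 * Very rough sketch, invented names throughout -- not PostgreSQL code.
 * Keep a recency-ordered list of block tags that is larger than the real
 * cache, and count how many references would have been hits at several
 * hypothetical shared_buffers sizes (classic LRU stack-distance counting).
 */
#include <string.h>

#define TRACKED_TAGS 4096               /* assumption: ~4x the real cache */

typedef struct { unsigned rel; unsigned blocknum; } BlockTag;

static BlockTag recency[TRACKED_TAGS];  /* recency[0] = most recently used */
static int      nrecency = 0;

static const double size_factor[] = {0.5, 0.8, 1.2, 1.5, 2.0, 4.0};
static long         would_hit[6];
static long         total_refs;

static void
note_reference(BlockTag tag, int shared_buffers)
{
    int pos, i;

    total_refs++;

    /* find the tag's current recency rank (linear scan for clarity only) */
    for (pos = 0; pos < nrecency; pos++)
        if (recency[pos].rel == tag.rel && recency[pos].blocknum == tag.blocknum)
            break;

    if (pos < nrecency)
    {
        /* an LRU cache holding 'size' pages would have hit iff pos < size */
        for (i = 0; i < 6; i++)
            if (pos < (int) (size_factor[i] * shared_buffers))
                would_hit[i]++;
        /* drop the old copy at 'pos' by shifting entries 0..pos-1 up one slot */
        memmove(&recency[1], &recency[0], pos * sizeof(BlockTag));
    }
    else
    {
        /* miss: push everything down, dropping the oldest entry if full */
        int keep = (nrecency < TRACKED_TAGS) ? nrecency++ : TRACKED_TAGS - 1;

        memmove(&recency[1], &recency[0], keep * sizeof(BlockTag));
    }
    recency[0] = tag;       /* newly referenced tag becomes most recent */
}
```

Reporting would_hit[i] / total_refs for each factor at intervals would give exactly the kind of curve Simon wants to plot, at the price of extra bookkeeping on every buffer lookup.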
Re: [Testperf-general] Re: First set of OSDL Shared Mem scalability results, some wierdness ...
From: Josh Berkus
People: > First off, many thanks for taking the time to provide the real detail on > the code. > > That gives us some much needed direction in interpreting the oprofile > output. I have some OProfile output; however, it covers only 2 of the 20 tests I ran recently, and I need to get those sorted out. --Josh -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Re: [Testperf-general] Re: First set of OSDL Shared Mem scalability results, some wierdness ...
From: Josh Berkus
Tom, Simon: First off, two test runs with OProfile are available at: http://khack.osdl.org/stp/298124/ http://khack.osdl.org/stp/298121/ > AtEOXact_Buffers > transaction commit or abort > UnlockBuffers > transaction abort, backend exit Actually, this might explain the "hump" shape of the curve for this test. DBT2 is an OLTP test, which means that (at this scale level) it's attempting to do approximately 30 COMMITs per second as well as one ROLLBACK every 3 seconds. When I get the tests on DBT3 running, if we see a more gentle dropoff on overallocated memory, it would indicate that the above may be a factor. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
> this. The SUS text is a bit weaselly ("the application must ensure > correct synchronization") but the HPUX mmap man page, among others, > lays it on the line: > > It is also unspecified whether write references to a memory region > mapped with MAP_SHARED are visible to processes reading the file > and > whether writes to a file are visible to processes that have > mapped the > modified portion of that file, except for the effect of msync(). > > It might work on particular OSes but I think depending on such behavior > would be folly... Agreed. Only OSes with a coherent file system buffer cache should ever use mmap(2). In order for this to work on HPUX, msync(2) would need to be used. -sc -- Sean Chittenden
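To spell out what that means in practice, here is a bare-bones illustration (the file name and 8 KB page size are placeholders) of the extra step a MAP_SHARED writer needs on such platforms before other processes reading the file -- rather than the mapping -- are guaranteed to see the change:

```c
/*
 * Minimal illustration of the point above: on a platform without a unified
 * buffer cache, a writer must msync() a MAP_SHARED region before the change
 * is guaranteed to be visible via ordinary file reads.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int
main(void)
{
    const size_t page_sz = 8192;
    int fd = open("datafile", O_RDWR);

    if (fd < 0)
        return 1;

    char *page = mmap(NULL, page_sz, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (page == MAP_FAILED)
        return 1;

    memcpy(page, "modified tuple data", 19);    /* modify the mapped page */

    /* Without this, SUS does not promise the write is visible via read(2). */
    if (msync(page, page_sz, MS_SYNC) != 0)
        perror("msync");

    munmap(page, page_sz);
    close(fd);
    return 0;
}
```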
Re: [Testperf-general] Re: First set of OSDL Shared Mem scalability results, some wierdness ...
From: Tom Lane
Josh Berkus <josh@agliodbs.com> writes:
> First off, two test runs with OProfile are available at:
> http://khack.osdl.org/stp/298124/
> http://khack.osdl.org/stp/298121/

Hmm. The stuff above 1% in the first of these is

Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        app name   symbol name
8522858  19.7539  vmlinux    default_idle
3510225   8.1359  vmlinux    recalc_sigpending_tsk
1874601   4.3449  vmlinux    .text.lock.signal
1653816   3.8331  postgres   SearchCatCache
1080908   2.5053  postgres   AllocSetAlloc
 920369   2.1332  postgres   AtEOXact_Buffers
 806218   1.8686  postgres   OpernameGetCandidates
 803125   1.8614  postgres   StrategyDirtyBufferList
 746123   1.7293  vmlinux    __copy_from_user_ll
 651978   1.5111  vmlinux    __copy_to_user_ll
 640511   1.4845  postgres   XLogInsert
 630797   1.4620  vmlinux    rm_from_queue
 607833   1.4088  vmlinux    next_thread
 436682   1.0121  postgres   LWLockAcquire
 419672   0.9727  postgres   yyparse

In the second test AtEOXact_Buffers is much lower (down around 0.57 percent) but the other suspects are similar. Since the only difference in parameters is shared_buffers (36000 vs 9000), it does look like we are approaching the point where AtEOXact_Buffers is a problem, but so far it's only a 2% drag.

I suspect the reason recalc_sigpending_tsk is so high is that the original coding of PG_TRY involved saving and restoring the signal mask, which led to a whole lot of sigsetmask-type kernel calls. Is this test with beta3, or something older?

Another interesting item here is the costs of __copy_from_user_ll/__copy_to_user_ll:

36000 buffers:
 746123   1.7293  vmlinux    __copy_from_user_ll
 651978   1.5111  vmlinux    __copy_to_user_ll

9000 buffers:
 866414   2.0810  vmlinux    __copy_from_user_ll
 852620   2.0479  vmlinux    __copy_to_user_ll

Presumably the higher costs for 9000 buffers reflect an increased amount of shuffling of data between kernel and user space. So 36000 is not enough to make the working set totally memory-resident, but even if we drove this cost to zero we'd only be buying a couple percent.

			regards, tom lane
Re: [Testperf-general] Re: First set of OSDL Shared Mem scalability results, some wierdness ...
From: Josh Berkus
Tom, > I suspect the reason recalc_sigpending_tsk is so high is that the > original coding of PG_TRY involved saving and restoring the signal mask, > which led to a whole lot of sigsetmask-type kernel calls. Is this test > with beta3, or something older? Beta3, *without* Gavin or Neil's Futex patch. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Re: [Testperf-general] Re: First set of OSDL Shared Mem scalability results, some wierdness ...
From: Tom Lane
Josh Berkus <josh@agliodbs.com> writes: >> I suspect the reason recalc_sigpending_tsk is so high is that the >> original coding of PG_TRY involved saving and restoring the signal mask, >> which led to a whole lot of sigsetmask-type kernel calls. Is this test >> with beta3, or something older? > Beta3, *without* Gavin or Neil's Futex patch. Hmm, in that case the cost deserves some further investigation. Can we find out just what that routine does and where it's being called from? regards, tom lane
Re: [Testperf-general] Re: First set of OSDL Shared Mem scalability results, some wierdness ...
From: Mark Wong
On Fri, Oct 15, 2004 at 05:27:29PM -0400, Tom Lane wrote: > Josh Berkus <josh@agliodbs.com> writes: > >> I suspect the reason recalc_sigpending_tsk is so high is that the > >> original coding of PG_TRY involved saving and restoring the signal mask, > >> which led to a whole lot of sigsetmask-type kernel calls. Is this test > >> with beta3, or something older? > > > Beta3, *without* Gavin or Neil's Futex patch. > > Hmm, in that case the cost deserves some further investigation. Can we > find out just what that routine does and where it's being called from? > There's a call-graph feature with oprofile as of version 0.8 with the opstack tool, but I'm having a terrible time figuring out why the output isn't doing the graphing part. Otherwise, I'd have that available already... Mark
Re: [Testperf-general] Re: First set of OSDL Shared Mem scalability results, some wierdness ...
From: Tom Lane
Mark Wong <markw@osdl.org> writes: > On Fri, Oct 15, 2004 at 05:27:29PM -0400, Tom Lane wrote: >> Hmm, in that case the cost deserves some further investigation. Can we >> find out just what that routine does and where it's being called from? > There's a call-graph feature with oprofile as of version 0.8 with > the opstack tool, but I'm having a terrible time figuring out why the > output isn't doing the graphing part. Otherwise, I'd have that > available already... I was wondering if this might be associated with do_sigaction. do_sigaction is only 0.23 percent of the runtime according to the oprofile results: http://khack.osdl.org/stp/298124/oprofile/DBT_2_Profile-all.oprofile.txt but the profile results for the same run: http://khack.osdl.org/stp/298124/profile/DBT_2_Profile-tick.sort show do_sigaction very high and recalc_sigpending_tsk nowhere at all. Something funny there. regards, tom lane
Re: [Testperf-general] Re: First set of OSDL Shared Mem scalability results, some wierdness ...
From: Mark Wong
On Fri, Oct 15, 2004 at 05:44:34PM -0400, Tom Lane wrote: > Mark Wong <markw@osdl.org> writes: > > On Fri, Oct 15, 2004 at 05:27:29PM -0400, Tom Lane wrote: > >> Hmm, in that case the cost deserves some further investigation. Can we > >> find out just what that routine does and where it's being called from? > > > There's a call-graph feature with oprofile as of version 0.8 with > > the opstack tool, but I'm having a terrible time figuring out why the > > output isn't doing the graphing part. Otherwise, I'd have that > > available already... > > I was wondering if this might be associated with do_sigaction. > do_sigaction is only 0.23 percent of the runtime according to the > oprofile results: > http://khack.osdl.org/stp/298124/oprofile/DBT_2_Profile-all.oprofile.txt > but the profile results for the same run: > http://khack.osdl.org/stp/298124/profile/DBT_2_Profile-tick.sort > show do_sigaction very high and recalc_sigpending_tsk nowhere at all. > Something funny there. > I have always attributed those kinds of differences to how readprofile and oprofile collect their data, though I admit I don't exactly understand it. Is anyone familiar with the differences between the two? Mark
Getting rid of AtEOXact_Buffers (was Re: [Testperf-general] Re: First set of OSDL Shared Mem scalability results, some wierdness ...)
From: Tom Lane
I wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> First off, two test runs with OProfile are available at:
>> http://khack.osdl.org/stp/298124/
>> http://khack.osdl.org/stp/298121/

> Hmm. The stuff above 1% in the first of these is

> Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 100000
> samples  %        app name   symbol name
> ...
> 920369   2.1332   postgres   AtEOXact_Buffers
> ...

> In the second test AtEOXact_Buffers is much lower (down around 0.57 percent) but the other suspects are similar. Since the only difference in parameters is shared_buffers (36000 vs 9000), it does look like we are approaching the point where AtEOXact_Buffers is a problem, but so far it's only a 2% drag.

It occurs to me that given the 8.0 resource manager mechanism, we could in fact dispense with AtEOXact_Buffers, or perhaps better turn it into a no-op unless #ifdef USE_ASSERT_CHECKING. We'd just get rid of the special case for transaction termination in resowner.c and let the resource owner be responsible for releasing locked buffers always. The OSDL results suggest that this won't matter much at the level of 10000 or so shared buffers, but for 100000 or more buffers the linear scan in AtEOXact_Buffers is going to become a problem.

We could also get rid of the linear search in UnlockBuffers(). The only thing it's for anymore is to release a BM_PIN_COUNT_WAITER flag, and since a backend could not be doing more than one of those at a time, we don't really need an array of flags for that, only a single variable. This does not show in the OSDL results, which I presume means that their test case is not exercising transaction aborts; but I think we need to zap both routines to make the world safe for large shared_buffers values. (See also http://archives.postgresql.org/pgsql-performance/2004-10/msg00218.php)

Any objection to doing this for 8.0?

			regards, tom lane
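A sketch of the UnlockBuffers() half of that proposal might look like the following (names invented here, not the committed change): remember the single buffer, if any, on which this backend set the pin-count-waiter flag, so abort-time cleanup touches one buffer instead of looping over all of them.

```c
/*
 * Illustration only -- invented names, not the actual patch.  A backend can
 * be waiting on at most one buffer's pin count, so a single variable is
 * enough; abort cleanup then clears one flag instead of scanning NBuffers.
 */
#define NO_WAIT_BUFFER (-1)

static int pin_count_wait_buf = NO_WAIT_BUFFER;     /* backend-local */

/* Record the buffer before sleeping until its pin count drops. */
static void
start_pin_count_wait(int buf_id)
{
    pin_count_wait_buf = buf_id;
}

/* Abort/exit cleanup: O(1) instead of a loop over every shared buffer. */
static void
unlock_buffers_sketch(void)
{
    if (pin_count_wait_buf != NO_WAIT_BUFFER)
    {
        /* clear the BM_PIN_COUNT_WAITER flag on that buffer's header here */
        pin_count_wait_buf = NO_WAIT_BUFFER;
    }
}
```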
Re: Getting rid of AtEOXact_Buffers (was Re: [Testperf-general] Re: First set of OSDL Shared Mem scalability results, some wierdness ...)
From: Josh Berkus
Tom, > We could also get rid of the linear search in UnlockBuffers(). The only > thing it's for anymore is to release a BM_PIN_COUNT_WAITER flag, and > since a backend could not be doing more than one of those at a time, > we don't really need an array of flags for that, only a single variable. > This does not show in the OSDL results, which I presume means that their > test case is not exercising transaction aborts; In the test, one out of every 100 new order transactions is aborted (about 1 out of 150 transactions overall). -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Re: Getting rid of AtEOXact_Buffers (was Re: [Testperf-general] Re: First set of OSDL Shared Mem scalability results, some wierdness ...)
From: Tom Lane
Josh Berkus <josh@agliodbs.com> writes: >> This does not show in the OSDL results, which I presume means that their >> test case is not exercising transaction aborts; > In the test, one out of every 100 new order transactions is aborted (about 1 > out of 150 transactions overall). Okay, but that just ensures that any bottlenecks in xact abort will be down in the noise in this test case ... In any case, those changes are in CVS now if you want to try them. regards, tom lane
Re: Getting rid of AtEOXact_Buffers (was Re: [Testperf-general] Re: First set of OSDL Shared Mem scalability results, some wierdness ...)
From: Josh Berkus
Tom, > In any case, those changes are in CVS now if you want to try them. OK. Will have to wait until OSDL gives me a dedicated testing machine sometime mon/tues/wed. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
On 10/14/2004 6:36 PM, Simon Riggs wrote: > [...] > I think Jan has said this also in far fewer words, but I'll leave that to > Jan to agree/disagree... I do agree. The total DB size has as little to do with the optimum shared buffer cache size as the total available RAM of the machine does. After reading your comments it appears clearer to me: all those tests really show is the amount of frequently accessed data in this particular database population and workload combination. > > I say this: ARC in 8.0 PostgreSQL allows us to sensibly allocate as large a > shared_buffers cache as is required by the database workload, and this > should not be constrained to a small percentage of server RAM. Right. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
On 10/14/2004 8:10 PM, Christopher Browne wrote:
> Quoth simon@2ndquadrant.com ("Simon Riggs"):
>> I say this: ARC in 8.0 PostgreSQL allows us to sensibly allocate as
>> large a shared_buffers cache as is required by the database
>> workload, and this should not be constrained to a small percentage
>> of server RAM.
>
> I don't think that this particularly follows from "what ARC does."

The combination of ARC together with the background writer is supposed to allow us to allocate the optimum even if that is large. The former implementation of the LRU without a background writer would just hang the server for a long time during a checkpoint, which is absolutely unacceptable for any OLTP system.

Jan

> "What ARC does" is to prevent certain conspicuous patterns of
> sequential accesses from essentially trashing the contents of the
> cache.
>
> If a particular benchmark does not include conspicuous vacuums or
> sequential scans on large tables, then there is little reason to
> expect ARC to have a noticeable impact on performance.
>
> It _could_ be that this implies that ARC allows you to get some use
> out of a larger shared cache, as it won't get blown away by vacuums
> and Seq Scans. But it is _not_ obvious that this is a necessary
> truth.
>
> _Other_ truths we know about are:
>
> a) If you increase the shared cache, that means more data that is
>    represented in both the shared cache and the OS buffer cache,
>    which seems rather a waste;
>
> b) The larger the shared cache, the more pages there are for the
>    backend to rummage through before it looks to the filesystem,
>    and therefore the more expensive cache misses get. Cache hits
>    get more expensive, too. Searching through memory is not
>    costless.

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
Re: [Testperf-general] Re: First set of OSDL Shared Mem scalability results, some wierdness ...
From: Josh Berkus
Simon, > I agree that you could test this by running on a bigger or smaller server, > i.e. one with more or less RAM. Running on a faster/slower server at the > same time might alter the results and confuse the situation. Unfortunately, a faster server is the only option I have that also has more RAM. If I double the RAM and double the processors at the same time, what would you expect to happen to the shared_buffers curve? -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Simon, Folks, I've put links to all of my OSDL-STP test results up on the TestPerf project: http://pgfoundry.org/forum/forum.php?thread_id=164&forum_id=160 Share & Enjoy! -- --Josh Josh Berkus Aglio Database Solutions San Francisco
On Sat, 9 Oct 2004, Tom Lane wrote: > mmap provides msync which is comparable to fsync, but AFAICS it > provides no way to prevent an in-memory change from reaching disk too > soon. This would mean that WAL entries would have to be written *and > flushed* before we could make the data change at all, which would > convert multiple updates of a single page into a series of write-and- > wait-for-WAL-fsync steps. Not good. fsync'ing WAL once per transaction > is bad enough, once per atomic action is intolerable. Back when I was working out how to do this, I reckoned that you could use mmap by keeping a write queue for each modified page. Reading, you'd have to read the datum from the page and then check the write queue for that page to see if that datum had been updated, using the new value if it's there. Writing, you'd add the modified datum to the write queue, but not apply the write queue to the page until you'd had confirmation that the corresponding transaction log entry had been written. So multiple writes are no big deal; they just all queue up in the write queue, and at any time you can apply as much of the write queue to the page itself as the current log entry will allow. There are several different strategies available for mapping and unmapping the pages, and in fact there might need to be several available to get the best performance out of different systems. Most OSes do not seem to be optimized for having thousands or tens of thousands of small mappings (certainly NetBSD isn't), but I've never done any performance tests to see what kind of strategies might work well or not. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.NetBSD.org Make up enjoying your city life...produced by BIC CAMERA
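To make the idea concrete, here is a rough, simplified sketch of such a per-page write queue (all names, sizes, and the XLOG-position type are invented for illustration; locking, overflow handling, and page replacement are ignored):

```c
/*
 * Rough sketch of the scheme described above, invented names throughout.
 * Each modified page keeps a queue of (offset, length, data, required-WAL
 * position) entries; the queue is applied to the mmap'ed page only once
 * WAL has been flushed at least that far.
 */
#include <string.h>

typedef unsigned long XLogPos;          /* stand-in for an XLOG position */

#define MAX_QUEUED 32
#define MAX_DATUM  64

typedef struct
{
    size_t   offset;                    /* where in the page */
    size_t   len;
    char     data[MAX_DATUM];
    XLogPos  wal_needed;                /* WAL must be flushed to here first */
} QueuedWrite;

typedef struct
{
    char        *page;                  /* the mmap'ed 8 KB page */
    QueuedWrite  queue[MAX_QUEUED];
    int          nqueued;
} PageWriteQueue;

/* Writing: record the change but don't touch the mapped page yet. */
static void
queue_write(PageWriteQueue *q, size_t off, const void *data, size_t len,
            XLogPos wal_needed)
{
    QueuedWrite *w = &q->queue[q->nqueued++];   /* overflow handling omitted */

    w->offset = off;
    w->len = len;
    w->wal_needed = wal_needed;
    memcpy(w->data, data, len);
}

/* Reading: the newest queued value wins over the page contents. */
static void
read_datum(const PageWriteQueue *q, size_t off, void *out, size_t len)
{
    int i;

    memcpy(out, q->page + off, len);
    for (i = 0; i < q->nqueued; i++)
        if (q->queue[i].offset == off && q->queue[i].len == len)
            memcpy(out, q->queue[i].data, len);
}

/* Apply as much of the queue as the current WAL flush point allows. */
static void
apply_queue(PageWriteQueue *q, XLogPos wal_flushed_to)
{
    int i, kept = 0;

    for (i = 0; i < q->nqueued; i++)
    {
        if (q->queue[i].wal_needed <= wal_flushed_to)
            memcpy(q->page + q->queue[i].offset,
                   q->queue[i].data, q->queue[i].len);
        else
            q->queue[kept++] = q->queue[i];
    }
    q->nqueued = kept;
}
```

The point of the structure shows up in apply_queue(): nothing reaches the mmap'ed page -- and hence, potentially, the disk -- until WAL has been flushed past the position recorded with the queued write, which is the ordering guarantee Tom worries mmap alone cannot provide.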
Curt Sampson <cjs@cynic.net> writes: > Back when I was working out how to do this, I reckoned that you could > use mmap by keeping a write queue for each modified page. Reading, > you'd have to read the datum from the page and then check the write > queue for that page to see if that datum had been updated, using the > new value if it's there. Writing, you'd add the modified datum to the > write queue, but not apply the write queue to the page until you'd had > confirmation that the corresponding transaction log entry had been > written. So multiple writes are no big deal; they just all queue up in > the write queue, and at any time you can apply as much of the write > queue to the page itself as the current log entry will allow. Seems to me the overhead of any such scheme would swamp the savings from avoiding kernel/userspace copies ... the locking issues alone would be painful. regards, tom lane
On Sat, 23 Oct 2004, Tom Lane wrote: > Seems to me the overhead of any such scheme would swamp the savings from > avoiding kernel/userspace copies ... Well, one really can't know without testing, but memory copies are extremely expensive if they go outside of the cache. > the locking issues alone would be painful. I don't see why they would be any more painful than the current locking issues. In fact, I don't see any reason to add more locking than we already use when updating pages. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.NetBSD.org Make up enjoying your city life...produced by BIC CAMERA
Curt Sampson <cjs@cynic.net> writes: > On Sat, 23 Oct 2004, Tom Lane wrote: >> Seems to me the overhead of any such scheme would swamp the savings from >> avoiding kernel/userspace copies ... > Well, one really can't know without testing, but memory copies are > extremely expensive if they go outside of the cache. Sure, but what about all the copying from write queue to page? >> the locking issues alone would be painful. > I don't see why they would be any more painful than the current locking > issues. Because there are more locks --- the write queue data structure will need to be locked separately from the page. (Even with a separate write queue per page, there will need to be a shared data structure that allows you to allocate and find write queues, and that thing will be a subject of contention. See BufMgrLock, which is not held while actively twiddling the contents of pages, but is a serious cause of contention anyway.) regards, tom lane
On Sun, 24 Oct 2004, Tom Lane wrote: > > Well, one really can't know without testing, but memory copies are > > extremely expensive if they go outside of the cache. > > Sure, but what about all the copying from write queue to page? There's a pretty big difference between few-hundred-bytes-on-write and eight-kilobytes-with-every-read memory copy. As for the queue allocation, again, I have no data to back this up, but I don't think it would be as bad as BufMgrLock. Not every page will have a write queue, and a "hot" page is only going to get one once. (If a page has a write queue, you might as well leave it with the page after flushing it, and get rid of it only when the page leaves memory.) I see the OS issues related to mapping that much memory as a much bigger potential problem. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.NetBSD.org Make up enjoying your city life...produced by BIC CAMERA
Curt Sampson <cjs@cynic.net> writes: > I see the OS issues related to mapping that much memory as a much bigger > potential problem. I see potential problems everywhere I look ;-) Considering that the available numbers suggest we could win just a few percent (and that's assuming that all this extra mechanism has zero cost), I can't believe that the project is worth spending manpower on. There is a lot of much more attractive fruit hanging at lower levels. The bitmap-indexing stuff that was recently being discussed, for instance, would certainly take less effort than this; it would create no new portability issues; and at least for the queries where it helps, it could offer integer-multiple speedups, not percentage points. My engineering professors taught me that you put large effort where you have a chance at large rewards. Converting PG to mmap doesn't seem to meet that test, even if I believed it would work. regards, tom lane
On Sun, 24 Oct 2004, Tom Lane wrote: > Considering that the available numbers suggest we could win just a few > percent... I must confess that I was completely unaware of these "numbers." Where do I find them? cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.NetBSD.org Make up enjoying your city life...produced by BIC CAMERA
Curt Sampson <cjs@cynic.net> writes: > On Sun, 24 Oct 2004, Tom Lane wrote: >> Considering that the available numbers suggest we could win just a few >> percent... > I must confess that I was completely unaware of these "numbers." Where > do I find them? The only numbers I've seen that directly bear on the question is the oprofile results that Josh recently put up for the DBT-3 benchmark, which showed the kernel copy-to-userspace and copy-from-userspace subroutines eating a percent or two apiece of the total runtime. I don't have the URL at hand but it was posted just a few days ago. (Now that covers all such copies and not only our datafile reads/writes, but it's probably fair to assume that the datafile I/O is the bulk of it.) This is, of course, only one benchmark ... but lacking any measurements in opposition, I'm inclined to believe it. regards, tom lane
I wrote: > I don't have the URL at hand but it was posted just a few days ago. ... actually, it was the beginning of this here thread ... regards, tom lane