Thread: Bug: Buffer cache is not scan resistant
I'm putting this out there before we publish a fix so that we can discuss how best to fix it.

Doug and Sherry recently found the source of an important performance issue with the Postgres shared buffer cache. The issue is summarized like this: the buffer cache in PGSQL is not "scan resistant" as advertised. A sequential scan of a table larger than cache will pollute the buffer cache in almost all circumstances.

Here is the performance of GPDB 2.301 (Postgres 8.1.6) on a single X4500 (thumper-3) with 4 cores, where "bigtable" is a table 2x the size of RAM and "memtable" is a table that fits into the I/O cache. With our default setting of shared_buffers (16MB):

Operation         memtable    bigtable
---------------------------------------
SELECT COUNT(*)   1221 MB/s    973 MB/s
VACUUM            1709 MB/s   1206 MB/s

We had observed that VACUUM would perform better when done right after a SELECT. In the above example, the faster rate from disk was 1608 MB/s, compared to the normal rate of 1206 MB/s. We verified this behavior on Postgres 8.2 as well.

The buffer selection algorithm is choosing buffer pages scattered throughout the buffer cache in almost all circumstances. Sherry traced the behavior to the processor repeatedly flushing the L2 cache. Doug found that we weren't using the Postgres buffer cache the way we expected: we were loading the scanned data from disk into the cache even though there was no possibility of reusing it. In addition to pushing other, possibly useful pages from the cache, this has the additional effect of invalidating the L2 cache for the remainder of the executor path that uses the data.

To prove that the buffer cache was the source of the problem, we dropped the shared buffer size to fit into L2 cache (1MB per Opteron core). This is what we saw with the size of shared buffers dropped to 680KB:

Operation         memtable    bigtable
---------------------------------------
SELECT COUNT(*)   1320 MB/s   1059 MB/s
VACUUM            3033 MB/s   1597 MB/s

These results do not vary with the order of operations.

Thoughts on the best way to fix the buffer selection algorithm? Ideally, only one page would be used in the buffer cache in circumstances where the table to be scanned is (significantly?) larger than the size of the buffer cache.

- Luke
"Luke Lonergan" <llonergan@greenplum.com> writes: > The issue is summarized like this: the buffer cache in PGSQL is not "scan > resistant" as advertised. Sure it is. As near as I can tell, your real complaint is that the bufmgr doesn't attempt to limit its usage footprint to fit in L2 cache; which is hardly surprising considering it doesn't know the size of L2 cache. That's not a consideration that we've ever taken into account. I'm also less than convinced that it'd be helpful for a big seqscan: won't reading a new disk page into memory via DMA cause that memory to get flushed from the processor cache anyway? I wonder whether your numbers are explained by some other consideration than you think. regards, tom lane
<p><font size="2">When we instrument the page selections made within the buffer cache, they are sequential and span the entireaddress space of the cache.<br /><br /> With respect to whether it's L2, it's a conclusion based on the experimentalresults. It's not the TLB, as we also tested for the 512 entries for each L2.<br /><br /> One thing I left outof the previous post: the difference between fast and slow behavior was that in the fast case, the buffer selection alternatedbetween two buffer pages. This was the case only when the preceding statement was a SELECT and the statement wasVACUUM.<br /><br /> - Luke<br /><br /> Msg is shrt cuz m on ma treo<br /><br /> -----Original Message-----<br /> From: Tom Lane [<a href="mailto:tgl@sss.pgh.pa.us">mailto:tgl@sss.pgh.pa.us</a>]<br /> Sent: Sunday, March 04, 2007 08:36PM Eastern Standard Time<br /> To: Luke Lonergan<br /> Cc: PGSQL Hackers; Doug Rady; Sherry Moore<br /> Subject: Re: [HACKERS] Bug: Buffer cache is not scan resistant<br /><br /> "Luke Lonergan" <llonergan@greenplum.com>writes:<br /> > The issue is summarized like this: the buffer cache in PGSQL is not "scan<br/> > resistant" as advertised.<br /><br /> Sure it is. As near as I can tell, your real complaint is that the<br/> bufmgr doesn't attempt to limit its usage footprint to fit in L2 cache;<br /> which is hardly surprising consideringit doesn't know the size of L2<br /> cache. That's not a consideration that we've ever taken into account.<br/><br /> I'm also less than convinced that it'd be helpful for a big seqscan:<br /> won't reading a new disk pageinto memory via DMA cause that memory to<br /> get flushed from the processor cache anyway? I wonder whether your<br/> numbers are explained by some other consideration than you think.<br /><br /> regards,tom lane<br /><br /></font>
<p><font size="2">One more thing: the L2 is invalidated when re-written from the kernel IO cache, but the pages addressedin L2 retain their values when 'writeen thru' which allows the new data to be re-used up the executor chain.<br/><br /> - Luke<br /><br /> Msg is shrt cuz m on ma treo<br /><br /> -----Original Message-----<br /> From: TomLane [<a href="mailto:tgl@sss.pgh.pa.us">mailto:tgl@sss.pgh.pa.us</a>]<br /> Sent: Sunday, March 04, 2007 08:36 PM EasternStandard Time<br /> To: Luke Lonergan<br /> Cc: PGSQL Hackers; Doug Rady; Sherry Moore<br /> Subject: Re: [HACKERS] Bug: Buffer cache is not scan resistant<br /><br /> "Luke Lonergan" <llonergan@greenplum.com>writes:<br /> > The issue is summarized like this: the buffer cache in PGSQL is not "scan<br/> > resistant" as advertised.<br /><br /> Sure it is. As near as I can tell, your real complaint is that the<br/> bufmgr doesn't attempt to limit its usage footprint to fit in L2 cache;<br /> which is hardly surprising consideringit doesn't know the size of L2<br /> cache. That's not a consideration that we've ever taken into account.<br/><br /> I'm also less than convinced that it'd be helpful for a big seqscan:<br /> won't reading a new disk pageinto memory via DMA cause that memory to<br /> get flushed from the processor cache anyway? I wonder whether your<br/> numbers are explained by some other consideration than you think.<br /><br /> regards,tom lane<br /><br /></font>
Tom Lane wrote:
> "Luke Lonergan" <llonergan@greenplum.com> writes:
>> The issue is summarized like this: the buffer cache in PGSQL is not "scan
>> resistant" as advertised.
>
> Sure it is. As near as I can tell, your real complaint is that the
> bufmgr doesn't attempt to limit its usage footprint to fit in L2 cache;
> which is hardly surprising considering it doesn't know the size of L2
> cache. That's not a consideration that we've ever taken into account.

To add a little to this - forgetting the scan resistant point for the moment... cranking down shared_buffers to be smaller than the L2 cache seems to help *any* sequential scan immensely, even on quite modest HW.

E.g. PIII 1.26GHz, 512KB L2 cache, 2G RAM; SELECT count(*) FROM lineitem (which is about 11GB) performance:

Shared_buffers   Elapsed
--------------   -------
400MB            101 s
128KB             74 s

When I've profiled this activity, I've seen a lot of time spent searching for/allocating a new buffer for each page being fetched. Obviously having less of them to search through will help, but having less than the L2 cache-size worth of 'em seems to help a whole lot!

Cheers

Mark
On Mon, 5 Mar 2007, Mark Kirkwood wrote:

> To add a little to this - forgetting the scan resistant point for the
> moment... cranking down shared_buffers to be smaller than the L2 cache
> seems to help *any* sequential scan immensely, even on quite modest HW.
>
> E.g. PIII 1.26GHz, 512KB L2 cache, 2G RAM; SELECT count(*) FROM lineitem
> (which is about 11GB) performance:
>
> Shared_buffers   Elapsed
> --------------   -------
> 400MB            101 s
> 128KB             74 s
>
> When I've profiled this activity, I've seen a lot of time spent
> searching for/allocating a new buffer for each page being fetched.
> Obviously having less of them to search through will help, but having
> less than the L2 cache-size worth of 'em seems to help a whole lot!

Could you demonstrate that point by showing us timings for shared_buffers sizes from 512K up to, say, 2 MB? The two numbers you give there might just have to do with managing a large buffer.

Thanks,

Gavin
Gavin, Mark,

> Could you demonstrate that point by showing us timings for
> shared_buffers sizes from 512K up to, say, 2 MB? The two
> numbers you give there might just have to do with managing a
> large buffer.

I suggest two experiments that we've already done:

1) Increase shared_buffers to double the L2 cache size; you should see that the behavior reverts to the "slow" performance and is constant at larger sizes.

2) Instrument the calls to BufferGetPage() (a macro) and note that the buffer block numbers returned increase sequentially during scans of tables larger than the buffer size.

- Luke
Gavin Sherry <swm@alcove.com.au> writes:
> Could you demonstrate that point by showing us timings for shared_buffers
> sizes from 512K up to, say, 2 MB? The two numbers you give there might
> just have to do with managing a large buffer.

Using PG CVS HEAD on 64-bit Intel Xeon (1MB L2 cache), Fedora Core 5, I don't measure any noticeable difference in seqscan speed for shared_buffers set to 32MB or 256kB. I note that the code would not let me choose the latter setting without a large decrease in max_connections, which might be expected to cause some performance changes in itself.

Now this may only prove that the disk subsystem on this machine is too cheap to let the system show any CPU-related issues. I'm seeing a scan rate of about 43MB/sec for both count(*) and plain ol' "wc", which is a factor of 4 or so less than Mark's numbers suggest... but "top" shows CPU usage of less than 5%, so even with a 4x faster disk I'd not really expect that CPU speed would become interesting. (This is indeed a milestone, btw, because it wasn't so long ago that count(*) was nowhere near disk speed.)

			regards, tom lane
Grzegorz Jaskiewicz <gj@pointblue.com.pl> writes:
> On Mar 5, 2007, at 2:36 AM, Tom Lane wrote:
>> I'm also less than convinced that it'd be helpful for a big seqscan:
>> won't reading a new disk page into memory via DMA cause that memory to
>> get flushed from the processor cache anyway?

> Nope. DMA is writing directly into main memory. If the area was in
> the L2/L1 cache, it will get invalidated. But if it isn't there, it
> is okay.

So either way, it isn't in processor cache after the read. So how can there be any performance benefit?

			regards, tom lane
> So either way, it isn't in processor cache after the read.
> So how can there be any performance benefit?

It's the copy from the kernel IO cache to the buffer cache that is L2 sensitive. When the shared buffer cache is polluted, it thrashes the L2 cache. When the number of pages being written to in the kernel->user space writes fits in L2, then the L2 lines are "written through" (see the link below, page 264, for the write-combining features of the Opteron, for example) and the writes to main memory are deferred.

http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF

- Luke
Hi Tom,

> Now this may only prove that the disk subsystem on this
> machine is too cheap to let the system show any CPU-related
> issues.

Try it with a warm IO cache. As I posted before, we see double the performance of a VACUUM from a table in IO cache when the shared buffer cache isn't being polluted. The speed with a large buffer cache should be about 450 MB/s and the speed with a buffer cache smaller than L2 should be about 800 MB/s.

The real issue here isn't the L2 behavior, though that's important when trying to reach very high IO speeds; the issue is that we're seeing the buffer cache pollution in the first place. When we instrument the blocks selected by the buffer page selection algorithm, we see that they iterate sequentially, filling the shared buffer cache. That's the source of the problem here.

Do we have a regression test somewhere for this?

- Luke
On Mar 5, 2007, at 2:36 AM, Tom Lane wrote:

> I'm also less than convinced that it'd be helpful for a big seqscan:
> won't reading a new disk page into memory via DMA cause that memory to
> get flushed from the processor cache anyway?

Nope. DMA is writing directly into main memory. If the area was in the L2/L1 cache, it will get invalidated. But if it isn't there, it is okay.

--
Grzegorz Jaskiewicz
gj@pointblue.com.pl
"Luke Lonergan" <LLonergan@greenplum.com> writes: >> So either way, it isn't in processor cache after the read. >> So how can there be any performance benefit? > It's the copy from kernel IO cache to the buffer cache that is L2 > sensitive. When the shared buffer cache is polluted, it thrashes the L2 > cache. When the number of pages being written to in the kernel->user > space writes fits in L2, then the L2 lines are "written through" (see > the link below on page 264 for the write combining features of the > opteron for example) and the writes to main memory are deferred. That makes absolutely zero sense. The data coming from the disk was certainly not in processor cache to start with, and I hope you're not suggesting that it matters whether the *target* page of a memcpy was already in processor cache. If the latter, it is not our bug to fix. > http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/ > 25112.PDF Even granting that your conclusions are accurate, we are not in the business of optimizing Postgres for a single CPU architecture. regards, tom lane
Hi Tom,

> Even granting that your conclusions are accurate, we are not
> in the business of optimizing Postgres for a single CPU architecture.

I think you're missing my/our point:

The Postgres shared buffer cache algorithm appears to have a bug. When there is a sequential scan the blocks are filling the entire shared buffer cache. This should be "fixed".

My proposal for a fix: ensure that when relations larger (much larger?) than buffer cache are scanned, they are mapped to a single page in the shared buffer cache.

- Luke
Luke Lonergan wrote:
> The Postgres shared buffer cache algorithm appears to have a bug. When
> there is a sequential scan the blocks are filling the entire shared
> buffer cache. This should be "fixed".
>
> My proposal for a fix: ensure that when relations larger (much larger?)
> than buffer cache are scanned, they are mapped to a single page in the
> shared buffer cache.

It's not that simple. Using the whole buffer cache for a single seqscan is OK if there's currently no better use for the buffer cache. Running a single select will indeed use the whole cache, but if you run any other smaller queries, the pages they need should stay in cache and the seqscan will loop through the other buffers.

In fact, the pages that are left in the cache after the seqscan finishes would be useful for the next seqscan of the same table if we were smart enough to read those pages first. That'd make a big difference for seqscanning a table that's, say, 1.5x your RAM size. Hmm, I wonder if Jeff's sync seqscan patch addresses that.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Mon, 2007-03-05 at 03:51, Luke Lonergan wrote:
> The Postgres shared buffer cache algorithm appears to have a bug. When
> there is a sequential scan the blocks are filling the entire shared
> buffer cache. This should be "fixed".
>
> My proposal for a fix: ensure that when relations larger (much larger?)
> than buffer cache are scanned, they are mapped to a single page in the
> shared buffer cache.

How will this approach play together with the synchronized scan patches? Or should synchronized scan rely on the system's cache only?

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
"Luke Lonergan" <LLonergan@greenplum.com> writes: > I think you're missing my/our point: > The Postgres shared buffer cache algorithm appears to have a bug. When > there is a sequential scan the blocks are filling the entire shared > buffer cache. This should be "fixed". No, this is not a bug; it is operating as designed. The point of the current bufmgr algorithm is to replace the page least recently used, and that's what it's doing. If you want to lobby for changing the algorithm, then you need to explain why one test case on one platform justifies de-optimizing for a lot of other cases. In almost any concurrent-access situation I think that what you are suggesting would be a dead loss --- for instance we might as well forget about Jeff Davis' synchronized-scan work. In any case, I'm still not convinced that you've identified the problem correctly, because your explanation makes no sense to me. How can the processor's L2 cache improve access to data that it hasn't got yet? regards, tom lane
* Tom Lane:

> That makes absolutely zero sense. The data coming from the disk was
> certainly not in processor cache to start with, and I hope you're not
> suggesting that it matters whether the *target* page of a memcpy was
> already in processor cache. If the latter, it is not our bug to fix.

Uhm, if it's not in the cache, you typically need to evict some cache lines to make room for the data, so I'd expect an indirect performance hit. I could be mistaken, though.

--
Florian Weimer              <fweimer@bfk.de>
BFK edv-consulting GmbH     http://www.bfk.de/
Kriegsstraße 100            tel: +49-721-96201-1
D-76133 Karlsruhe           fax: +49-721-96201-99
On Mon, 2007-03-05 at 04:15, Tom Lane wrote:
> "Luke Lonergan" <LLonergan@greenplum.com> writes:
> > The Postgres shared buffer cache algorithm appears to have a bug. When
> > there is a sequential scan the blocks are filling the entire shared
> > buffer cache. This should be "fixed".
>
> No, this is not a bug; it is operating as designed.

Maybe he means that there is an oversight (aka "bug") in the design ;)

> The point of the
> current bufmgr algorithm is to replace the page least recently used,
> and that's what it's doing.
>
> If you want to lobby for changing the algorithm, then you need to
> explain why one test case on one platform justifies de-optimizing
> for a lot of other cases.

If you know beforehand that you will definitely overflow the cache and not reuse it anytime soon, then it seems quite reasonable to not even start polluting the cache. Especially if you get a noticeable boost in performance while doing so.

> In almost any concurrent-access situation
> I think that what you are suggesting would be a dead loss

Only if the concurrent access pattern is over data mostly fitting in the buffer cache. If we can avoid polluting the buffer cache with data we know we will use only once, more useful data will be available.

> --- for
> instance we might as well forget about Jeff Davis' synchronized-scan
> work.

Depends on the ratio of system cache to shared buffer cache. I don't think Jeff's patch is anywhere near the point where it needs to start worrying about data swapping between the system cache and shared buffers, or L2 cache usage.

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
> > The Postgres shared buffer cache algorithm appears to have a bug.
> > When there is a sequential scan the blocks are filling the entire
> > shared buffer cache. This should be "fixed".
>
> No, this is not a bug; it is operating as designed. The
> point of the current bufmgr algorithm is to replace the page
> least recently used, and that's what it's doing.

At least we've established that for certain.

> If you want to lobby for changing the algorithm, then you
> need to explain why one test case on one platform justifies
> de-optimizing for a lot of other cases. In almost any
> concurrent-access situation I think that what you are
> suggesting would be a dead loss --- for instance we might as
> well forget about Jeff Davis' synchronized-scan work.

Instead of forgetting about it, we'd need to change it.

> In any case, I'm still not convinced that you've identified
> the problem correctly, because your explanation makes no
> sense to me. How can the processor's L2 cache improve access
> to data that it hasn't got yet?

The evidence seems to clearly indicate reduced memory writing due to an L2-related effect. The actual data shows a dramatic reduction in main memory writing when the destination of the written data fits in the L2 cache. I'll try to fit a hypothesis to explain it.

Assume you've got a warm IO cache in the OS. The heapscan algorithm now works like this:

0) Select a destination user buffer.
1) uiomove->kcopy memory from the IO cache to the user buffer:
   1A) step 1: read from kernel space
   1B) step 2: write to user space
2) The user buffer is accessed many times by the executor nodes above.

Repeat.

There are two situations we are evaluating: one where the addresses of the user buffer are scattered over a space larger than the size of L2 (caseA), and one where they are confined to the size of L2 (caseB). Note that we could also consider another situation where the addresses are scattered over a space smaller than the TLB entries mapped by the L2 cache (512 max) and larger than the size of L2, but we've tried that and it proved uninteresting.

For both cases step 1A is the same: each block (8KB) write from (1) will read from the IO cache into 128 L2 lines (64B each), evicting the previous data there.

In step 1B for caseA, the destination for the writes is mostly an address not currently mapped into the L2 cache, so 128 victim L2 lines are found (LRU), stored into, and the writes are flushed to main memory.

In step 1B for caseB, the destination for the writes is located in L2 already. The 128 L2 lines are stored into, and the write to main memory is delayed under the assumption that these lines are "hot", as they were already in L2.

I don't know enough to be sure this is the right answer, but it does fit the experimental data.

- Luke
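A minimal standalone microbenchmark (a sketch, not Postgres code; the sizes and copy count are assumptions to tune per machine) that isolates the caseA/caseB difference described above: the same 8KB memcpy loop, with the destination either scattered across an area much larger than L2 or confined to an L2-sized ring.

/*
 * Hypothesis test: copy 8KB "pages" either into a destination area much
 * larger than L2 (caseA) or into a small ring that fits in L2 (caseB).
 * Adjust SMALL_DEST to your L2 size; all sizes here are assumptions.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define PAGE_SZ    8192
#define N_COPIES   (256 * 1024)            /* ~2GB worth of page copies */
#define LARGE_DEST (512 * 1024 * 1024)     /* caseA: 512MB, >> L2 */
#define SMALL_DEST (512 * 1024)            /* caseB: 512KB, <= typical L2 */

static double run(const char *src, char *dst, size_t dst_sz)
{
    size_t npages = dst_sz / PAGE_SZ;
    clock_t t0 = clock();

    /* step 1B: write each page to the next slot, cycling sequentially */
    for (long i = 0; i < N_COPIES; i++)
        memcpy(dst + (i % npages) * PAGE_SZ, src, PAGE_SZ);
    return (double) (clock() - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
    char *src = malloc(PAGE_SZ);           /* stands in for one IO-cache page */
    char *a = malloc(LARGE_DEST);
    char *b = malloc(SMALL_DEST);

    if (!src || !a || !b)
        return 1;
    memset(src, 'x', PAGE_SZ);
    memset(a, 0, LARGE_DEST);              /* fault the pages in up front */
    memset(b, 0, SMALL_DEST);

    printf("caseA (scattered, %d MB dest): %.2f s\n",
           LARGE_DEST >> 20, run(src, a, LARGE_DEST));
    printf("caseB (ring,      %d KB dest): %.2f s\n",
           SMALL_DEST >> 10, run(src, b, SMALL_DEST));
    return 0;
}

If the hypothesis holds, caseB should run measurably faster even though both variants copy the same amount of data.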
Gavin Sherry wrote:
> On Mon, 5 Mar 2007, Mark Kirkwood wrote:
>
>> To add a little to this - forgetting the scan resistant point for the
>> moment... cranking down shared_buffers to be smaller than the L2 cache
>> seems to help *any* sequential scan immensely, even on quite modest HW:
>>
> (snipped)
>> When I've profiled this activity, I've seen a lot of time spent
>> searching for/allocating a new buffer for each page being fetched.
>> Obviously having less of them to search through will help, but having
>> less than the L2 cache-size worth of 'em seems to help a whole lot!
>
> Could you demonstrate that point by showing us timings for shared_buffers
> sizes from 512K up to, say, 2 MB? The two numbers you give there might
> just have to do with managing a large buffer.

Yeah - good point:

PIII 1.26GHz, 512KB L2 cache, 2G RAM

Test is elapsed time for: SELECT count(*) FROM lineitem

lineitem has 1535724 pages (11997 MB)

Shared Buffers   Elapsed   IO rate (from vmstat)
--------------   -------   ---------------------
400MB            101 s     122 MB/s
2MB              100 s
1MB               97 s
768KB             93 s
512KB             86 s
256KB             77 s
128KB             74 s     166 MB/s

I've added the observed IO rate for the two extreme cases (the rest can be pretty much deduced via interpolation). Note that the system will do about 220 MB/s with the now (in)famous dd test, so we have a bit of headroom (not too bad for a PIII).

Cheers

Mark
Hi Mark,

> lineitem has 1535724 pages (11997 MB)
>
> Shared Buffers   Elapsed   IO rate (from vmstat)
> --------------   -------   ---------------------
> 400MB            101 s     122 MB/s
> 2MB              100 s
> 1MB               97 s
> 768KB             93 s
> 512KB             86 s
> 256KB             77 s
> 128KB             74 s     166 MB/s
>
> I've added the observed IO rate for the two extreme cases
> (the rest can be pretty much deduced via interpolation).
>
> Note that the system will do about 220 MB/s with the now
> (in)famous dd test, so we have a bit of headroom (not too bad
> for a PIII).

What's really interesting: try this with a table that fits into the I/O cache (say half your system memory), and run VACUUM on the table. That way the effect will stand out more dramatically.

- Luke
"Luke Lonergan" <LLonergan@greenplum.com> writes: > The evidence seems to clearly indicate reduced memory writing due to an > L2 related effect. You might try using valgrind's cachegrind tool which I understand can actually emulate various processors' cache to show how efficiently code uses it. I haven't done much with it though so I don't know how applicable it would be to a large-scale effect like this. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Mark Kirkwood <markir@paradise.net.nz> writes:
> Shared Buffers   Elapsed   IO rate (from vmstat)
> --------------   -------   ---------------------
> 400MB            101 s     122 MB/s
> 2MB              100 s
> 1MB               97 s
> 768KB             93 s
> 512KB             86 s
> 256KB             77 s
> 128KB             74 s     166 MB/s

Hm, that seems to blow the "it's an L2 cache effect" theory out of the water. If it were a cache effect then there should be a performance cliff at the point where the cache size is exceeded. I see no such cliff; in fact the middle part of the curve is darn near a straight line on a log scale ...

So I'm back to asking what we're really measuring here. Buffer manager inefficiency of some sort, but what? Have you tried oprofile?

			regards, tom lane
Tom Lane wrote:
> Mark Kirkwood <markir@paradise.net.nz> writes:
>
>> Shared Buffers   Elapsed   IO rate (from vmstat)
>> --------------   -------   ---------------------
>> 400MB            101 s     122 MB/s
>> 2MB              100 s
>> 1MB               97 s
>> 768KB             93 s
>> 512KB             86 s
>> 256KB             77 s
>> 128KB             74 s     166 MB/s
>
> So I'm back to asking what we're really measuring here. Buffer manager
> inefficiency of some sort, but what? Have you tried oprofile?

Isn't the size of the shared buffer pool itself acting as a performance penalty in this case? Maybe StrategyGetBuffer() needs to make multiple passes over the buffers before the usage_count of any buffer is reduced to zero and the buffer is chosen as a replacement victim.

There is no real advantage to having a larger shared buffer pool in this particular test. A heap buffer is hardly accessed again once the seqscan passes over it. Can we try with a smaller value for BM_MAX_USAGE_COUNT and see if that has any positive impact for a large shared pool in this case?

Thanks,

Pavan
Hi Tom,

On 3/5/07 8:53 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> Hm, that seems to blow the "it's an L2 cache effect" theory out of the
> water. If it were a cache effect then there should be a performance
> cliff at the point where the cache size is exceeded. I see no such
> cliff, in fact the middle part of the curve is darn near a straight
> line on a log scale ...
>
> So I'm back to asking what we're really measuring here. Buffer manager
> inefficiency of some sort, but what? Have you tried oprofile?

How about looking at the CPU performance counters directly using cpustat:

cpustat -c BU_fill_into_L2,umask=0x1 1

This shows us how many L2 fills there are on all four cores (we use all four). Below are traces of L2 fills for both cases: in the no-pollution case we fill 27 million lines; in the pollution case we fill 44 million lines.

VACUUM orders (no buffer pollution):

51.006 1 tick 2754293
51.006 2 tick 3159565
51.006 3 tick 2971929
51.007 0 tick 3577487
52.006 1 tick 4214179
52.006 3 tick 3650193
52.006 2 tick 3905828
52.007 0 tick 3465261
53.006 1 tick 1818766
53.006 3 tick 1546018
53.006 2 tick 1709385
53.007 0 tick 1483371

And here is the case with buffer pollution:

VACUUM orders (with buffer pollution):

22.006 0 tick 1576114
22.006 1 tick 1542604
22.006 2 tick 1987366
22.006 3 tick 1784567
23.006 3 tick 2706059
23.006 2 tick 2362048
23.006 0 tick 2190719
23.006 1 tick 2088827
24.006 0 tick 2247473
24.006 1 tick 2153850
24.006 2 tick 2422730
24.006 3 tick 2758795
25.006 0 tick 2419436
25.006 1 tick 2229602
25.006 2 tick 2619333
25.006 3 tick 2712332
26.006 1 tick 1827923
26.006 2 tick 1886556
26.006 3 tick 2909746
26.006 0 tick 1467164
"Pavan Deolasee" <pavan@enterprisedb.com> writes: > Isn't the size of the shared buffer pool itself acting as a performance > penalty in this case ? May be StrategyGetBuffer() needs to make multiple > passes over the buffers before the usage_count of any buffer is reduced > to zero and the buffer is chosen as replacement victim. I read that and thought you were onto something, but it's not acting quite the way I expect. I made a quick hack in StrategyGetBuffer() to count the number of buffers it looks at before finding a victim. Running it with just 32 buffers on a large count(*), the behavior after the initial startup transient is quite odd: got buffer 2 after 4 tries got buffer 3 after 1 tries got buffer 4 after 1 tries got buffer 5 after 1 tries got buffer 6 after 1 tries got buffer 7 after 1 tries got buffer 8 after 1 tries got buffer 12 after 4 tries got buffer 14 after 2 tries got buffer 21 after 7 tries got buffer 26 after 5 tries got buffer 27 after 1 tries got buffer 31 after 4 tries got buffer 0 after 1 tries got buffer 1 after 1 tries got buffer 9 after 8 tries got buffer 10 after 1 tries got buffer 11 after 1 tries got buffer 13 after 2 tries got buffer 15 after 2 tries got buffer 16 after 1 tries got buffer 17 after 1 tries got buffer 18 after 1 tries got buffer 19 after 1 tries got buffer 20 after 1 tries got buffer 22 after 2 tries got buffer 23 after 1 tries got buffer 24 after 1 tries got buffer 25 after 1 tries got buffer 28 after 3 tries got buffer 29 after 1 tries got buffer 30 after 1 tries got buffer 2 after 4 tries got buffer 3 after 1 tries got buffer 4 after 1 tries got buffer 5 after 1 tries got buffer 6 after 1 tries got buffer 7 after 1 tries got buffer 8 after 1 tries got buffer 12 after 4 tries got buffer 14 after 2 tries got buffer 21 after 7 tries got buffer 26 after 5 tries got buffer 27 after 1 tries got buffer 31 after 4 tries got buffer 0 after 1 tries got buffer 1 after 1 tries got buffer 9 after 8 tries got buffer 10 after 1 tries got buffer 11 after 1 tries got buffer 13 after 2 tries got buffer 15 after 2 tries got buffer 16 after 1 tries got buffer 17 after 1 tries got buffer 18 after 1 tries got buffer 19 after 1 tries got buffer 20 after 1 tries got buffer 22 after 2 tries got buffer 23 after 1 tries got buffer 24 after 1 tries got buffer 25 after 1 tries got buffer 28 after 3 tries got buffer 29 after 1 tries got buffer 30 after 1 tries got buffer 2 after 4 tries got buffer 3 after 1 tries got buffer 4 after 1 tries got buffer 5 after 1 tries got buffer 6 after 1 tries got buffer 7 after 1 tries got buffer 8 after 1 tries got buffer 12 after 4 tries got buffer 14 after 2 tries got buffer 21 after 7 tries got buffer 26 after 5 tries got buffer 27 after 1 tries got buffer 31 after 4 tries got buffer 0 after 1 tries got buffer 1 after 1 tries got buffer 9 after 8 tries got buffer 10 after 1 tries got buffer 11 after 1 tries got buffer 13 after 2 tries got buffer 15 after 2 tries got buffer 16 after 1 tries got buffer 17 after 1 tries got buffer 18 after 1 tries got buffer 19 after 1 tries got buffer 20 after 1 tries got buffer 22 after 2 tries got buffer 23 after 1 tries got buffer 24 after 1 tries got buffer 25 after 1 tries got buffer 28 after 3 tries got buffer 29 after 1 tries got buffer 30 after 1 tries got buffer 2 after 4 tries got buffer 3 after 1 tries got buffer 4 after 1 tries got buffer 5 after 1 tries got buffer 6 after 1 tries got buffer 7 after 1 tries got buffer 8 after 1 tries got buffer 12 after 4 tries got buffer 14 after 2 tries 
got buffer 21 after 7 tries got buffer 26 after 5 tries got buffer 27 after 1 tries got buffer 31 after 4 tries got buffer 0 after 1 tries got buffer 1 after 1 tries got buffer 9 after 8 tries got buffer 10 after 1 tries got buffer 11 after 1 tries got buffer 13 after 2 tries got buffer 15 after 2 tries got buffer 16 after 1 tries got buffer 17 after 1 tries got buffer 18 after 1 tries got buffer 19 after 1 tries got buffer 20 after 1 tries got buffer 22 after 2 tries got buffer 23 after 1 tries got buffer 24 after 1 tries got buffer 25 after 1 tries got buffer 28 after 3 tries got buffer 29 after 1 tries got buffer 30 after 1 tries got buffer 2 after 4 tries got buffer 3 after 1 tries got buffer 4 after 1 tries got buffer 5 after 1 tries got buffer 6 after 1 tries got buffer 7 after 1 tries got buffer 8 after 1 tries got buffer 12 after 4 tries got buffer 14 after 2 tries got buffer 21 after 7 tries got buffer 26 after 5 tries got buffer 27 after 1 tries got buffer 31 after 4 tries got buffer 0 after 1 tries got buffer 1 after 1 tries got buffer 9 after 8 tries got buffer 10 after 1 tries got buffer 11 after 1 tries got buffer 13 after 2 tries got buffer 15 after 2 tries got buffer 16 after 1 tries got buffer 17 after 1 tries got buffer 18 after 1 tries got buffer 19 after 1 tries got buffer 20 after 1 tries got buffer 22 after 2 tries got buffer 23 after 1 tries got buffer 24 after 1 tries got buffer 25 after 1 tries got buffer 28 after 3 tries got buffer 29 after 1 tries got buffer 30 after 1 tries got buffer 2 after 4 tries got buffer 3 after 1 tries got buffer 4 after 1 tries got buffer 5 after 1 tries got buffer 6 after 1 tries got buffer 7 after 1 tries got buffer 8 after 1 tries got buffer 12 after 4 tries got buffer 14 after 2 tries got buffer 21 after 7 tries got buffer 26 after 5 tries got buffer 27 after 1 tries got buffer 31 after 4 tries got buffer 0 after 1 tries got buffer 1 after 1 tries got buffer 9 after 8 tries got buffer 10 after 1 tries got buffer 11 after 1 tries got buffer 13 after 2 tries got buffer 15 after 2 tries got buffer 16 after 1 tries got buffer 17 after 1 tries got buffer 18 after 1 tries got buffer 19 after 1 tries got buffer 20 after 1 tries got buffer 22 after 2 tries got buffer 23 after 1 tries got buffer 24 after 1 tries got buffer 25 after 1 tries got buffer 28 after 3 tries got buffer 29 after 1 tries got buffer 30 after 1 tries got buffer 2 after 4 tries got buffer 3 after 1 tries got buffer 4 after 1 tries got buffer 5 after 1 tries got buffer 6 after 1 tries got buffer 7 after 1 tries got buffer 8 after 1 tries got buffer 12 after 4 tries got buffer 14 after 2 tries got buffer 21 after 7 tries got buffer 26 after 5 tries got buffer 27 after 1 tries got buffer 31 after 4 tries got buffer 0 after 1 tries got buffer 1 after 1 tries got buffer 9 after 8 tries got buffer 10 after 1 tries got buffer 11 after 1 tries got buffer 13 after 2 tries got buffer 15 after 2 tries got buffer 16 after 1 tries got buffer 17 after 1 tries got buffer 18 after 1 tries got buffer 19 after 1 tries got buffer 20 after 1 tries got buffer 22 after 2 tries got buffer 23 after 1 tries got buffer 24 after 1 tries got buffer 25 after 1 tries got buffer 28 after 3 tries got buffer 29 after 1 tries got buffer 30 after 1 tries got buffer 2 after 4 tries got buffer 3 after 1 tries got buffer 4 after 1 tries got buffer 5 after 1 tries got buffer 6 after 1 tries got buffer 7 after 1 tries got buffer 8 after 1 tries got buffer 12 after 4 tries got buffer 14 after 2 tries got buffer 
21 after 7 tries got buffer 26 after 5 tries got buffer 27 after 1 tries got buffer 31 after 4 tries got buffer 0 after 1 tries got buffer 1 after 1 tries got buffer 9 after 8 tries got buffer 10 after 1 tries got buffer 11 after 1 tries got buffer 13 after 2 tries got buffer 15 after 2 tries got buffer 16 after 1 tries got buffer 17 after 1 tries got buffer 18 after 1 tries got buffer 19 after 1 tries got buffer 20 after 1 tries got buffer 22 after 2 tries got buffer 23 after 1 tries got buffer 24 after 1 tries got buffer 25 after 1 tries got buffer 28 after 3 tries got buffer 29 after 1 tries got buffer 30 after 1 tries got buffer 2 after 4 tries Yes, autovacuum is off, and bgwriter shouldn't have anything useful to do either, so I'm a bit at a loss what's going on --- but in any case, it doesn't look like we are cycling through the entire buffer space for each fetch. regards, tom lane
Tom,

> Yes, autovacuum is off, and bgwriter shouldn't have anything useful to
> do either, so I'm a bit at a loss what's going on --- but in any case,
> it doesn't look like we are cycling through the entire buffer space
> for each fetch.

I'd be happy to DTrace it, but I'm a little lost as to where to look in the kernel. I'll see if I can find someone who knows more about memory management than me (that ought to be easy).

--
Josh Berkus
PostgreSQL @ Sun
San Francisco
Tom,

On 3/5/07 8:53 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> Hm, that seems to blow the "it's an L2 cache effect" theory out of the
> water. If it were a cache effect then there should be a performance
> cliff at the point where the cache size is exceeded. I see no such
> cliff, in fact the middle part of the curve is darn near a straight
> line on a log scale ...

Here's that cliff you were looking for:

Size of Orders table: 7178MB
Blocksize: 8KB

Shared_buffers   Select Count   Vacuum
(KB)             (s)            (s)
=======================================
248              5.52           2.46
368              4.77           2.40
552              5.82           2.40
824              6.20           2.43
1232             5.60           3.59
1848             6.02           3.14
2768             5.53           4.56

All of these were run three times and the *lowest* time reported. Also, the behavior of "fast VACUUM after SELECT" begins abruptly at 1232KB of shared_buffers.

These are Opterons with 2MB of L2 cache shared between two cores.

- Luke
I wrote: > "Pavan Deolasee" <pavan@enterprisedb.com> writes: >> Isn't the size of the shared buffer pool itself acting as a performance >> penalty in this case ? May be StrategyGetBuffer() needs to make multiple >> passes over the buffers before the usage_count of any buffer is reduced >> to zero and the buffer is chosen as replacement victim. > I read that and thought you were onto something, but it's not acting > quite the way I expect. I made a quick hack in StrategyGetBuffer() to > count the number of buffers it looks at before finding a victim. > ... > Yes, autovacuum is off, and bgwriter shouldn't have anything useful to > do either, so I'm a bit at a loss what's going on --- but in any case, > it doesn't look like we are cycling through the entire buffer space > for each fetch. Nope, Pavan's nailed it: the problem is that after using a buffer, the seqscan leaves it with usage_count = 1, which means it has to be passed over once by the clock sweep before it can be re-used. I was misled in the 32-buffer case because catalog accesses during startup had left the buffer state pretty confused, so that there was no long stretch before hitting something available. With a large number of buffers, the behavior is that the seqscan fills all of shared memory with buffers having usage_count 1. Once the clock sweep returns to the first of these buffers, it will have to pass over all of them, reducing all of their counts to 0, before it returns to the first one and finds it now usable. Subsequent tries find a buffer immediately, of course, until we have again filled shared_buffers with usage_count 1 everywhere. So the problem is not so much the clock sweep overhead as that it's paid in a very nonuniform fashion: with N buffers you pay O(N) once every N reads and O(1) the rest of the time. This is no doubt slowing things down enough to delay that one read, instead of leaving it nicely I/O bound all the time. Mark, can you detect "hiccups" in the read rate using your setup? I seem to recall that we've previously discussed the idea of letting the clock sweep decrement the usage_count before testing for 0, so that a buffer could be reused on the first sweep after it was initially used, but that we rejected it as being a bad idea. But at least with large shared_buffers it doesn't sound like such a bad idea. Another issue nearby to this is whether to avoid selecting buffers that are dirty --- IIRC someone brought that up again recently. Maybe predecrement for clean buffers, postdecrement for dirty ones would be a cute compromise. regards, tom lane
Here's four more points on the curve - I'd use a "dirac delta function" for your curve fit ;-)

Shared_buffers   Select Count   Vacuum
(KB)             (s)            (s)
=======================================
248              5.52           2.46
368              4.77           2.40
552              5.82           2.40
824              6.20           2.43
1232             5.60           3.59
1848             6.02           3.14
2768             5.53           4.56
5536             6.05           3.95
8304             5.80           4.37
12456            5.86           4.12
18680            5.83           4.10
28016            6.11           4.46

WRT what you found on the selection algorithm, it might also explain the L2 effects, I think. I'm also still of the opinion that polluting the shared buffer cache for a seq scan does not make sense.

- Luke
Tom,

> I seem to recall that we've previously discussed the idea of letting the
> clock sweep decrement the usage_count before testing for 0, so that a
> buffer could be reused on the first sweep after it was initially used,
> but that we rejected it as being a bad idea. But at least with large
> shared_buffers it doesn't sound like such a bad idea.

We did discuss a number of formulas for setting buffers with different clock-sweep numbers, including ones with higher usage_count for indexes and starting numbers of 0 for large seq scans as well as vacuums. However, we didn't have any way to prove that any of these complex algorithms would result in higher performance, so we went with the simplest formula, with the idea of tinkering with it when we had more data. So maybe now's the time.

Note, though, that the current algorithm is working very, very well for OLTP benchmarks, so we'd want to be careful not to gain performance in one area at the expense of another.

In TPC-E testing, we've been able to increase shared_buffers to 10GB with beneficial performance effect (numbers posted when I have them) and even found that "taking over RAM" with the shared_buffers (a la Oracle) gave us equivalent performance to using the FS cache. (Yes, this means that with a little I/O management engineering we could contemplate discarding use of the FS cache for a net performance gain. Maybe for 8.4.)

--
Josh Berkus
PostgreSQL @ Sun
San Francisco
Tom Lane wrote:
> Nope, Pavan's nailed it: the problem is that after using a buffer, the
> seqscan leaves it with usage_count = 1, which means it has to be passed
> over once by the clock sweep before it can be re-used. [...] So the
> problem is not so much the clock sweep overhead as that it's paid in a
> very nonuniform fashion: with N buffers you pay O(N) once every N reads
> and O(1) the rest of the time. This is no doubt slowing things down
> enough to delay that one read, instead of leaving it nicely I/O bound
> all the time. Mark, can you detect "hiccups" in the read rate using
> your setup?

Cool. You posted the same analysis before I could hit the "send" button :)

I am wondering whether a seqscan would set the usage_count to 1 or to a higher value. usage_count is incremented while unpinning the buffer. Even if we use page-at-a-time mode, won't the buffer itself get pinned/unpinned every time the seqscan returns a tuple? If that's the case, the overhead would be O(BM_MAX_USAGE_COUNT * N) for every N reads.

> I seem to recall that we've previously discussed the idea of letting the
> clock sweep decrement the usage_count before testing for 0, so that a
> buffer could be reused on the first sweep after it was initially used,
> but that we rejected it as being a bad idea. But at least with large
> shared_buffers it doesn't sound like such a bad idea.

How about a smaller value for BM_MAX_USAGE_COUNT?

> Another issue nearby to this is whether to avoid selecting buffers that
> are dirty --- IIRC someone brought that up again recently. Maybe
> predecrement for clean buffers, postdecrement for dirty ones would be a
> cute compromise.

Can we use a 2-bit counter where the higher bit is set if the buffer is dirty and the lower bit is set whenever the buffer is used? The clock sweep then decrements this counter and chooses a victim with counter value 0.

ISTM that we should optimize for the large shared buffer pool case, because that will be more common in the coming days. RAM is getting cheaper every day.

Thanks,

Pavan
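A tiny compilable sketch of that 2-bit counter idea (one reading of the proposal; the flag names and aging rule are assumptions, not Postgres code): bit 1 is set when a buffer is dirtied, bit 0 when it is used, and the sweep decrements the combined value, so dirty buffers survive more passes than clean ones.

/* Demonstrate the victim order the 2-bit counter would produce. */
#include <stdio.h>

#define BUF_USED   0x1
#define BUF_DIRTY  0x2
#define NBUF       4

int main(void)
{
    unsigned counter[NBUF] = { 0, BUF_USED, BUF_DIRTY, BUF_DIRTY | BUF_USED };
    const char *desc[NBUF] = { "clean,unused", "clean,used",
                               "dirty,unused", "dirty,used" };
    int victimized[NBUF] = { 0 };

    for (int pass = 0; pass < 4; pass++)
        for (int b = 0; b < NBUF; b++)
        {
            if (victimized[b])
                continue;
            if (counter[b] == 0)
            {
                printf("pass %d: buffer %d (%s) becomes a victim\n",
                       pass, b, desc[b]);
                victimized[b] = 1;  /* a real sweep would now reuse it */
            }
            else
                counter[b]--;       /* the sweep ages the buffer */
        }
    return 0;
}

The output shows clean-unused buffers recycled first and dirty-used buffers last, which is the intended bias.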
"Pavan Deolasee" <pavan@enterprisedb.com> writes: > I am wondering whether seqscan would set the usage_count to 1 or to a higher > value. usage_count is incremented while unpinning the buffer. Even if > we use > page-at-a-time mode, won't the buffer itself would get pinned/unpinned > every time seqscan returns a tuple ? If thats the case, the overhead would > be O(BM_MAX_USAGE_COUNT * N) for every N reads. No, it's only once per page. There's a good deal of PrivateRefCount thrashing that goes on while examining the individual tuples, but the shared state only changes when we leave the page, because we hold pin continuously on the current page of a seqscan. If you don't believe me, insert some debug printouts for yourself. > How about smaller value for BM_MAX_USAGE_COUNT ? This is not relevant to the problem: we are concerned about usage count 1 versus 0, not the other end of the range. regards, tom lane
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > I seem to recall that we've previously discussed the idea of letting the > clock sweep decrement the usage_count before testing for 0, so that a > buffer could be reused on the first sweep after it was initially used, > but that we rejected it as being a bad idea. But at least with large > shared_buffers it doesn't sound like such a bad idea. I seem to recall the classic clock sweep algorithm was to just use a single bit. Either a buffer was touched recently or it wasn't. I also vaguely recall a refinement involving keeping a bitmask and shifting it right each time the clock hand comes around. So you have a precise history of which recent clock sweeps the buffer was used and which it wasn't. I think the coarseness of not caring how heavily it was used is a key part of the algorithm. By not caring if it was lightly or heavily used during the clock sweep, just that it was used at least once it avoids making sticking with incorrect deductions about things like sequential scans even if multiple sequential scans pass by. As soon as they stop seeing the buffer it immediately reacts and discards the buffer. I would check out my OS book from school but it's on another continent :( -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On Mon, 2007-03-05 at 10:46 -0800, Josh Berkus wrote:
> Tom,
>
> > I seem to recall that we've previously discussed the idea of letting the
> > clock sweep decrement the usage_count before testing for 0, so that a
> > buffer could be reused on the first sweep after it was initially used,
> > but that we rejected it as being a bad idea. But at least with large
> > shared_buffers it doesn't sound like such a bad idea.
>
> Note, though, that the current algorithm is working very, very well for OLTP
> benchmarks, so we'd want to be careful not to gain performance in one area at
> the expense of another.

Agreed.

What we should also add to the analysis is that this effect only occurs when only uniform workloads are present, like SeqScan, VACUUM or COPY. When you have lots of indexed access the scan workloads don't have as much effect on the cache pollution as we are seeing in these tests.

Itagaki-san and I were discussing in January the idea of cache-looping, whereby a process begins to reuse its own buffers in a ring of ~32 buffers. When we cycle back round, if usage_count==1 then we assume that we can reuse that buffer. This avoids cache swamping for read and write workloads, plus avoids too-frequent WAL writing for VACUUM.

It would be simple to implement the ring buffer and enable/disable it with a hint StrategyHintCyclicBufferReuse() in a similar manner to the hint VACUUM provides now.

This would maintain the beneficial behaviour for OLTP, while keeping data within the L2 cache for DSS and bulk workloads.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
"Simon Riggs" <simon@2ndquadrant.com> writes: > Itakgaki-san and I were discussing in January the idea of cache-looping, > whereby a process begins to reuse its own buffers in a ring of ~32 > buffers. When we cycle back round, if usage_count==1 then we assume that > we can reuse that buffer. This avoids cache swamping for read and write > workloads, plus avoids too-frequent WAL writing for VACUUM. > This would maintain the beneficial behaviour for OLTP, Justify that claim. It sounds to me like this would act very nearly the same as having shared_buffers == 32 ... regards, tom lane
<p><font size="2">This sounds like a good idea.<br /><br /> - Luke<br /><br /> Msg is shrt cuz m on ma treo<br /><br /> -----OriginalMessage-----<br /> From: Simon Riggs [<a href="mailto:simon@2ndquadrant.com">mailto:simon@2ndquadrant.com</a>]<br/> Sent: Monday, March 05, 2007 02:37 PM EasternStandard Time<br /> To: Josh Berkus; Tom Lane; Pavan Deolasee; Mark Kirkwood; Gavin Sherry; Luke Lonergan; PGSQLHackers; Doug Rady; Sherry Moore<br /> Cc: pgsql-hackers@postgresql.org<br /> Subject: Re: [HACKERS] Bug:Buffer cache is not scan resistant<br /><br /> On Mon, 2007-03-05 at 10:46 -0800, Josh Berkus wrote:<br /> > Tom,<br/> ><br /> > > I seem to recall that we've previously discussed the idea of letting the<br /> > > clocksweep decrement the usage_count before testing for 0, so that a<br /> > > buffer could be reused on the firstsweep after it was initially used,<br /> > > but that we rejected it as being a bad idea. But at least with large<br/> > > shared_buffers it doesn't sound like such a bad idea.<br /><br /> > Note, though, that the currentalgorithm is working very, very well for OLTP<br /> > benchmarks, so we'd want to be careful not to gain performancein one area at<br /> > the expense of another.<br /><br /> Agreed.<br /><br /> What we should also add to theanalysis is that this effect only occurs<br /> when only uniform workloads is present, like SeqScan, VACUUM or COPY.<br/> When you have lots of indexed access the scan workloads don't have as<br /> much effect on the cache pollutionas we are seeing in these tests.<br /><br /> Itakgaki-san and I were discussing in January the idea of cache-looping,<br/> whereby a process begins to reuse its own buffers in a ring of ~32<br /> buffers. When we cycle backround, if usage_count==1 then we assume that<br /> we can reuse that buffer. This avoids cache swamping for read andwrite<br /> workloads, plus avoids too-frequent WAL writing for VACUUM.<br /><br /> It would be simple to implement thering buffer and enable/disable it<br /> with a hint StrategyHintCyclicBufferReuse() in a similar manner to the<br /> hintVACUUM provides now.<br /><br /> This would maintain the beneficial behaviour for OLTP, while keeping<br /> data withinthe L2 cache for DSS and bulk workloads.<br /><br /> --<br /> Simon Riggs <br /> EnterpriseDB <a href="http://www.enterprisedb.com">http://www.enterprisedb.com</a><br/><br /><br /><br /></font>
On Mon, 2007-03-05 at 14:41 -0500, Tom Lane wrote:
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > Itagaki-san and I were discussing in January the idea of cache-looping,
> > whereby a process begins to reuse its own buffers in a ring of ~32
> > buffers. When we cycle back round, if usage_count==1 then we assume that
> > we can reuse that buffer. This avoids cache swamping for read and write
> > workloads, plus avoids too-frequent WAL writing for VACUUM.
>
> > This would maintain the beneficial behaviour for OLTP,
>
> Justify that claim. It sounds to me like this would act very nearly the
> same as having shared_buffers == 32 ...

Sure. We wouldn't set the hint for IndexScans or Inserts, only for SeqScans, VACUUM and COPY. So OLTP-only workloads would be entirely unaffected. In the presence of a mixed workload the scan tasks would have only a limited effect on the cache, maintaining performance for the response-time-critical tasks. So it's an OLTP benefit because of cache protection and WAL-flush reduction during VACUUM. As we've seen, the scan tasks look like they'll go faster with this.

The assumption that we can reuse the buffer if usage_count<=1 seems valid. If another user had requested the block, then the usage_count would be > 1, unless the buffer has been used, unpinned and then a cycle of the buffer cache had spun round, all within the time taken to process 32 blocks sequentially. We do have to reuse one of the buffers, so cyclical reuse seems like a better bet most of the time than more arbitrary block reuse, as we see in a larger cache.

Best way is to prove it though. Seems like not too much work to have a private ring data structure when the hint is enabled. The extra bookkeeping is easily going to be outweighed by the reduction in mem->L2 cache fetches. I'll do it tomorrow, if no other volunteers.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
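A rough standalone sketch of how such a private ring might behave (the BufferRing name, the usage_count test, and the stand-in allocator are assumptions, not the eventual patch): the scan fills a 32-slot ring, then cycles, reusing a slot only if nobody else has touched that buffer since.

/* Simulate ring reuse for a sequential scan over 100 pages. */
#include <stdio.h>

#define RING_SIZE 32

typedef struct
{
    int slots[RING_SIZE];    /* buffer ids owned by this scan */
    int nfilled;
    int next;
} BufferRing;

/* stand-ins for the shared buffer pool and its allocator */
static int usage_count[1024];
static int next_free = 0;

static int ring_get_buffer(BufferRing *ring)
{
    if (ring->nfilled < RING_SIZE)              /* still filling the ring */
        return ring->slots[ring->nfilled++] = next_free++;

    int buf = ring->slots[ring->next];

    if (usage_count[buf] <= 1)                  /* untouched since we used it */
    {
        ring->next = (ring->next + 1) % RING_SIZE;
        return buf;                             /* reuse: stays hot in L2 */
    }
    /* someone else wants it: leave it cached, take a fresh buffer */
    ring->slots[ring->next] = next_free++;
    int fresh = ring->slots[ring->next];

    ring->next = (ring->next + 1) % RING_SIZE;
    return fresh;
}

int main(void)
{
    BufferRing ring = {{0}, 0, 0};

    for (int page = 0; page < 100; page++)
    {
        int buf = ring_get_buffer(&ring);

        usage_count[buf] = 1;                   /* scan unpins with count 1 */
        if (page % 25 == 0)
            printf("page %3d -> buffer %d\n", page, buf);
    }
    return 0;
}

After the first 32 pages the scan touches only those same 32 buffers, so the rest of shared_buffers is never polluted, yet any buffer another backend pins gets abandoned to the main pool rather than overwritten.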
On Mon, 2007-03-05 at 03:51 -0500, Luke Lonergan wrote:
> The Postgres shared buffer cache algorithm appears to have a bug. When
> there is a sequential scan the blocks are filling the entire shared
> buffer cache. This should be "fixed".
>
> My proposal for a fix: ensure that when relations larger (much larger?)
> than buffer cache are scanned, they are mapped to a single page in the
> shared buffer cache.

I don't see why we should strictly limit sequential scans to one buffer per scan. I assume you mean one buffer per scan, but that raises these two questions:

(1) What happens when there are more seq scans than cold buffers available?
(2) What happens when two sequential scans need the same page; do we double up?

Also, the first time we access any heap page of a big table, we are very unsure whether we will access it again, regardless of whether it's part of a seq scan or not. In our current system of 4 LRU lists (depending on how many times a buffer has been referenced), we could start "more likely" pages (e.g. system catalogs, index pages) in a higher list, and heap reads from big tables in the lowest possible list. Assuming, of course, that has any benefit (frequently accessed cache pages are likely to move up in the lists very quickly anyway).

But I don't think we should eliminate caching of heap pages in big tables altogether. A few buffers might be polluted during the scan, but most of the time they will be replacing other low-priority pages (perhaps from another seq scan) and probably be replaced again quickly. If that's not happening, and it's polluting frequently-accessed pages, I agree that's a bug.

Regards,
Jeff Davis
On Mon, 2007-03-05 at 11:10 +0200, Hannu Krosing wrote:
> > My proposal for a fix: ensure that when relations larger (much larger?)
> > than buffer cache are scanned, they are mapped to a single page in the
> > shared buffer cache.
>
> How will this approach play together with the synchronized scan patches?

Thanks for considering my patch in this discussion. I will test by
turning shared_buffers down as low as I can, and see if that makes a big
difference.

> Or should synchronized scans rely on the system's cache only?

I don't know what the performance impact of that will be; still good
compared to reading from disk, but I assume much more CPU time.

Regards,
	Jeff Davis
"Simon Riggs" <simon@2ndquadrant.com> writes: > Best way is to prove it though. Seems like not too much work to have a > private ring data structure when the hint is enabled. The extra > bookeeping is easily going to be outweighed by the reduction in mem->L2 > cache fetches. I'll do it tomorrow, if no other volunteers. [ shrug... ] No one has yet proven to my satisfaction that L2 cache has anything to do with this. The notion that you can read a new disk page into a shared buffer and have that buffer still be live in the processor cache is so obviously bogus that I think there must be some other effect at work. regards, tom lane
On Mon, 2007-03-05 at 09:09 +0000, Heikki Linnakangas wrote:
> In fact, the pages that are left in the cache after the seqscan finishes
> would be useful for the next seqscan of the same table if we were smart
> enough to read those pages first. That'd make a big difference for
> seqscanning a table that's say 1.5x your RAM size. Hmm, I wonder if
> Jeff's sync seqscan patch addresses that.

Absolutely. I've got a parameter in my patch, "sync_scan_offset", that
starts a seq scan N pages before the position of the last seq scan
running on that table (or a current seq scan if there's still a scan
going).

If the last scan is not still in progress, the pages are less likely to
be in cache. If the pages are in cache, great; if not, it doesn't matter
where we start anyway.

If the last scan is still in progress, those recently-read pages are
very likely to be in cache (shared buffers or OS buffer cache).

Regards,
	Jeff Davis
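For illustration, the start-point arithmetic Jeff describes might look
roughly like this; scan_start_block is an invented helper, not the
patch's actual code:

#include <stdio.h>

/*
 * Given the block the previous/current scan of this table last reported
 * (last_pos) and the table size in blocks, back off 'offset' blocks,
 * wrapping around the table if needed.
 */
static unsigned
scan_start_block(unsigned last_pos, unsigned nblocks, unsigned offset)
{
	if (offset >= nblocks)
		return 0;			/* table smaller than the offset: start at 0 */
	return (last_pos + nblocks - offset) % nblocks;
}

int
main(void)
{
	/* last scan reported at block 500 of a 10000-block table, offset 800 */
	printf("start at block %u\n", scan_start_block(500, 10000, 800));	/* 9700 */
	printf("start at block %u\n", scan_start_block(5000, 10000, 800));	/* 4200 */
	return 0;
}

The new scan then reads forward from that block, hoping to sweep up the
trailing cache of the scan ahead of it.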
Jeff Davis <pgsql@j-davis.com> writes: > Absolutely. I've got a parameter in my patch "sync_scan_offset" that > starts a seq scan N pages before the position of the last seq scan > running on that table (or a current seq scan if there's still a scan > going). Strikes me that expressing that parameter as a percentage of shared_buffers might make it less in need of manual tuning ... regards, tom lane
On Mon, 2007-03-05 at 15:30 -0500, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > Absolutely. I've got a parameter in my patch "sync_scan_offset" that
> > starts a seq scan N pages before the position of the last seq scan
> > running on that table (or a current seq scan if there's still a scan
> > going).
>
> Strikes me that expressing that parameter as a percentage of
> shared_buffers might make it less in need of manual tuning ...

The original patch was a percentage of effective_cache_size, because in
theory it may be helpful to have this parameter larger than shared
buffers. Synchronized Scanning can take advantage of the OS buffer cache
as well.

Someone convinced me to change it to be an independent variable.

I don't have a strong opinion, but now I have three different options:
(1) Independent parameter
(2) Percentage of shared_buffers
(3) Percentage of effective_cache_size

Regards,
	Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > On Mon, 2007-03-05 at 15:30 -0500, Tom Lane wrote: >> Strikes me that expressing that parameter as a percentage of >> shared_buffers might make it less in need of manual tuning ... > The original patch was a percentage of effective_cache_size, because in > theory it may be helpful to have this parameter larger than shared > buffers. Synchronized Scannning can take advantage of OS buffer cache as > well. I didn't say you couldn't allow it to be more than 100% ;-). But basing it on effective_cache_size strikes me as a bad idea because that parameter is seldom better than a wild guess. shared_buffers at least means something. regards, tom lane
Jeff Davis wrote: > On Mon, 2007-03-05 at 15:30 -0500, Tom Lane wrote: >> Jeff Davis <pgsql@j-davis.com> writes: >>> Absolutely. I've got a parameter in my patch "sync_scan_offset" that >>> starts a seq scan N pages before the position of the last seq scan >>> running on that table (or a current seq scan if there's still a scan >>> going). >> Strikes me that expressing that parameter as a percentage of >> shared_buffers might make it less in need of manual tuning ... >> > > The original patch was a percentage of effective_cache_size, because in > theory it may be helpful to have this parameter larger than shared > buffers. Synchronized Scannning can take advantage of OS buffer cache as > well. > > Someone convinced me to change it to be an independent variable. > > I don't have a strong opinion, but now I have three different opinions: > (1) Independent parameter > (2) Percentage of shared_buffers > (3) Percentage of effective_cache_size Another approach I proposed back in December is to not have a variable like that at all, but scan the buffer cache for pages belonging to the table you're scanning to initialize the scan. Scanning all the BufferDescs is a fairly CPU and lock heavy operation, but it might be ok given that we're talking about large I/O bound sequential scans. It would require no DBA tuning and would work more robustly in varying conditions. I'm not sure where you would continue after scanning the in-cache pages. At the highest in-cache block number, perhaps. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Tom Lane wrote:
> So the problem is not so much the clock sweep overhead as that it's
> paid in a very nonuniform fashion: with N buffers you pay O(N) once
> every N reads and O(1) the rest of the time. This is no doubt slowing
> things down enough to delay that one read, instead of leaving it nicely
> I/O bound all the time. Mark, can you detect "hiccups" in the read rate
> using your setup?

I think so, here's the vmstat output for 400MB of shared_buffers during
the scan:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free  buff   cache   si   so     bi  bo    in    cs us sy id wa
 1  0    764  51772     0 1990688    0    0 120422   2  1546  1755 16 37 46  1
 1  0    764  53640     0 1988792    0    0 120422   2  1544  1446 14 40 46  1
 1  0    788  54900     0 1987564    0    0 116746  15  1470  3067 15 39 44  2
 1  0    788  52800     0 1989552    0    0 119199  20  1488  2216 14 37 47  1
 1  0    788  52372     0 1990000    0    0 122880   7  1532  1203 15 39 45  1
 1  0    788  54592     0 1987872    0    5 124928   5  1557  1058 17 38 46  0
 2  0    788  54052     0 1987836    0    0 118787   0  1500  2469 16 36 47  1
 1  0    788  52552     0 1989892    0    0 120419   0  1506  2531 15 36 48  1
 1  0    788  53452     0 1989356    0    0 119195   2  1501  1698 15 37 47  1
 1  0    788  52680     0 1989796    0    0 120424   2  1521  1610 16 37 47  1

Cheers

Mark
Mark Kirkwood <markir@paradise.net.nz> writes:
> Tom Lane wrote:
>> Mark, can you detect "hiccups" in the read rate using
>> your setup?

> I think so, here's the vmstat output for 400MB of shared_buffers during
> the scan:

Hm, not really a smoking gun there.  But just for grins, would you try
this patch and see if the numbers change?

			regards, tom lane

*** src/backend/storage/buffer/freelist.c.orig	Fri Jan  5 18:02:12 2007
--- src/backend/storage/buffer/freelist.c	Mon Mar  5 16:33:11 2007
***************
*** 104,116 ****
  		 * it; decrement the usage_count and keep scanning.
  		 */
  		LockBufHdr(buf);
- 		if (buf->refcount == 0 && buf->usage_count == 0)
- 			return buf;
  		if (buf->usage_count > 0)
  		{
  			buf->usage_count--;
! 			trycounter = NBuffers;
  		}
  		else if (--trycounter == 0)
  		{
  			/*
--- 104,116 ----
  		 * it; decrement the usage_count and keep scanning.
  		 */
  		LockBufHdr(buf);
  		if (buf->usage_count > 0)
  		{
  			buf->usage_count--;
! 			trycounter = NBuffers + 1;
  		}
+ 		if (buf->refcount == 0 && buf->usage_count == 0)
+ 			return buf;
  		else if (--trycounter == 0)
  		{
  			/*
Tom Lane wrote:
>
> Hm, not really a smoking gun there.  But just for grins, would you try
> this patch and see if the numbers change?
>

Applied to 8.2.3 (don't have lineitem loaded in HEAD yet) - no change
that I can see:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free  buff   cache   si   so     bi  bo    in    cs us sy id wa
 1  0    764  55300     0 1988032    0    0 118797  32  1532  1775 15 39 44  2
 1  0    764  54476     0 1989720    0    0 115507   9  1456  3970 15 39 45  1
 1  0    788  54540     0 1989592    0    0 121651   0  1508  3221 16 37 47  0
 1  0    788  52808     0 1991320    0    0 124109   0  1532  1236 15 38 46  0
 1  0    788  52504     0 1991784    0    0 124518   0  1547  1005 16 39 45  0
 2  0    788  54544     0 1989740    0    5 117965   5  1491  2012 15 36 47  2
 1  0    788  53596     0 1991184    0    0 120424   0  1504  1910 16 37 46  1

Elapsed time is exactly the same (101 s). Is it expected that HEAD would
behave differently?

Cheers

Mark
Mark Kirkwood <markir@paradise.net.nz> writes:
> Elapsed time is exactly the same (101 s). Is it expected that HEAD would
> behave differently?

Offhand I don't think so.  But what I wanted to see was the curve of
elapsed time vs shared_buffers?

			regards, tom lane
On Mon, 2007-03-05 at 21:03 +0000, Heikki Linnakangas wrote:
> Another approach I proposed back in December is to not have a variable
> like that at all, but scan the buffer cache for pages belonging to the
> table you're scanning to initialize the scan. Scanning all the
> BufferDescs is a fairly CPU and lock heavy operation, but it might be ok
> given that we're talking about large I/O bound sequential scans. It
> would require no DBA tuning and would work more robustly in varying
> conditions. I'm not sure where you would continue after scanning the
> in-cache pages. At the highest in-cache block number, perhaps.

I assume you're referring to this: "each backend keeps a bitmap of pages
it has processed during the scan, and read the pages in the order
they're available in cache" - which I think is a great idea. However, I
was unable to devise a good answer to all these questions at once:

* How do we attempt to maintain sequential reads on the underlying I/O
layer?
* My current implementation takes advantage of the OS buffer cache; how
could we maintain that advantage from PostgreSQL-specific cache logic?
* How do I test to see whether it actually helps in a realistic
scenario? It seems like it would help the most when scans are
progressing at different rates, but how often do people have CPU-bound
queries on tables that don't fit into physical memory (and how long
would it take for me to benchmark such a query)?

It seems like your idea is more analytical, and my current
implementation is more guesswork. I like the analytical approach, but I
don't know that we have enough information to pull it off because we're
missing what's in the OS buffer cache. The OS buffer cache is crucial to
Synchronized Scanning, because shared buffers are evicted based on a
more complex set of circumstances, whereas the OS buffer cache is
usually LRU and forms a nicer "cache trail" (upon which Synchronized
Scanning is largely based).

If you have some tests you'd like me to run, I'm planning to do some
benchmarks this week and next. I can see if my current patch holds up
under the scenarios you're worried about.

Regards,
	Jeff Davis
Simon Riggs wrote:
> On Mon, 2007-03-05 at 14:41 -0500, Tom Lane wrote:
>> "Simon Riggs" <simon@2ndquadrant.com> writes:
>>> Itagaki-san and I were discussing in January the idea of cache-looping,
>>> whereby a process begins to reuse its own buffers in a ring of ~32
>>> buffers. When we cycle back round, if usage_count==1 then we assume that
>>> we can reuse that buffer. This avoids cache swamping for read and write
>>> workloads, plus avoids too-frequent WAL writing for VACUUM.
>>> This would maintain the beneficial behaviour for OLTP,
>> Justify that claim. It sounds to me like this would act very nearly the
>> same as having shared_buffers == 32 ...
>
> Sure. We wouldn't set the hint for IndexScans or Inserts, only for
> SeqScans, VACUUM and COPY.
>
> So OLTP-only workloads would be entirely unaffected. In the presence of
> a mixed workload the scan tasks would have only a limited effect on the
> cache, maintaining performance for the response-time-critical tasks. So
> it's an OLTP benefit because of cache protection and WAL-flush reduction
> during VACUUM.
>
> As we've seen, the scan tasks look like they'll go faster with this.

But it would break the idea of letting a second seqscan follow in the
first's hot cache trail, no?

greetings, Florian Pflug
Tom Lane wrote:
>
> But what I wanted to see was the curve of
> elapsed time vs shared_buffers?
>

Of course! (let's just write that off to me being pre-coffee...)

With the patch applied:

Shared Buffers   Elapsed   vmstat IO rate
--------------   -------   --------------
400MB            101 s     122 MB/s
2MB              101 s
1MB               97 s
768KB             94 s
512KB             84 s
256KB             79 s
128KB             75 s     166 MB/s

Looks *very* similar.

Cheers

Mark
Mark Kirkwood <markir@paradise.net.nz> writes:
> Tom Lane wrote:
>> But what I wanted to see was the curve of
>> elapsed time vs shared_buffers?
> ...
> Looks *very* similar.

Yup, thanks for checking.

I've been poking into this myself.  I find that I can reproduce the
behavior to some extent even with a slow disk drive (this machine is a
dual 2.8GHz Xeon EM64T running Fedora Core 5; the dd-to-dev-null test
shows the disk read speed as 43MB/sec or so).  Test case is a
several-gig table, no indexes, fully vacuumed so that neither VACUUM nor
COUNT(*) have to do anything but seqscan as fast as they can.  Given a
*freshly started* postmaster, I see

regression=# show shared_buffers;
 shared_buffers
----------------
 128MB
(1 row)

regression=# \timing
Timing is on.
regression=# vacuum lineitem;
VACUUM
Time: 63652.333 ms
regression=# vacuum lineitem;
VACUUM
Time: 63562.303 ms
regression=# select count(*) from lineitem;
  count
----------
 10240000
(1 row)

Time: 63142.174 ms
regression=# vacuum lineitem;
VACUUM
Time: 61638.421 ms
regression=# vacuum lineitem;
VACUUM
Time: 61785.905 ms

I didn't show it here, but you can repeat the VACUUM all you want before
the SELECT, and its times are stable; and you can repeat all you want
after the SELECT, and the times are stable but a couple seconds lower.
Restart the postmaster and it goes back to the slower behavior.  (I'm
keeping autovac off so it doesn't change the results.)

I decided to get down and dirty with oprofile, and soon found that the
user-space CPU consumption is indistinguishable in both states:

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
GLOBAL_POWER_E...|
  samples|      %|
------------------
   141065 73.8193 /usr/lib/debug/lib/modules/2.6.18-1.2200.fc5/vmlinux
    26368 13.7984 /home/tgl/testversion/bin/postgres
    12765  6.6799 /libata
     2238  1.1711 /lib64/libc-2.4.so
     1112  0.5819 /dm_mod

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
GLOBAL_POWER_E...|
  samples|      %|
------------------
   113177 70.2169 /usr/lib/debug/lib/modules/2.6.18-1.2200.fc5/vmlinux
    26284 16.3070 /home/tgl/testversion/bin/postgres
    12004  7.4475 /libata
     2093  1.2985 /lib64/libc-2.4.so
      996  0.6179 /dm_mod

Inside the kernel, there's only one routine that's significantly
different:

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
samples  %        symbol name
57779    40.9591  copy_user_generic
18175    12.8841  __delay
3994      2.8313  _raw_spin_lock
2388      1.6928  put_page
2184      1.5482  mwait_idle
2083      1.4766  _raw_write_unlock
1909      1.3533  _raw_write_lock

CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)
Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)
with a unit mask of 0x01 (mandatory) count 240000
samples  %        symbol name
37437    33.0783  copy_user_generic
17891    15.8080  __delay
3372      2.9794  _raw_spin_lock
2218      1.9598  mwait_idle
2067      1.8263  _raw_write_unlock
1837      1.6231  _raw_write_lock
1531      1.3527  put_page

So that's part of the mystery: apparently copy_user_generic is coded in
such a way that it's faster to copy into memory that's already in
processor cache.
This strikes me as something that probably could/should be fixed in the
kernel; I don't see any good reason why overwriting a whole cache line
oughtn't be the same speed either way.

The other thing that was bothering me is why does the SELECT change
VACUUM's behavior?  A debugging printout added to ReadBuffer gave the
answer: after postmaster start, we see things like

read block 353094 into buffer 11386
read block 353095 into buffer 11387
read block 353096 into buffer 11388
read block 353097 into buffer 11389
read block 353098 into buffer 11390
read block 353099 into buffer 11391
read block 353100 into buffer 11392
read block 353101 into buffer 11393
read block 353102 into buffer 11394
read block 353103 into buffer 11395

and after the SELECT it behaves like

read block 336761 into buffer 9403
read block 336762 into buffer 9402
read block 336763 into buffer 9403
read block 336764 into buffer 9402
read block 336765 into buffer 9403
read block 336766 into buffer 9402
read block 336767 into buffer 9403
read block 336768 into buffer 9402
read block 336769 into buffer 9403
read block 336770 into buffer 9402

What's going on is that VACUUM puts each buffer it's finished with on
the tail of the freelist.  In the post-SELECT state, there are just two
buffers cycling through the freelist (not sure why not only one, but it
doesn't matter) and so the cache footprint is limited.  But immediately
after postmaster start, (nearly) all the buffers are in the freelist and
so they all cycle through VACUUM's usage.  In any real-world situation,
of course, the freelist is going to be nearly empty most of the time and
so I don't think this part is worth changing.

So I now concede Luke's argument that this behavior is related to L2
cache usage.  But the next question is whether we ought to change
regular seqscan's behavior to mimic VACUUM.  I'm very worried about
pessimizing other cases if we do.  ISTM there's a fairly clear case that
this might be fixable at the kernel level.  Moreover, the issue only
arises because of the way that the copy-from-kernel-buffer-to-userspace
routine behaves, which means that if we go to a regime where we stop
relying on OS caching and ask for direct DMA into our buffers, the
advantage would disappear anyway.  Lastly, I think the case where a win
is possible is fairly narrow --- as soon as you've got anything but the
one seqscan going on, it's not going to help.

			regards, tom lane
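The freelist behaviour Tom describes can be modelled in a few lines; the
following toy simulation (invented names, not bufmgr code) shows why a
fresh postmaster cycles VACUUM through every buffer while the
post-SELECT state ping-pongs between two:

#include <stdio.h>

#define NBUFFERS 8				/* tiny pool so the output is readable */

/* A simple FIFO freelist holding buffer ids. */
static int	freelist[NBUFFERS * 2];
static int	head = 0,
			tail = 0;

static void push_tail(int id) { freelist[tail++ % (NBUFFERS * 2)] = id; }
static int	pop_head(void)   { return freelist[head++ % (NBUFFERS * 2)]; }

static void
vacuum_pages(int npages)
{
	for (int i = 0; i < npages; i++)
	{
		int		id = pop_head();	/* take a victim from the freelist head */

		printf("page %d -> buffer %d\n", i, id);
		push_tail(id);			/* VACUUM returns it to the freelist tail */
	}
}

int
main(void)
{
	/* Freshly started postmaster: every buffer is on the freelist, so
	 * VACUUM cycles through all of them (large cache footprint). */
	for (int id = 0; id < NBUFFERS; id++)
		push_tail(id);
	puts("after startup:");
	vacuum_pages(12);

	/* After a SELECT has consumed the freelist, suppose only two buffers
	 * were freed back: VACUUM now ping-pongs between those two. */
	head = tail;				/* empty the list */
	push_tail(3);
	push_tail(4);
	puts("after a SELECT:");
	vacuum_pages(12);
	return 0;
}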
"Tom Lane" <tgl@sss.pgh.pa.us> writes: > I don't see any good reason why overwriting a whole cache line oughtn't be > the same speed either way. I can think of a couple theories, but I don't know if they're reasonable. The one the comes to mind is the inter-processor cache coherency protocol. When writing to a cache line the processor already owns maybe it can skip having to check for other processors owning that cache line? What happens if VACUUM comes across buffers that *are* already in the buffer cache. Does it throw those on the freelist too? That seems like it would be dangerous if they were in the buffer cache for a reason. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
<p><font size="2">Hi Tom,<br /><br /> Good info - it's the same in Solaris, the routine is uiomove (Sherry wrote it).<br/><br /><br /> - Luke<br /><br /> Msg is shrt cuz m on ma treo<br /><br /> -----Original Message-----<br /> From: Tom Lane [<a href="mailto:tgl@sss.pgh.pa.us">mailto:tgl@sss.pgh.pa.us</a>]<br /> Sent: Monday, March 05, 2007 07:43PM Eastern Standard Time<br /> To: Mark Kirkwood<br /> Cc: Pavan Deolasee; Gavin Sherry; Luke Lonergan; PGSQLHackers; Doug Rady; Sherry Moore<br /> Subject: Re: [HACKERS] Bug: Buffer cache is not scan resistant<br /><br/> Mark Kirkwood <markir@paradise.net.nz> writes:<br /> > Tom Lane wrote:<br /> >> But what I wantedto see was the curve of<br /> >> elapsed time vs shared_buffers?<br /> > ...<br /> > Looks *very* similar.<br/><br /> Yup, thanks for checking.<br /><br /> I've been poking into this myself. I find that I can reproducethe<br /> behavior to some extent even with a slow disk drive (this machine is a<br /> dual 2.8GHz Xeon EM64T runningFedora Core 5; the dd-to-dev-null test<br /> shows the disk read speed as 43MB/sec or so). Test case is a<br /> several-gigtable, no indexes, fully vacuumed so that neither VACUUM nor<br /> COUNT(*) have to do anything but seqscan asfast as they can. Given a<br /> *freshly started* postmaster, I see<br /><br /> regression=# show shared_buffers;<br /> shared_buffers<br /> ----------------<br /> 128MB<br /> (1 row)<br /><br /> regression=# \timing<br /> Timing is on.<br/> regression=# vacuum lineitem;<br /> VACUUM<br /> Time: 63652.333 ms<br /> regression=# vacuum lineitem;<br /> VACUUM<br/> Time: 63562.303 ms<br /> regression=# select count(*) from lineitem;<br /> count<br /> ----------<br /> 10240000<br/> (1 row)<br /><br /> Time: 63142.174 ms<br /> regression=# vacuum lineitem;<br /> VACUUM<br /> Time: 61638.421ms<br /> regression=# vacuum lineitem;<br /> VACUUM<br /> Time: 61785.905 ms<br /><br /> I didn't show it here,but you can repeat the VACUUM all you want before<br /> the SELECT, and its times are stable; and you can repeat allyou want<br /> after the SELECT, and the times are stable but a couple seconds lower.<br /> Restart the postmaster andit goes back to the slower behavior. 
(I'm<br /> keeping autovac off so it doesn't change the results.)<br /><br /> Idecided to get down and dirty with oprofile, and soon found that the<br /> user-space CPU consumption is indistinguishablein both states:<br /><br /> CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)<br /> CountedGLOBAL_POWER_EVENTS events (time during which processor is not stopped)<br /> with a unit mask of 0x01 (mandatory)count 240000<br /> GLOBAL_POWER_E...|<br /> samples| %|<br /> ------------------<br /> 141065 73.8193/usr/lib/debug/lib/modules/2.6.18-1.2200.fc5/vmlinux<br /> 26368 13.7984 /home/tgl/testversion/bin/postgres<br/> 12765 6.6799 /libata<br /> 2238 1.1711 /lib64/libc-2.4.so<br /> 1112 0.5819 /dm_mod<br /><br /> CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)<br /> Counted GLOBAL_POWER_EVENTSevents (time during which processor is not stopped)<br /> with a unit mask of 0x01 (mandatory) count 240000<br/> GLOBAL_POWER_E...|<br /> samples| %|<br /> ------------------<br /> 113177 70.2169 /usr/lib/debug/lib/modules/2.6.18-1.2200.fc5/vmlinux<br/> 26284 16.3070 /home/tgl/testversion/bin/postgres<br /> 12004 7.4475 /libata<br /> 2093 1.2985 /lib64/libc-2.4.so<br /> 996 0.6179 /dm_mod<br /><br /> Inside thekernel, there's only one routine that's significantly different:<br /><br /> CPU: P4 / Xeon with 2 hyper-threads, speed2793.08 MHz (estimated)<br /> Counted GLOBAL_POWER_EVENTS events (time during which processor is not stopped)<br />with a unit mask of 0x01 (mandatory) count 240000<br /> samples % symbol name<br /> 57779 40.9591 copy_user_generic<br/> 18175 12.8841 __delay<br /> 3994 2.8313 _raw_spin_lock<br /> 2388 1.6928 put_page<br/> 2184 1.5482 mwait_idle<br /> 2083 1.4766 _raw_write_unlock<br /> 1909 1.3533 _raw_write_lock<br/><br /> CPU: P4 / Xeon with 2 hyper-threads, speed 2793.08 MHz (estimated)<br /> Counted GLOBAL_POWER_EVENTSevents (time during which processor is not stopped)<br /> with a unit mask of 0x01 (mandatory) count 240000<br/> samples % symbol name<br /> 37437 33.0783 copy_user_generic<br /> 17891 15.8080 __delay<br />3372 2.9794 _raw_spin_lock<br /> 2218 1.9598 mwait_idle<br /> 2067 1.8263 _raw_write_unlock<br /> 1837 1.6231 _raw_write_lock<br /> 1531 1.3527 put_page<br /><br /> So that's part of the mystery: apparently copy_user_genericis coded in<br /> such a way that it's faster to copy into memory that's already in<br /> processor cache. This strikes me as something that probably<br /> could/should be fixed in the kernel; I don't see any good reasonwhy<br /> overwriting a whole cache line oughtn't be the same speed either way.<br /><br /> The other thing that wasbothering me is why does the SELECT change<br /> VACUUM's behavior? 
A debugging printout added to ReadBuffer gave the<br/> answer: after postmaster start, we see things like<br /><br /> read block 353094 into buffer 11386<br /> read block353095 into buffer 11387<br /> read block 353096 into buffer 11388<br /> read block 353097 into buffer 11389<br /> readblock 353098 into buffer 11390<br /> read block 353099 into buffer 11391<br /> read block 353100 into buffer 11392<br/> read block 353101 into buffer 11393<br /> read block 353102 into buffer 11394<br /> read block 353103 into buffer11395<br /><br /> and after the SELECT it behaves like<br /><br /> read block 336761 into buffer 9403<br /> read block336762 into buffer 9402<br /> read block 336763 into buffer 9403<br /> read block 336764 into buffer 9402<br /> readblock 336765 into buffer 9403<br /> read block 336766 into buffer 9402<br /> read block 336767 into buffer 9403<br />read block 336768 into buffer 9402<br /> read block 336769 into buffer 9403<br /> read block 336770 into buffer 9402<br/><br /> What's going on is that VACUUM puts each buffer it's finished with on<br /> the tail of the freelist. Inthe post-SELECT state, there are just two<br /> buffers cycling through the freelist (not sure why not only one, but it<br/> doesn't matter) and so the cache footprint is limited. But immediately<br /> after postmaster start, (nearly) allthe buffers are in the freelist and<br /> so they all cycle through VACUUM's usage. In any real-world situation,<br />of course, the freelist is going to be nearly empty most of the time and<br /> so I don't think this part is worth changing.<br/><br /> So I now concede Luke's argument that this behavior is related to L2<br /> cache usage. But the nextquestion is whether we ought to change<br /> regular seqscan's behavior to mimic VACUUM. I'm very worried about<br />pessimizing other cases if we do. ISTM there's a fairly clear case that<br /> this might be fixable at the kernel level. Moreover, the issue only<br /> arises because of the way that the copy-from-kernel-buffer-to-userspace<br /> routinebehaves, which means that if we go to a regime where we stop<br /> relying on OS caching and ask for direct DMA intoour buffers, the<br /> advantage would disappear anyway. Lastly, I think the case where a win<br /> is possible is fairlynarrow --- as soon as you've got anything but the<br /> one seqscan going on, it's not going to help.<br /><br /> regards, tom lane<br /><br /></font>
Gregory Stark <stark@enterprisedb.com> writes: > What happens if VACUUM comes across buffers that *are* already in the buffer > cache. Does it throw those on the freelist too? Not unless they have usage_count 0, in which case they'd be subject to recycling by the next clock sweep anyway. regards, tom lane
"Luke Lonergan" <LLonergan@greenplum.com> writes: > Good info - it's the same in Solaris, the routine is uiomove (Sherry > wrote it). Cool. Maybe Sherry can comment on the question whether it's possible for a large-scale-memcpy to not take a hit on filling a cache line that wasn't previously in cache? I looked a bit at the Linux code that's being used here, but it's all x86_64 assembler which is something I've never studied :-(. regards, tom lane
Tom,

On 3/5/07 7:58 PM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:

> I looked a bit at the Linux code that's being used here, but it's all
> x86_64 assembler which is something I've never studied :-(.

Here's the C wrapper routine in Solaris:

http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/move.c

Here's the x86 assembler routine for Solaris:

http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/intel/ia32/ml/copy.s

The actual uiomove routine is a simple wrapper that calls the assembler
kcopy or xcopyout routines. There are two versions (for Opteron): one
uses the NTA instructions that bypass the L2 cache on writing to avoid
L2 cache pollution, and the second writes normally - through the L2
cache. Which one is used depends on a (global) parameter based on the
size of the I/O. It is tuned to identify operations that might pollute
the L2 cache (sound familiar?).

I think what we're seeing is a generic artifact of the write-through
behavior of the cache. I wouldn't expect this to get any better with
DIRECTIO to the shared_buffers in pgsql - if we iterate over a large
number of user space buffers we'll still hit the increased L2 thrashing.

I think we're best off with a hybrid approach - when we "detect" a seq
scan larger (much larger?) than the buffer cache, we can switch into the
"cache bypass" behavior, much like the above code uses the NTA
instruction when appropriate. We can handle syncscan using a small
buffer space.

- Luke
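For readers unfamiliar with NTA: what the Solaris bypass path amounts to
on x86 can be sketched with SSE2 intrinsics. This is only an
illustration of the instruction class that code selects, not the actual
kernel routine:

/* cc -O2 -msse2 ntacopy.c */
#include <emmintrin.h>			/* SSE2 intrinsics, incl. _mm_stream_si128 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Copy using non-temporal stores, which write around the cache hierarchy
 * instead of through it. Assumes both pointers are 16-byte aligned and
 * n is a multiple of 16.
 */
static void
copy_nta(void *dst, const void *src, size_t n)
{
	const __m128i *s = (const __m128i *) src;
	__m128i    *d = (__m128i *) dst;

	for (size_t i = 0; i < n / 16; i++)
		_mm_stream_si128(d + i, _mm_load_si128(s + i));	/* bypasses L1/L2 */
	_mm_sfence();				/* drain the write-combining buffers */
}

int
main(void)
{
	size_t		n = 1 << 20;	/* 1MB */
	void	   *src,
			   *dst;

	if (posix_memalign(&src, 16, n) || posix_memalign(&dst, 16, n))
		return 1;
	memset(src, 7, n);
	copy_nta(dst, src, n);
	printf("copied: dst[0] = %d\n", ((unsigned char *) dst)[0]);
	free(src);
	free(dst);
	return 0;
}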
"Luke Lonergan" <llonergan@greenplum.com> writes: > Here's the x86 assembler routine for Solaris: > http://cvs.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/intel/ia32 > /ml/copy.s > The actual uiomove routine is a simple wrapper that calls the assembler > kcopy or xcopyout routines. There are two versions (for Opteron), one that > uses the NTA instructions that bypass the L2 cache on writing to avoid L2 > cache pollution, and the second writes normally - through the L2 cache. Hm. If it were bypassing the L2 cache then the hit would be taken when PG later tries to read the data. This is clearly not what's happening in the Linux system, but I've not seen any oprofile-equivalent data for Solaris? regards, tom lane
On Mar 5, 2007, at 11:46 AM, Josh Berkus wrote:
> Tom,
>
>> I seem to recall that we've previously discussed the idea of letting
>> the clock sweep decrement the usage_count before testing for 0, so
>> that a buffer could be reused on the first sweep after it was
>> initially used, but that we rejected it as being a bad idea. But at
>> least with large shared_buffers it doesn't sound like such a bad idea.
>
> We did discuss a number of formulas for setting buffers with different
> clock-sweep numbers, including ones with higher usage_count for indexes
> and starting numbers of 0 for large seq scans as well as vacuums.
> However, we didn't have any way to prove that any of these complex
> algorithms would result in higher performance, so we went with the
> simplest formula, with the idea of tinkering with it when we had more
> data. So maybe now's the time.
>
> Note, though, that the current algorithm is working very, very well for
> OLTP benchmarks, so we'd want to be careful not to gain performance in
> one area at the expense of another. In TPCE testing, we've been able to
> increase shared_buffers to 10GB with beneficial performance effect
> (numbers posted when I have them) and even found that "taking over RAM"
> with the shared_buffers (a la Oracle) gave us equivalent performance to
> using the FS cache. (Yes, this means with a little I/O management
> engineering we could contemplate discarding use of the FS cache for a
> net performance gain. Maybe for 8.4.)

An idea I've been thinking about would be to have the bgwriter or some
other background process actually try and keep the free list populated,
so that backends needing to grab a page would be much more likely to
find one there (and not have to wait to scan through the entire buffer
pool, perhaps multiple times).

My thought is to keep track of how many page requests occurred during a
given interval, and use that value (probably averaged over time) to
determine how many pages we'd like to see on the free list. The
background process would then run through the buffers decrementing usage
counts until it found enough for the free list.

Before putting a buffer on the free list, it would write the buffer out;
I'm not sure if it would make sense to de-associate the buffer from
whatever it had been storing or not, though. If we don't do that, it
would mean that we could pull pages back off the free list if we wanted
to. That would be helpful if the background process got a bit
over-zealous.

--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
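A toy sketch of the demand-tracking idea Jim describes, with all names,
the 0.8 smoothing constant and the 1.5x headroom factor invented purely
for illustration:

#include <stdio.h>

static double demand_avg = 0.0;	/* smoothed buffer requests per interval */
static int	free_count = 0;		/* buffers currently on the free list */

static void
bgwriter_interval(int requests)
{
	int		target;

	/* exponential moving average of recent demand */
	demand_avg = 0.8 * demand_avg + 0.2 * requests;
	target = (int) (demand_avg * 1.5) + 1;

	/* backends consumed this many; if the list ran dry they fell back
	 * to running the clock sweep themselves */
	free_count = (requests > free_count) ? 0 : free_count - requests;

	/* top the list back up: the real code would sweep, write out dirty
	 * buffers and enqueue them here */
	while (free_count < target)
		free_count++;

	printf("demand=%3d  avg=%6.1f  target=%3d  free=%3d\n",
		   requests, demand_avg, target, free_count);
}

int
main(void)
{
	int		load[] = {10, 12, 50, 55, 60, 20, 5, 5};

	for (int i = 0; i < 8; i++)
		bgwriter_interval(load[i]);
	return 0;
}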
On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:
> Another approach I proposed back in December is to not have a
> variable like that at all, but scan the buffer cache for pages
> belonging to the table you're scanning to initialize the scan.
> Scanning all the BufferDescs is a fairly CPU and lock heavy
> operation, but it might be ok given that we're talking about large
> I/O bound sequential scans. It would require no DBA tuning and
> would work more robustly in varying conditions. I'm not sure where
> you would continue after scanning the in-cache pages. At the
> highest in-cache block number, perhaps.

If there was some way to do that, it'd be what I'd vote for.

Given the partitioning of the buffer lock that Tom did, it might not be
that horrible for many cases, either, since you'd only need to scan
through one partition.

We also don't need an exact count, either. Perhaps there's some way we
could keep a counter or something...
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
Jim Nasby <decibel@decibel.org> writes: > An idea I've been thinking about would be to have the bgwriter or > some other background process actually try and keep the free list > populated, The bgwriter already tries to keep pages "just in front" of the clock sweep pointer clean. regards, tom lane
On Tue, 2007-03-06 at 00:54 +0100, Florian G. Pflug wrote:
> Simon Riggs wrote:
> But it would break the idea of letting a second seqscan follow in the
> first's hot cache trail, no?

No, but it would make it somewhat harder to achieve without direct
synchronization between scans. It could still work; let's see.

I'm not sure that's an argument against fixing the problem with the
buffer strategy, though. We really want both, not just one or the other.

--
 Simon Riggs
 EnterpriseDB   http://www.enterprisedb.com
Hi Tom,

Sorry about the delay. I have been away from computers all day.

In the current Solaris release in development (code name Nevada,
available for download at http://opensolaris.org), I have implemented
non-temporal access (NTA), which bypasses L2 for most writes, and for
reads larger than copyout_max_cached (patchable, defaults to 128K). The
block size used by Postgres is 8KB. If I patch copyout_max_cached to 4KB
to trigger NTA for reads, the access times with a 16KB buffer or a 128MB
buffer are very close.

I wrote readtest to simulate the access pattern of VACUUM (attached).
tread is a 4-socket dual-core Opteron box.

<81 tread >./readtest -h
Usage: readtest [-v] [-N] -s <size> -n iter [-d delta] [-c count]
        -v: Verbose mode
        -N: Normalize results by number of reads
        -s <size>: Working set size (may specify K,M,G suffix)
        -n iter: Number of test iterations
        -f filename: Name of the file to read from
        -d [+|-]delta: Distance between subsequent reads
        -c count: Number of reads
        -h: Print this help

With copyout_max_cached at 128K (in nanoseconds, NTA not triggered):

<82 tread >./readtest -s 16k -f boot_archive
46445262
<83 tread >./readtest -s 128M -f boot_archive
118294230
<84 tread >./readtest -s 16k -f boot_archive -n 100
4230210856
<85 tread >./readtest -s 128M -f boot_archive -n 100
6343619546

With copyout_max_cached at 4K (in nanoseconds, NTA triggered):

<89 tread >./readtest -s 16k -f boot_archive
43606882
<90 tread >./readtest -s 128M -f boot_archive
100547909
<91 tread >./readtest -s 16k -f boot_archive -n 100
4251823995
<92 tread >./readtest -s 128M -f boot_archive -n 100
4205491984

When the iteration count is 1 (the default), the timing difference
between using a 16K buffer and a 128M buffer is much bigger for both
copyout_max_cached sizes, mostly due to the cost of TLB misses. When the
iteration count is bigger, most of the page tables are in the Page
Descriptor Cache for the later page accesses, so the overhead of TLB
misses becomes smaller.

As you can see, when we do bypass L2, the performance with either buffer
size is comparable.

I am sure your next question is why the 128K limitation for reads. Here
are the main reasons:

- Based on a lot of the benchmarks and workloads I traced, the target
  buffers of read operations are typically accessed again shortly after
  the read, while written buffers usually are not. Therefore, the
  default operation mode is to bypass L2 for writes, but not for reads.

- The Opteron's L1 cache size is 64K. If reads are larger than 128KB,
  they would have displacement-flushed the L1 anyway, so for large reads
  I also bypass L2. I am working on dynamically setting
  copyout_max_cached based on the L1 D-cache size on the system.

The above heuristic should have worked well in Luke's test case.
However, because the reads were done as 16,000 8K reads rather than one
128MB read, the NTA code was not triggered.

Since the OS code has to be general enough to handle most workloads, we
have to pick defaults that might not work best for some specific
operations. It is a calculated balance.

Thanks,
Sherry

On Mon, Mar 05, 2007 at 10:58:40PM -0500, Tom Lane wrote:
> "Luke Lonergan" <LLonergan@greenplum.com> writes:
> > Good info - it's the same in Solaris, the routine is uiomove (Sherry
> > wrote it).
>
> Cool. Maybe Sherry can comment on the question whether it's possible
> for a large-scale-memcpy to not take a hit on filling a cache line
> that wasn't previously in cache?
>
> I looked a bit at the Linux code that's being used here, but it's all
> x86_64 assembler which is something I've never studied :-(.
>
> 			regards, tom lane

--
Sherry Moore, Solaris Kernel Development	http://blogs.sun.com/sherrym
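Sherry's readtest itself was attached to the original mail; its core
access pattern is roughly the following simplified stand-in (not her
code), which reads a file in 8K chunks while cycling the destination
through a working set of a given size:

/* cc -O2 wsread.c && ./wsread <file> <working-set-bytes> */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define BLKSZ 8192				/* PostgreSQL-sized reads */

int
main(int argc, char **argv)
{
	size_t		wset = (argc > 2) ? strtoul(argv[2], NULL, 0) : 16 * 1024;
	char	   *buf = malloc(wset);		/* assumes wset >= BLKSZ */
	size_t		off = 0;
	int			fd;
	ssize_t		n;
	struct timespec t0,
				t1;

	if (argc < 2 || buf == NULL || (fd = open(argv[1], O_RDONLY)) < 0)
		return 1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	/* read the file in 8K chunks, cycling writes through the working set */
	while ((n = read(fd, buf + off, BLKSZ)) > 0)
	{
		off += BLKSZ;
		if (off + BLKSZ > wset)
			off = 0;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("%.0f ns\n",
		   (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec));
	close(fd);
	free(buf);
	return 0;
}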
On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote:
> On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:
> > Another approach I proposed back in December is to not have a
> > variable like that at all, but scan the buffer cache for pages
> > belonging to the table you're scanning to initialize the scan.
> > Scanning all the BufferDescs is a fairly CPU and lock heavy
> > operation, but it might be ok given that we're talking about large
> > I/O bound sequential scans. It would require no DBA tuning and
> > would work more robustly in varying conditions. I'm not sure where
> > you would continue after scanning the in-cache pages. At the
> > highest in-cache block number, perhaps.
>
> If there was some way to do that, it'd be what I'd vote for.

I still don't know how to make this take advantage of the OS buffer
cache. However, no DBA tuning is a huge advantage, I agree with that.

If I were to implement this idea, I think Heikki's bitmap of pages
already read is the way to go. Can you guys give me some pointers about
how to walk through the shared buffers, reading the pages that I need,
while being sure not to read a page that's been evicted, and also not
potentially causing a performance regression somewhere else?

> Given the partitioning of the buffer lock that Tom did it might not
> be that horrible for many cases, either, since you'd only need to
> scan through one partition.
>
> We also don't need an exact count, either. Perhaps there's some way
> we could keep a counter or something...

Exact count of what? The pages already in cache?

Regards,
	Jeff Davis
Jeff Davis <pgsql@j-davis.com> writes: > If I were to implement this idea, I think Heikki's bitmap of pages > already read is the way to go. I think that's a good way to guarantee that you'll not finish in time for 8.3. Heikki's idea is just at the handwaving stage at this point, and I'm not even convinced that it will offer any win. (Pages in cache will be picked up by a seqscan already.) regards, tom lane
On Tue, 2007-03-06 at 12:59 -0500, Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
> > If I were to implement this idea, I think Heikki's bitmap of pages
> > already read is the way to go.
>
> I think that's a good way to guarantee that you'll not finish in time
> for 8.3. Heikki's idea is just at the handwaving stage at this point,
> and I'm not even convinced that it will offer any win. (Pages in
> cache will be picked up by a seqscan already.)

I agree that it's a good idea to stick with the current implementation,
which is, as far as I can see, meeting all of my performance goals.

Regards,
	Jeff Davis
Jeff Davis wrote:
> On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote:
>> On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:
>>> Another approach I proposed back in December is to not have a
>>> variable like that at all, but scan the buffer cache for pages
>>> belonging to the table you're scanning to initialize the scan.
>>> Scanning all the BufferDescs is a fairly CPU and lock heavy
>>> operation, but it might be ok given that we're talking about large
>>> I/O bound sequential scans. It would require no DBA tuning and
>>> would work more robustly in varying conditions. I'm not sure where
>>> you would continue after scanning the in-cache pages. At the
>>> highest in-cache block number, perhaps.
>> If there was some way to do that, it'd be what I'd vote for.
>
> I still don't know how to make this take advantage of the OS buffer
> cache.

Yep, I don't see any way to do that. I think we could live with that,
though. If we went with the sync_scan_offset approach, you'd have to
leave a lot of safety margin in that as well.

> However, no DBA tuning is a huge advantage, I agree with that.
>
> If I were to implement this idea, I think Heikki's bitmap of pages
> already read is the way to go. Can you guys give me some pointers about
> how to walk through the shared buffers, reading the pages that I need,
> while being sure not to read a page that's been evicted, and also not
> potentially causing a performance regression somewhere else?

You could take a look at BufferSync, for example. It walks through the
buffer cache, syncing all dirty buffers.

FWIW, I've attached a function I wrote some time ago when I was playing
with the same idea for vacuums. A call to the new function loops through
the buffer cache and returns the next buffer that belongs to a certain
relation. I'm not sure that it's correct and safe, and there aren't many
comments, but it should work if you want to play with it...

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /home/hlinnaka/pgcvsrepository/pgsql/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.214
diff -c -r1.214 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c	5 Jan 2007 22:19:37 -0000	1.214
--- src/backend/storage/buffer/bufmgr.c	22 Jan 2007 16:38:37 -0000
***************
*** 97,102 ****
--- 97,134 ----
  static void AtProcExit_Buffers(int code, Datum arg);

+ Buffer
+ ReadAnyBufferForRelation(Relation reln)
+ {
+ 	static int	last_buf_id = 0;
+ 	int			new_buf_id;
+ 	volatile BufferDesc *bufHdr;
+
+ 	/* Make sure we will have room to remember the buffer pin */
+ 	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+ 	new_buf_id = last_buf_id;
+ 	do
+ 	{
+ 		if (++new_buf_id >= NBuffers)
+ 			new_buf_id = 0;
+
+ 		bufHdr = &BufferDescriptors[new_buf_id];
+ 		LockBufHdr(bufHdr);
+
+ 		if ((bufHdr->flags & BM_VALID) && RelFileNodeEquals(bufHdr->tag.rnode, reln->rd_node))
+ 		{
+ 			PinBuffer_Locked(bufHdr);
+ 			last_buf_id = new_buf_id;
+ 			return BufferDescriptorGetBuffer(bufHdr);
+ 		}
+ 		UnlockBufHdr(bufHdr);
+ 	} while (new_buf_id != last_buf_id);
+ 	last_buf_id = new_buf_id;
+ 	return InvalidBuffer;
+ }
+
  /*
   * ReadBuffer -- returns a buffer containing the requested
   *		block of the requested relation.  If the blknum
Tom Lane wrote:
> Jeff Davis <pgsql@j-davis.com> writes:
>> If I were to implement this idea, I think Heikki's bitmap of pages
>> already read is the way to go.
>
> I think that's a good way to guarantee that you'll not finish in time
> for 8.3. Heikki's idea is just at the handwaving stage at this point,
> and I'm not even convinced that it will offer any win. (Pages in
> cache will be picked up by a seqscan already.)

The scenario that I'm worried about is that you have a table that's
slightly larger than RAM. If you issue many seqscans on that table, one
at a time, every seqscan will have to read the whole table from disk,
even though say 90% of it is in cache when the scan starts.

This can be alleviated by using a large enough sync_scan_offset, but a
single setting like that is tricky to tune, especially if your workload
is not completely constant. Tune it too low, and you don't get much
benefit; tune it too high, and your scans diverge and you lose all
benefit.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
On Mon, 2007-03-05 at 21:34 -0800, Sherry Moore wrote:
> - Based on a lot of the benchmarks and workloads I traced, the target
>   buffers of read operations are typically accessed again shortly after
>   the read, while written buffers usually are not. Therefore, the
>   default operation mode is to bypass L2 for writes, but not for reads.

Hi Sherry,

I'm trying to relate what you've said to how we should proceed from
here. My understanding of what you've said is:

- Tom's assessment that the observed performance quirk could be fixed in
the OS kernel is correct, and you have the numbers to prove it

- currently Solaris only does NTA for 128K reads, which we don't
currently do. If we were to request 16 blocks at a time, we would get
this benefit on Solaris, at least. The copyout_max_cached parameter can
be patched, but isn't a normal system tunable.

- other workloads you've traced *do* reuse the same buffer again very
soon afterwards when reading sequentially (not writes). Reducing the
working set size is an effective technique in improving performance if
we don't have a kernel that does NTA or we don't read in big enough
chunks (we need both to get NTA to kick in).

and what you haven't said:

- all of this is orthogonal to the issue of buffer cache spoiling in
PostgreSQL itself. That issue does still exist as a non-OS issue, but
we've been discussing in detail the specific case of L2 cache effects
with specific kernel calls. All of the test results have been
stand-alone, so we've not done any measurements in that area. I say this
because you make the point that reducing the working set size of write
workloads has no effect on the L2 cache issue, but ISTM it's still
potentially a cache spoiling issue.

--
 Simon Riggs
 EnterpriseDB   http://www.enterprisedb.com
On Tue, 2007-03-06 at 18:47 +0000, Heikki Linnakangas wrote:
> Tom Lane wrote:
> > Jeff Davis <pgsql@j-davis.com> writes:
> >> If I were to implement this idea, I think Heikki's bitmap of pages
> >> already read is the way to go.
> >
> > I think that's a good way to guarantee that you'll not finish in time
> > for 8.3. Heikki's idea is just at the handwaving stage at this point,
> > and I'm not even convinced that it will offer any win. (Pages in
> > cache will be picked up by a seqscan already.)
>
> The scenario that I'm worried about is that you have a table that's
> slightly larger than RAM. If you issue many seqscans on that table, one
> at a time, every seqscan will have to read the whole table from disk,
> even though say 90% of it is in cache when the scan starts.

If you're issuing sequential scans one at a time, that 90% of the table
that was cached is probably not cached any more, unless the scans are
close together in time without overlapping (serial sequential scans).
And the problem you describe is no worse than current behavior, where
you have exactly the same problem.

> This can be alleviated by using a large enough sync_scan_offset, but a
> single setting like that is tricky to tune, especially if your workload
> is not completely constant. Tune it too low, and you don't get much
> benefit; tune it too high, and your scans diverge and you lose all
> benefit.

I see why you don't want to manually tune this setting, however it's
really not that tricky. You can be quite conservative and still use a
good fraction of your physical memory. I will come up with some numbers
and see how much we have to gain.

Regards,
	Jeff Davis
On Mar 6, 2007, at 12:17 AM, Tom Lane wrote:
> Jim Nasby <decibel@decibel.org> writes:
>> An idea I've been thinking about would be to have the bgwriter or
>> some other background process actually try and keep the free list
>> populated,
>
> The bgwriter already tries to keep pages "just in front" of the clock
> sweep pointer clean.

True, but that still means that each backend has to run the clock
sweep. AFAICT that's something that backends will serialize on (due to
BufFreelistLock), so it would be best to make StrategyGetBuffer as fast
as possible. It certainly seems like grabbing a buffer off the free
list is going to be a lot faster than running the clock sweep.

That's why I think it'd be better to have the bgwriter run the clock
sweep and put enough buffers on the free list to try and keep up with
demand.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
On Mar 6, 2007, at 10:56 AM, Jeff Davis wrote:
>> We also don't need an exact count, either. Perhaps there's some way
>> we could keep a counter or something...
>
> Exact count of what? The pages already in cache?

Yes. The idea being if you see there's 10k pages in cache, you can
likely start 9k pages behind the current scan point and still pick
everything up.

But this is nowhere near as useful as the bitmap idea, so I'd only look
at it if it's impossible to make the bitmaps work. And like others have
said, that should wait until there's at least a first-generation patch
that's going to make it into 8.3.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
On Tue, 2007-03-06 at 17:43 -0700, Jim Nasby wrote:
> On Mar 6, 2007, at 10:56 AM, Jeff Davis wrote:
> >> We also don't need an exact count, either. Perhaps there's some way
> >> we could keep a counter or something...
> >
> > Exact count of what? The pages already in cache?
>
> Yes. The idea being if you see there's 10k pages in cache, you can
> likely start 9k pages behind the current scan point and still pick
> everything up.
>
> But this is nowhere near as useful as the bitmap idea, so I'd only
> look at it if it's impossible to make the bitmaps work. And like
> others have said, that should wait until there's at least a
> first-generation patch that's going to make it into 8.3.

You still haven't told me how we take advantage of the OS buffer cache
with the bitmap idea. What makes you think that my current
implementation is "nowhere near as useful as the bitmap idea"? My
current implementation is making use of OS buffers + shared memory; the
bitmap idea can only make use of shared memory, and is likely throwing
the OS buffers away completely.

I also suspect that the bitmap idea relies too much on the idea that
there's a contiguous cache trail in the shared buffers alone. Any
deviation from that -- which could be caused by PG's page replacement
algorithm, especially in combination with a varied load pattern -- would
negate any benefit from the bitmap idea.

I feel much more confident that there will exist a trail of pages that
are cached in *either* the PG shared buffers *or* the OS buffer cache.
There may be holes/gaps in either one, but it's much more likely that
they combine into a contiguous series of cached pages.

Do you have an idea how I might test this claim?

Regards,
	Jeff Davis
On Tue, 2007-03-06 at 18:29 +0000, Heikki Linnakangas wrote:
> Jeff Davis wrote:
> > On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote:
> >> On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:
> >>> Another approach I proposed back in December is to not have a
> >>> variable like that at all, but scan the buffer cache for pages
> >>> belonging to the table you're scanning to initialize the scan.
> >>> Scanning all the BufferDescs is a fairly CPU and lock heavy
> >>> operation, but it might be ok given that we're talking about large
> >>> I/O bound sequential scans. It would require no DBA tuning and
> >>> would work more robustly in varying conditions. I'm not sure where
> >>> you would continue after scanning the in-cache pages. At the
> >>> highest in-cache block number, perhaps.
> >> If there was some way to do that, it'd be what I'd vote for.
> >
> > I still don't know how to make this take advantage of the OS buffer
> > cache.
>
> Yep, I don't see any way to do that. I think we could live with that,
> though. If we went with the sync_scan_offset approach, you'd have to
> leave a lot of safety margin in that as well.

Right, there would certainly have to be a safety margin with
sync_scan_offset. However, your plan only works when the shared buffers
are dominated by this sequential scan. Let's say you have 40% of
physical memory for shared buffers, and say that 50% are being used for
hot pages in other parts of the database. That means you have access to
only 20% of physical memory to optimize for this sequential scan, and
20% of the physical memory is basically unavailable (being used for
other parts of the database).

In my current implementation, you could set sync_scan_offset to 1.0
(meaning 1.0 x shared_buffers), giving you 40% of physical memory that
would be used for starting this sequential scan. In this case, that
should be a good margin of error, considering that as much as 80% of the
physical memory might actually be in cache (OS or PG cache).

This all needs to be backed up by testing, of course. I'm just
extrapolating some numbers that look vaguely reasonable to me.

Regards,
	Jeff Davis
<p><font size="2">Incidentally, we tried triggering NTA (L2 cache bypass) unconditionally and in various patterns and didnot see the substantial gain as with reducing the working set size.<br /><br /> My conclusion: Fixing the OS is not sufficientto alleviate the issue. We see a 2x penalty (1700MB/s versus 3500MB/s) at the higher data rates due to this effect.<br/><br /> - Luke<br /><br /> Msg is shrt cuz m on ma treo<br /><br /> -----Original Message-----<br /> From: Sherry Moore [<a href="mailto:sherry.moore@sun.com">mailto:sherry.moore@sun.com</a>]<br /> Sent: Tuesday, March 06, 200710:05 PM Eastern Standard Time<br /> To: Simon Riggs<br /> Cc: Sherry Moore; Tom Lane; Luke Lonergan; Mark Kirkwood;Pavan Deolasee; Gavin Sherry; PGSQL Hackers; Doug Rady<br /> Subject: Re: [HACKERS] Bug: Buffer cache isnot scan resistant<br /><br /> Hi Simon,<br /><br /> > and what you haven't said<br /> ><br /> > - all of thisis orthogonal to the issue of buffer cache spoiling in<br /> > PostgreSQL itself. That issue does still exist as anon-OS issue, but<br /> > we've been discussing in detail the specific case of L2 cache effects<br /> > with specifickernel calls. All of the test results have been<br /> > stand-alone, so we've not done any measurements in thatarea. I say this<br /> > because you make the point that reducing the working set size of write<br /> > workloadshas no effect on the L2 cache issue, but ISTM its still<br /> > potentially a cache spoiling issue.<br /><br/> What I wanted to point out was that (reiterating to avoid requoting),<br /><br /> - My test was simply to demonstratethat the observed performance<br /> difference with VACUUM was caused by whether the size of the<br /> user buffer caused L2 thrashing.<br /><br /> - In general, application should reduce the size of the working set<br/> to reduce the penalty of TLB misses and cache misses.<br /><br /> - If the application access patternmeets the NTA trigger condition,<br /> the benefit of reducing the working set size will be much smaller.<br/><br /> Whatever I said is probably orthogonal to the buffer cache issue you<br /> guys have been discussing,but I haven't read all the email exchange<br /> on the subject.<br /><br /> Thanks,<br /> Sherry<br /> --<br />Sherry Moore, Solaris Kernel Development <a href="http://blogs.sun.com/sherrym">http://blogs.sun.com/sherrym</a><br/><br /></font>
On Tue, 2007-03-06 at 18:28, Jeff Davis wrote:
> On Tue, 2007-03-06 at 18:29 +0000, Heikki Linnakangas wrote:
> > Jeff Davis wrote:
> > > On Mon, 2007-03-05 at 21:02 -0700, Jim Nasby wrote:
> > >> On Mar 5, 2007, at 2:03 PM, Heikki Linnakangas wrote:
> > >>> Another approach I proposed back in December is to not have a
> > >>> variable like that at all, but scan the buffer cache for pages
> > >>> belonging to the table you're scanning to initialize the scan.
> > >>> Scanning all the BufferDescs is a fairly CPU and lock heavy
> > >>> operation, but it might be ok given that we're talking about large
> > >>> I/O bound sequential scans. It would require no DBA tuning and
> > >>> would work more robustly in varying conditions. I'm not sure where
> > >>> you would continue after scanning the in-cache pages. At the
> > >>> highest in-cache block number, perhaps.
> > >> If there was some way to do that, it'd be what I'd vote for.
> > >
> > > I still don't know how to make this take advantage of the OS buffer
> > > cache.

Maybe it should not? The OS cache can mostly be of use only if it is
much bigger than the shared buffer cache. It may make sense to forget
about the OS cache and just tell those who can make use of sync scans to
set most of memory aside for shared buffers. Then we can make better
predictions/lookups about how much of a table is actually in memory.

Dual caching is usually not very beneficial anyway, not to mention the
difficulty of predicting any dual-caching effects.

> > Yep, I don't see any way to do that. I think we could live with that,
> > though. If we went with the sync_scan_offset approach, you'd have to
> > leave a lot of safety margin in that as well.
>
> Right, there would certainly have to be a safety margin with
> sync_scan_offset. However, your plan only works when the shared buffers
> are dominated by this sequential scan. Let's say you have 40% of
> physical memory for shared buffers, and say that 50% are being used for
> hot pages in other parts of the database. That means you have access to
> only 20% of physical memory to optimize for this sequential scan, and
> 20% of the physical memory is basically unavailable (being used for
> other parts of the database).

The simplest thing, in case the table is much bigger than the buffer
cache usable for it, is to start the second scan at the point the first
scan is traversing *now*, and hope that the scans will stay together. Or
start at some fixed lag, which makes the first scan always the one
issuing reads while the second just free-rides on buffers already in
cache.

It may even be a good idea to throttle the second scan to stay N pages
behind, if the OS readahead gets confused when the same file is read
from multiple processes.

If the table is smaller than the cache, then just scan it without
syncing.

Trying to read buffers in the same order, starting from near the point
where pages are still in the shared buffer cache, seems useful mostly
for the case where the table is as big as, or just a little larger than,
the cache.

> In my current implementation, you could set sync_scan_offset to 1.0
> (meaning 1.0 x shared_buffers), giving you 40% of physical memory that
> would be used for starting this sequential scan. In this case, that
> should be a good margin of error, considering that as much as 80% of the
> physical memory might actually be in cache (OS or PG cache).
>
> This all needs to be backed up by testing, of course. I'm just
> extrapolating some numbers that look vaguely reasonable to me.
If there is an easy way to tell PG "give me this page only if it is in
shared cache already", then a good approach might be to start the 2nd
scan at the point where the 1st is now, and move in both directions
simultaneously, like this:

First scan is at page N. Second scan:

    M = N-1
    WHILE NOT ALL PAGES ARE READ:
        IF PAGE N IS IN CACHE:                 -- FOLLOW FIRST READER
            READ PAGE N
            N++
        ELSE IF M >= 0 AND PAGE M IS IN CACHE: -- READ OLDER CACHED PAGES
            READ PAGE M
            M--
        ELSE IF FIRST READER STILL GOING:      -- NO OLDER PAGES, WAIT FOR 1st
            WAIT FOR PAGE N TO BECOME AVAILABLE
            READ PAGE N
            N++
        ELSE:                                  -- BECOME 1st READER
            READ PAGE N
            N++
        PROCESS PAGE
        IF N >= PAGES_IN_TABLE: N = 0
        IF M < 0: M = PAGES_IN_TABLE - 1

This should work reasonably well for LRU caches, and it may be made to
work with the clock sweep scheme if the sweep arranges pages to purge in
file order.

If we could make the IF PAGE x IS IN CACHE part also know about the OS
cache, this could also make use of the OS cache.

Do any of you know about a way to READ PAGE ONLY IF IN CACHE in *nix
systems?

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
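For concreteness, here is how the loop above might look in C, assuming a
hypothetical page_in_cache() probe of the shared buffer hash table. None
of these helper functions exist under these names, and the "all pages
are read" bookkeeping is deliberately elided:

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t BlockNumber;

    extern bool page_in_cache(BlockNumber blk);   /* hypothetical probe */
    extern void read_page(BlockNumber blk);       /* hypothetical read */
    extern void process_page(BlockNumber blk);
    extern bool first_reader_running(void);
    extern void wait_for_page(BlockNumber blk);

    static void
    follow_scan(BlockNumber start, BlockNumber nblocks)
    {
        int64_t n = start;                 /* forward cursor, follows leader */
        int64_t m = (int64_t) start - 1;   /* backward cursor, older pages */
        BlockNumber to_read = nblocks;     /* real code must track which
                                              pages were read, not just count */

        while (to_read-- > 0)
        {
            BlockNumber blk;

            if (page_in_cache((BlockNumber) n))
                blk = (BlockNumber) n++;              /* follow first reader */
            else if (m >= 0 && page_in_cache((BlockNumber) m))
                blk = (BlockNumber) m--;              /* drain older cached pages */
            else if (first_reader_running())
            {
                wait_for_page((BlockNumber) n);       /* wait for the leader */
                blk = (BlockNumber) n++;
            }
            else
                blk = (BlockNumber) n++;              /* become the leader */

            read_page(blk);
            process_page(blk);

            if (n >= (int64_t) nblocks)
                n = 0;                                /* wrap forward cursor */
            if (m < 0)
                m = (int64_t) nblocks - 1;            /* wrap backward cursor */
        }
    }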
On 3/7/07, Hannu Krosing <hannu@skype.net> wrote:
> Do any of you know about a way to READ PAGE ONLY IF IN CACHE in *nix
> systems ?

Supposedly you could mmap() a file and then do mincore() on the area to
see which pages are cached. But you were talking about the postgres
cache before; there it should be easily implementable.

--
marko
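Marko's mmap()/mincore() suggestion can be sketched directly. The snippet
below assumes the Linux prototype of mincore() (BSD declares the vector
as char *) and abbreviates error handling:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int
    main(int argc, char **argv)
    {
        int         fd;
        struct stat st;
        long        pagesz = sysconf(_SC_PAGESIZE);
        size_t      npages, resident = 0;
        void       *base;
        unsigned char *vec;

        if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0 ||
            fstat(fd, &st) < 0)
            return 1;

        /* Map the file read-only; no pages are faulted in by this alone. */
        npages = (st.st_size + pagesz - 1) / pagesz;
        base = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED)
            return 1;

        /* Ask the kernel which of the mapped pages are resident. */
        vec = malloc(npages);
        if (mincore(base, st.st_size, vec) == 0)
        {
            for (size_t i = 0; i < npages; i++)
                if (vec[i] & 1)     /* low bit set => page is resident */
                    resident++;
            printf("%zu of %zu pages resident in OS cache\n",
                   resident, npages);
        }

        munmap(base, st.st_size);
        free(vec);
        close(fd);
        return 0;
    }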
Hi Simon,

> and what you haven't said
>
> - all of this is orthogonal to the issue of buffer cache spoiling in
> PostgreSQL itself. That issue does still exist as a non-OS issue, but
> we've been discussing in detail the specific case of L2 cache effects
> with specific kernel calls. All of the test results have been
> stand-alone, so we've not done any measurements in that area. I say this
> because you make the point that reducing the working set size of write
> workloads has no effect on the L2 cache issue, but ISTM it's still
> potentially a cache spoiling issue.

What I wanted to point out was that (reiterating to avoid requoting):

- My test was simply to demonstrate that the observed performance
  difference with VACUUM was caused by whether the size of the user
  buffer caused L2 thrashing.

- In general, applications should reduce the size of the working set to
  reduce the penalty of TLB misses and cache misses.

- If the application access pattern meets the NTA trigger condition, the
  benefit of reducing the working set size will be much smaller.

Whatever I said is probably orthogonal to the buffer cache issue you guys
have been discussing, but I haven't read all the email exchange on the
subject.

Thanks,
Sherry
--
Sherry Moore, Solaris Kernel Development        http://blogs.sun.com/sherrym
On Tue, 2007-03-06 at 22:32 -0500, Luke Lonergan wrote:
> Incidentally, we tried triggering NTA (L2 cache bypass)
> unconditionally and in various patterns and did not see the
> substantial gain as with reducing the working set size.
>
> My conclusion: Fixing the OS is not sufficient to alleviate the issue.
> We see a 2x penalty (1700MB/s versus 3500MB/s) at the higher data
> rates due to this effect.

I've implemented buffer recycling, as previously described, patch being
posted now to -patches as "scan_recycle_buffers".

This version includes buffer recycling

- for SeqScans larger than shared buffers, with the objective of
improving L2 cache efficiency *and* reducing the effects of shared
buffer cache spoiling (both as previously discussed on this thread)

- for VACUUMs of any size, with the objective of reducing WAL thrashing
whilst keeping VACUUM's behaviour of not spoiling the buffer cache (as
originally suggested by Itagaki-san, just with a different
implementation).

Behaviour is not activated by default in this patch. To request buffer
recycling, set the USERSET GUC
   SET scan_recycle_buffers = N
tested with 1, 4, 8, 16, but only N > 8 seems sensible, IMHO.

The patch affects StrategyGetBuffer, so it only affects the disk->cache
path. The idea is that if a page is already in the shared buffer cache
then we get substantial benefit already and nothing else is needed. So
for the general case, the patch adds a single if-test into the I/O path.

The parameter is picked up at the start of SeqScan and VACUUM
(currently). Any change mid-scan will be ignored.

IMHO it's possible to do this and to allow Synch Scans at the same time,
with some thought. There is no need for us to rely on the cache spoiling
behaviour of scans to implement that feature as well.

Independent performance tests requested, so that we can discuss this
objectively.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
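For readers following along, the recycling idea can be sketched as a
small private ring of buffers that a scan reuses in rotation, so a large
seq scan touches only N buffers instead of sweeping the whole cache. This
is an illustration only, not Simon's actual patch; all names
(RecycleRing, recycle_get_buffer, alloc_buffer_from_pool,
buffer_is_pinned) are hypothetical:

    #include <stdbool.h>

    #define MAX_RING 128    /* assume scan_recycle_buffers <= MAX_RING */

    typedef int Buffer;

    extern Buffer alloc_buffer_from_pool(void);  /* hypothetical: normal clock sweep */
    extern bool   buffer_is_pinned(Buffer buf);  /* hypothetical: in use elsewhere? */

    typedef struct RecycleRing
    {
        Buffer  bufs[MAX_RING];
        int     size;       /* scan_recycle_buffers; 0 disables recycling */
        int     filled;     /* how many slots hold a buffer so far */
        int     next;       /* next slot to reuse */
    } RecycleRing;

    static Buffer
    recycle_get_buffer(RecycleRing *ring)
    {
        Buffer  buf;

        /* Disabled, or ring not yet full: fall back to the normal
         * allocator and remember the buffer for later reuse. */
        if (ring->size == 0 || ring->filled < ring->size)
        {
            buf = alloc_buffer_from_pool();
            if (ring->size > 0)
                ring->bufs[ring->filled++] = buf;
            return buf;
        }

        /* Ring full: reuse the oldest buffer in rotation, unless someone
         * else has it pinned, in which case swap in a fresh buffer. */
        buf = ring->bufs[ring->next];
        if (buffer_is_pinned(buf))
        {
            buf = alloc_buffer_from_pool();
            ring->bufs[ring->next] = buf;
        }
        ring->next = (ring->next + 1) % ring->size;
        return buf;
    }

With scan_recycle_buffers = 16 and 8KB pages, a scan's cache footprint is
128KB, comfortably inside the 1MB-per-core L2 cache measured earlier in
the thread.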
<p><font size="2">Cool!<br /><br /> - Luke<br /><br /> Msg is shrt cuz m on ma treo<br /><br /> -----Original Message-----<br/> From: Simon Riggs [<a href="mailto:simon@2ndquadrant.com">mailto:simon@2ndquadrant.com</a>]<br /> Sent: Friday, March 09, 2007 02:32 PM Eastern Standard Time<br /> To: Luke Lonergan; ITAGAKI Takahiro<br /> Cc: Sherry Moore; Tom Lane; Mark Kirkwood; Pavan Deolasee; Gavin Sherry; PGSQL Hackers; Doug Rady<br /> Subject: Re: [HACKERS] Bug: Buffer cache is not scan resistant<br /><br /> On Tue, 2007-03-06 at 22:32 -0500, LukeLonergan wrote:<br /> > Incidentally, we tried triggering NTA (L2 cache bypass)<br /> > unconditionally and invarious patterns and did not see the<br /> > substantial gain as with reducing the working set size.<br /> ><br />> My conclusion: Fixing the OS is not sufficient to alleviate the issue.<br /> > We see a 2x penalty (1700MB/s versus3500MB/s) at the higher data<br /> > rates due to this effect.<br /> ><br /> I've implemented buffer recycling,as previously described, patch being<br /> posted now to -patches as "scan_recycle_buffers".<br /><br /> This versionincludes buffer recycling<br /><br /> - for SeqScans larger than shared buffers, with the objective of<br /> improvingL2 cache efficiency *and* reducing the effects of shared<br /> buffer cache spoiling (both as previously discussedon this thread)<br /><br /> - for VACUUMs of any size, with the objective of reducing WAL thrashing<br /> whilstkeeping VACUUM's behaviour of not spoiling the buffer cache (as<br /> originally suggested by Itagaki-san, just witha different<br /> implementation).<br /><br /> Behaviour is not activated by default in this patch. To request buffer<br/> recycling, set the USERSET GUC<br /> SET scan_recycle_buffers = N<br /> tested with 1,4,8,16, but only> 8 seems sensible, IMHO.<br /><br /> Patch effects StrategyGetBuffer, so only effects the disk->cache path.<br/> The idea is that if its already in shared buffer cache then we get<br /> substantial benefit already and nothingelse is needed. So for the<br /> general case, the patch adds a single if test into the I/O path.<br /><br /> Theparameter is picked up at the start of SeqScan and VACUUM<br /> (currently). Any change mid-scan will be ignored.<br /><br/> IMHO its possible to do this and to allow Synch Scans at the same time,<br /> with some thought. There is no needfor us to rely on cache spoiling<br /> behaviour of scans to implement that feature as well.<br /><br /> Independentperformance tests requested, so that we can discuss this<br /> objectively.<br /><br /> --<br /> Simon Riggs <br/> EnterpriseDB <a href="http://www.enterprisedb.com">http://www.enterprisedb.com</a><br /><br /><br/><br /></font>
"Simon Riggs" <simon@2ndquadrant.com> wrote: > I've implemented buffer recycling, as previously described, patch being > posted now to -patches as "scan_recycle_buffers". > > - for VACUUMs of any size, with the objective of reducing WAL thrashing > whilst keeping VACUUM's behaviour of not spoiling the buffer cache (as > originally suggested by Itagaki-san, just with a different > implementation). I tested your patch with VACUUM FREEZE. The performance was improved when I set scan_recycle_buffers > 32. I used VACUUM FREEZE to increase WAL traffic, but this patch should be useful for normal VACUUMs with backgrond jobs! N | time | WAL flush(*) -----+-------+----------- 0 | 58.7s | 0.01% 1 | 80.3s | 81.76% 8 | 73.4s | 16.73% 16 | 64.2s | 9.24% 32 | 59.0s | 4.88%64 | 56.7s | 2.63%128 | 55.1s | 1.41% (*) WAL flush is the ratio of the need of fsync to buffer recycle. # SET scan_recycle_buffers = 0; # UPDATE accounts SET aid=aid WHERE random() < 0.005; # CHECKPOINT; # SET scan_recycle_buffers = <N>; # VACUUM FREEZE accounts; BTW, does the patch change the default usage of buffer in vacuum? From what I've seen, scan_recycle_buffers = 1 is the same as before. With the default value of scan_recycle_buffers(=0), VACUUM seems to use all of buffers in pool, just like existing sequential scans. Is this intended? Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
On Mon, 2007-03-12 at 16:21 +0900, ITAGAKI Takahiro wrote:
> "Simon Riggs" <simon@2ndquadrant.com> wrote:
>
> > I've implemented buffer recycling, as previously described, patch being
> > posted now to -patches as "scan_recycle_buffers".
> >
> > - for VACUUMs of any size, with the objective of reducing WAL thrashing
> > whilst keeping VACUUM's behaviour of not spoiling the buffer cache (as
> > originally suggested by Itagaki-san, just with a different
> > implementation).
>
> I tested your patch with VACUUM FREEZE. The performance was improved when
> I set scan_recycle_buffers > 32. I used VACUUM FREEZE to increase WAL
> traffic, but this patch should be useful for normal VACUUMs with
> background jobs!

Thanks.

>    N  | time  | WAL flush(*)
>  -----+-------+-------------
>     0 | 58.7s |  0.01%
>     1 | 80.3s | 81.76%
>     8 | 73.4s | 16.73%
>    16 | 64.2s |  9.24%
>    32 | 59.0s |  4.88%
>    64 | 56.7s |  2.63%
>   128 | 55.1s |  1.41%
>
> (*) WAL flush is the fraction of buffer recycles that required a WAL fsync.

Do you have the same measurement without the patch applied? I'd be
interested in the current state as well (the N=0 path is modified for
VACUUM in this patch, too).

> # SET scan_recycle_buffers = 0;
> # UPDATE accounts SET aid=aid WHERE random() < 0.005;
> # CHECKPOINT;
> # SET scan_recycle_buffers = <N>;
> # VACUUM FREEZE accounts;

Very good results, thanks. I'd be interested in the same thing for plain
VACUUM and for varying ratios of dirty/clean blocks during vacuum.

> BTW, does the patch change the default buffer usage in vacuum? From what
> I've seen, scan_recycle_buffers = 1 is the same as before.

That was the intention.

> With the default value of scan_recycle_buffers (= 0), VACUUM seems to
> use all of the buffers in the pool, just like existing sequential
> scans. Is this intended?

Yes, but it's not very useful for testing to have done that. I'll do
another version within the hour that sets N=0 (only) back to current
behaviour for VACUUM.

One of the objectives of the patch was to prevent VACUUM from tripping up
other backends. I'm confident that it will improve that situation for
OLTP workloads running regular concurrent VACUUMs, but we will need to
wait a couple of days to get those results in as well.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
On Mon, 2007-03-12 at 09:14 +0000, Simon Riggs wrote:
> On Mon, 2007-03-12 at 16:21 +0900, ITAGAKI Takahiro wrote:
>
> > With the default value of scan_recycle_buffers (= 0), VACUUM seems
> > to use all of the buffers in the pool, just like existing sequential
> > scans. Is this intended?
>
> Yes, but it's not very useful for testing to have done that. I'll do
> another version within the hour that sets N=0 (only) back to current
> behaviour for VACUUM.

New test version enclosed, where scan_recycle_buffers = 0 doesn't change
existing VACUUM behaviour.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> I tested your patch with VACUUM FREEZE. The performance was improved when
> I set scan_recycle_buffers > 32. I used VACUUM FREEZE to increase WAL
> traffic, but this patch should be useful for normal VACUUMs with
> background jobs!

Proving that you can see a difference in a worst-case scenario is not the
same as proving that the patch is useful in normal cases.

			regards, tom lane
On Mon, 2007-03-12 at 10:30 -0400, Tom Lane wrote:
> ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> > I tested your patch with VACUUM FREEZE. The performance was improved when
> > I set scan_recycle_buffers > 32. I used VACUUM FREEZE to increase WAL
> > traffic, but this patch should be useful for normal VACUUMs with
> > background jobs!
>
> Proving that you can see a difference in a worst-case scenario is not the
> same as proving that the patch is useful in normal cases.

I agree, but I think that this VACUUM FREEZE test does actually represent
the normal case, and here's why:

The poor buffer manager behaviour occurs if the block is dirty as a
result of WAL-logged changes. It only takes the removal of a single row
for us to have WAL-logged this and dirtied the block. If we invoke VACUUM
from autovacuum, we do this by default when 20% of the rows have been
updated, which means that with many distributions of UPDATEs we'll have
touched a very large proportion of blocks before we VACUUM. That isn't
true for *all* distributions of UPDATEs, but it soon will be: the Dead
Space Map will deliver only dirty blocks for us.

So running a VACUUM FREEZE is a reasonable simulation of running a large
VACUUM on a real production system with default autovacuum enabled, as
will be the case for 8.3. It is possible that we run VACUUM when fewer
dirty blocks are generated, but this won't be the common situation and
not something we should optimise for.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
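To put a rough number on "a very large proportion of blocks" (an
illustration under stated assumptions, not a figure from the thread):
suppose updates land uniformly at random and a block holds about 100
rows. If each row has been updated with probability 0.2 by the time
autovacuum fires, the chance that a given block escaped untouched is

    0.8^100 = e^(100 ln 0.8) ≈ 2 x 10^-10

so under that uniform-update assumption essentially every block of the
table has been dirtied, which is what makes the VACUUM FREEZE test
representative.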
"Simon Riggs" <simon@2ndquadrant.com> wrote: > > > With the default > > > value of scan_recycle_buffers(=0), VACUUM seems to use all of buffers in pool, > > > just like existing sequential scans. Is this intended? > > > New test version enclosed, where scan_recycle_buffers = 0 doesn't change > existing VACUUM behaviour. This is a result with scan_recycle_buffers.v3.patch. I used normal VACUUM with background load using slowdown-ed pgbench in this instance. I believe the patch is useful in normal cases, not only for VACUUM FREEZE. N | time | WAL flush(*) -----+--------+----------- 0 | 112.8s | 44.3% 1 | 148.9s | 52.1% 8 | 105.1s | 17.6% 16 | 96.9s | 8.7% 32 | 103.9s | 6.3% 64 | 89.4s | 6.6%128 | 80.0s | 3.8% Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
Simon,

You may know we've built something similar and have seen similar gains.
We're planning a modification that I think you should consider: when
there is a sequential scan of a table larger than the size of
shared_buffers, we are allowing the scan to write through the
shared_buffers cache.

The hypothesis is that if a relation is of a size equal to or less than
the size of shared_buffers, it is "cacheable" and should use the standard
LRU approach to provide for reuse.

- Luke

On 3/12/07 3:08 AM, "Simon Riggs" <simon@2ndquadrant.com> wrote:

> On Mon, 2007-03-12 at 09:14 +0000, Simon Riggs wrote:
>> On Mon, 2007-03-12 at 16:21 +0900, ITAGAKI Takahiro wrote:
>
>>> With the default value of scan_recycle_buffers (= 0), VACUUM seems
>>> to use all of the buffers in the pool, just like existing sequential
>>> scans. Is this intended?
>>
>> Yes, but it's not very useful for testing to have done that. I'll do
>> another version within the hour that sets N=0 (only) back to current
>> behaviour for VACUUM.
>
> New test version enclosed, where scan_recycle_buffers = 0 doesn't change
> existing VACUUM behaviour.
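Luke's "cacheable" hypothesis reduces to a simple size test at scan
start. A sketch, with hypothetical names throughout except NBuffers (the
bufmgr's count of shared buffers):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint32_t BlockNumber;

    extern int NBuffers;    /* number of shared buffers, as in the real bufmgr */

    static bool
    scan_should_recycle(BlockNumber rel_nblocks)
    {
        /* A relation no larger than shared_buffers is "cacheable": let
         * it populate the cache normally so later scans can reuse the
         * pages. Anything larger would sweep the whole cache with no
         * prospect of reuse, so confine it to a small recycle ring. */
        return rel_nblocks > (BlockNumber) NBuffers;
    }

The decision is made once per scan, so the test adds nothing to the
per-page I/O path.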
On Mon, 2007-03-12 at 22:16 -0700, Luke Lonergan wrote:

> You may know we've built something similar and have seen similar gains.

Cool.

> We're planning a modification that I think you should consider: when there
> is a sequential scan of a table larger than the size of shared_buffers, we
> are allowing the scan to write through the shared_buffers cache.

Write? For which operations?

I was thinking to do this for bulk writes also, but it would require
changes to bgwriter's cleaning sequence. Are you saying to write say ~32
buffers then fsync them, rather than letting bgwriter do that? Then
allow those buffers to be reused?

> The hypothesis is that if a relation is of a size equal to or less than the
> size of shared_buffers, it is "cacheable" and should use the standard LRU
> approach to provide for reuse.

Sounds reasonable. Please say more.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
On Tue, 2007-03-13 at 13:40 +0900, ITAGAKI Takahiro wrote:
> "Simon Riggs" <simon@2ndquadrant.com> wrote:
>
> > > > With the default value of scan_recycle_buffers (= 0), VACUUM seems
> > > > to use all of the buffers in the pool, just like existing
> > > > sequential scans. Is this intended?
> >
> > New test version enclosed, where scan_recycle_buffers = 0 doesn't change
> > existing VACUUM behaviour.
>
> This is a result with scan_recycle_buffers.v3.patch. I used normal VACUUM
> with background load using a slowed-down pgbench in this instance. I
> believe the patch is useful in normal cases, not only for VACUUM FREEZE.
>
>    N  | time   | WAL flush(*)
>  -----+--------+-------------
>     0 | 112.8s | 44.3%
>     1 | 148.9s | 52.1%
>     8 | 105.1s | 17.6%
>    16 |  96.9s |  8.7%
>    32 | 103.9s |  6.3%
>    64 |  89.4s |  6.6%
>   128 |  80.0s |  3.8%

Looks good. Not sure what value of N to pick for normal use. The
objectives are

i) don't let VACUUMs spoil the cache
ii) speed up standalone VACUUMs
iii) don't let VACUUM cause others to repeatedly flush WAL

I'm thinking N=16 meets all three objectives. We could make VACUUM go
faster still, but only by knocking more blocks out of cache that someone
doing real work might need. That would slow them down almost as much as
forcing them to flush WAL, so I'd want to be conservative with VACUUM.
Does anybody think we need a new parameter for this, or are we happy at
16 buffers in the recycle loop for VACUUM?

At this point I should note something I haven't mentioned before: VACUUMs
force other backends to flush out WAL only when we have an I/O bound
workload. If the database already fits in memory then BufferAlloc never
needs to run, and therefore we don't need to flush WAL. So I can
understand why the effect of WAL flushing may not have been noticed by
many testers.

--
Simon Riggs
EnterpriseDB   http://www.enterprisedb.com
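The WAL-flush objective deserves a note on mechanism. Under the
WAL-before-data rule, a dirty page cannot be written out until the WAL up
to that page's LSN has reached disk, so a VACUUM recycling just-dirtied
buffers keeps hitting this condition when N is small; a larger ring gives
the WAL more time to be flushed by someone else before a buffer comes
around for reuse. A sketch with hypothetical names:

    #include <stdint.h>

    typedef uint64_t XLogRecPtr;

    extern XLogRecPtr GetFlushedLSN(void);      /* hypothetical: WAL flushed so far */
    extern void       XLogFlushTo(XLogRecPtr);  /* hypothetical: force WAL fsync */
    extern void       write_page(int buf);      /* hypothetical: write data page */

    static void
    flush_buffer_for_reuse(int buf, XLogRecPtr page_lsn)
    {
        /* WAL-before-data: if this page's latest change hasn't reached
         * the WAL on disk yet, we must fsync WAL first -- the stall that
         * recycling inflicts when N is too small (see the N=1 row in the
         * tables above). */
        if (page_lsn > GetFlushedLSN())
            XLogFlushTo(page_lsn);

        write_page(buf);
    }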
Simon,

On 3/13/07 2:37 AM, "Simon Riggs" <simon@2ndquadrant.com> wrote:

>> We're planning a modification that I think you should consider: when there
>> is a sequential scan of a table larger than the size of shared_buffers, we
>> are allowing the scan to write through the shared_buffers cache.
>
> Write? For which operations?

I'm actually just referring to the sequential scan "writing into" the
shared buffers cache; sorry for the "write through" :-)

> I was thinking to do this for bulk writes also, but it would require
> changes to bgwriter's cleaning sequence. Are you saying to write say ~32
> buffers then fsync them, rather than letting bgwriter do that? Then
> allow those buffers to be reused?

Off topic, but we think we just found the reason(s) for the abysmal heap
insert performance of pgsql and are working on a fix to that as well. It
involves two main things: the ping-ponging seeks used to extend a
relfile, and the bgwriter not flushing aggressively enough. We're hoping
to move the net heap insert rate from 12 MB/s for a single stream to
something more like 100 MB/s per stream, but it may take a week to get
some early results and find out if we're on the right track. We've been
wrong on this before ;-)

- Luke
Simon, is this patch ready to be added to the patch queue? I assume not.

---------------------------------------------------------------------------

Simon Riggs wrote:
> On Mon, 2007-03-12 at 09:14 +0000, Simon Riggs wrote:
> > On Mon, 2007-03-12 at 16:21 +0900, ITAGAKI Takahiro wrote:
>
> > > With the default value of scan_recycle_buffers (= 0), VACUUM seems
> > > to use all of the buffers in the pool, just like existing sequential
> > > scans. Is this intended?
> >
> > Yes, but it's not very useful for testing to have done that. I'll do
> > another version within the hour that sets N=0 (only) back to current
> > behaviour for VACUUM.
>
> New test version enclosed, where scan_recycle_buffers = 0 doesn't change
> existing VACUUM behaviour.
>
> --
> Simon Riggs
> EnterpriseDB   http://www.enterprisedb.com

[ Attachment, skipping... ]

--
Bruce Momjian  <bruce@momjian.us>          http://momjian.us
EnterpriseDB                               http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
"test" version, but I am putting in the queue so we can track it there. Your patch has been added to the PostgreSQL unapplied patches list at: http://momjian.postgresql.org/cgi-bin/pgpatches It will be applied as soon as one of the PostgreSQL committers reviews and approves it. --------------------------------------------------------------------------- Simon Riggs wrote: > On Mon, 2007-03-12 at 09:14 +0000, Simon Riggs wrote: > > On Mon, 2007-03-12 at 16:21 +0900, ITAGAKI Takahiro wrote: > > > > With the default > > > value of scan_recycle_buffers(=0), VACUUM seems to use all of buffers in pool, > > > just like existing sequential scans. Is this intended? > > > > Yes, but its not very useful for testing to have done that. I'll do > > another version within the hour that sets N=0 (only) back to current > > behaviour for VACUUM. > > New test version enclosed, where scan_recycle_buffers = 0 doesn't change > existing VACUUM behaviour. > > -- > Simon Riggs > EnterpriseDB http://www.enterprisedb.com > [ Attachment, skipping... ] > > ---------------------------(end of broadcast)--------------------------- > TIP 3: Have you checked our extensive FAQ? > > http://www.postgresql.org/docs/faq -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +