Thread: Limit of bgwriter_lru_maxpages of max. 1000?
Hello,

I think the limit of bgwriter_lru_maxpages of max. 1000 doesn't make any sense, because in practice it limits the performance of the database (version 8.3). Under heavy write load, dirty pages in the buffer cache must be written out. With the default config this is effectively limited to:

bgwriter_delay=200ms
bgwriter_lru_maxpages=100
8k*bgwriter_lru_maxpages*1000/bgwriter_delay = 8k*100*1000/200 = 4000k = 4MB/s

Isn't that a major performance bottleneck in the default config?

bgwriter_delay=200ms
bgwriter_lru_maxpages=1000
8k*bgwriter_lru_maxpages*1000/bgwriter_delay = 8k*1000*1000/200 = 40000k = 40MB/s

Still not a very high number for current I/O loads. Lowering bgwriter_delay is possible, but I think overhead is too much and still there is a limit of 800MB/s involved:

bgwriter_delay=10ms
bgwriter_lru_maxpages=1000
8k*bgwriter_lru_maxpages*1000/bgwriter_delay = 8k*1000*1000/10 = 800000k = 800MB/s

So I think it would be better to allow a configuration such as:

bgwriter_delay=50ms
bgwriter_lru_maxpages=100000
8k*bgwriter_lru_maxpages*1000/bgwriter_delay = 8k*100000*1000/50 = 16000000k = 16000MB/s

So in fact I think bgwriter_lru_maxpages should be limited to 100000 if limited at all. Is my reasoning correct? Any comments?

Thnx.

Ciao,
Gerhard

--
http://www.wiesinger.com/
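The same ceiling can also be computed directly in SQL; the following is just a sketch of the arithmetic above with the knob values written in as literals (8 kB block size assumed), here for the bgwriter_lru_maxpages=1000, bgwriter_delay=200ms case:

SELECT 8                 -- kB per page (8 kB block size assumed)
       * 1000            -- bgwriter_lru_maxpages
       * (1000.0 / 200)  -- bgwriter rounds per second at bgwriter_delay = 200ms
       / 1000            -- kB -> MB
       AS max_lru_write_mb_per_sec;  -- ~40 MB/s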
On Sun, 27 Sep 2009, Gerhard Wiesinger wrote:

> Lowering bgwriter_delay is possible, but I think overhead is too much and
> still there is a limit of 800MB/s involved:

Stuff written by the background writer turns into largely random I/O. 800MB/s of random writes is so large of a number it's only recently become remotely possible; a RAID0 of SSD devices might manage it. No hardware available until very recently had any hope of getting that sort of performance.

In any case, I would wager you'll run into one of many other bottlenecks in PostgreSQL and/or currently available system/disk hardware long before the background writer limit gets important.

> So in fact I think bgwriter_lru_maxpages should be limited to 100000 if
> limited at all.

The current limit is based on the assumption that people will set it to values way too high if allowed, to the point where it's counterproductive. That's exactly what people used to do with early background writer designs. I think you're wandering down the same road, where what it actually does and what you think it does are not the same thing at all. Much of the important disk I/O coming out of the database should be related to checkpoints, not the background writer, and there is no limit on that I/O.

If you think you've got a situation where the current limits are not sufficient, the path to argue that would start with showing what you're seeing in pg_stat_bgwriter. I can imagine some results from there on a system with a very high rate of I/O available that would suggest the current limits are too small. I've never come close to actually seeing such results in the real world though, and if you're not already monitoring those numbers on a real system I'd suggest you start there rather than presuming there's a design limitation here.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
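For anyone starting that kind of monitoring, the raw counters being referred to can be pulled with a query as simple as this (a sketch, not taken from the message above; column names as of 8.3):

SELECT checkpoints_timed, checkpoints_req,
       buffers_checkpoint, buffers_clean, maxwritten_clean,
       buffers_backend, buffers_alloc
FROM pg_stat_bgwriter;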
On Fri, 2 Oct 2009, Greg Smith wrote:

> On Sun, 27 Sep 2009, Gerhard Wiesinger wrote:
>
>> Lowering bgwriter_delay is possible, but I think overhead is too much and
>> still there is a limit of 800MB/s involved:
>
> Stuff written by the background writer turns into largely random I/O. 800MB/s
> of random writes is so large of a number it's only recently become remotely
> possible; a RAID0 of SSD devices might manage it. No hardware available
> until very recently had any hope of getting that sort of performance.
>
> In any case, I would wager you'll run into one of many other bottlenecks in
> PostgreSQL and/or currently available system/disk hardware long before the
> background writer limit gets important.

Of course, 800MB/s is a theoretical maximum limit I came up with. But with SSDs this might become possible.

>> So in fact I think bgwriter_lru_maxpages should be limited to 100000 if
>> limited at all.
>
> The current limit is based on the assumption that people will set it to
> values way too high if allowed, to the point where it's counterproductive.
> That's exactly what people used to do with early background writer designs.
> I think you're wandering down the same road, where what it actually does and
> what you think it does are not the same thing at all. Much of the important
> disk I/O coming out of the database should be related to checkpoints, not the
> background writer, and there is no limit on that I/O.

In my experience flushing I/O as soon as possible is the best solution.

Think of the following scenario: the bgwriter is currently limited to 4MB/s, but you have about 10MB/s of random I/O capacity (a normal low-cost system). So utilization would be only 40%, and you could write even more. At checkpoint time you then get a spike which the I/O system can't handle at all, and performance drops to nearly zero because of the I/O spike (e.g. 500% of the available I/O capacity is needed). IMHO such scenarios should be avoided.

> If you think you've got a situation where the current limits are not
> sufficient, the path to argue that would start with showing what you're
> seeing in pg_stat_bgwriter. I can imagine some results from there on a
> system with a very high rate of I/O available that would suggest the current
> limits are too small. I've never come close to actually seeing such results
> in the real world though, and if you're not already monitoring those numbers
> on a real system I'd suggest you start there rather than presuming there's a
> design limitation here.

On a nearly idle database that occasionally runs some performance tests:

SELECT
  buffers_checkpoint/buffers_clean AS checkpoint_spike,
  ROUND(100.0*buffers_checkpoint/(buffers_checkpoint + buffers_clean + buffers_backend),2) AS checkpoint_percentage,
  ROUND(100.0*buffers_clean/(buffers_checkpoint + buffers_clean + buffers_backend),2) AS pg_writer_percentage,
  ROUND(100.0*buffers_backend/(buffers_checkpoint + buffers_clean + buffers_backend),2) AS backend_percentage
FROM pg_stat_bgwriter;

 checkpoint_spike | checkpoint_percentage | pg_writer_percentage | backend_percentage
------------------+-----------------------+----------------------+--------------------
               31 |                 90.58 |                 2.92 |               6.50

So flushing typically happens at checkpoint time. 6.5% of all blocks were written to disk by the backends, which IMHO says: the bgwriter is too slow, so the backends have to do the work themselves.

So I'd like to do some tests with new statistics. Any fast way to reset statistics for all databases for pg_stat_bgwriter?

Thnx.

Ciao,
Gerhard
On Fri, 2 Oct 2009, Gerhard Wiesinger wrote:

> In my experience flushing I/O as soon as possible is the best solution.

That's what everyone assumes, but detailed benchmarks of PostgreSQL don't actually support that view given how the database operates. We went through a lot of work in 8.3 related to how to optimize the database as a system that disproved some of the theories about what would work well here.

What happens if you're really aggressive about writing blocks out as soon as they're dirty is that you waste a lot of I/O on things that just get dirty again later. Since checkpoint time is the only period where blocks *must* get written, the approach that worked the best for reducing checkpoint spikes was to spread the checkpoint writes out over a very wide period. The only remaining work that made sense for the background writer was to tightly focus its I/O on blocks that are about to be evicted due to low usage no matter what.

In most cases where people think they need more I/O from the background writer, what you actually want is to increase checkpoint_segments, checkpoint_completion_target, and checkpoint_timeout in order to spread the checkpoint I/O out over a longer period. The stats you provided suggest this is working exactly as intended.

As far as work to improve the status quo, IMHO the next thing to improve is getting the fsync calls made at checkpoint time more intelligently spread over the whole period. That's got a better payback than trying to make the background writer more aggressive, which is basically a doomed cause.

> So I'd like to do some tests with new statistics. Any fast way to reset
> statistics for all databases for pg_stat_bgwriter?

No, that's an open TODO item I keep meaning to fix; we lost that capability at one point. What I do is create a table that looks just like it, but with a time stamp, and save snapshots to that table. Then a view on top can generate just the deltas between two samples to show activity during that time. It's handy to have such a history anyway.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
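The snapshot-table workaround can be sketched in SQL roughly like this (table and column names are made up here for illustration, not the actual schema used above; assumes 8.3's pg_stat_bgwriter columns):

CREATE TABLE bgwriter_snapshot AS
  SELECT now() AS snapped_at, * FROM pg_stat_bgwriter;

-- later, take further samples:
INSERT INTO bgwriter_snapshot
  SELECT now(), * FROM pg_stat_bgwriter;

-- activity between the two most recent samples:
SELECT b.snapped_at - a.snapped_at                 AS sample_interval,
       b.buffers_checkpoint - a.buffers_checkpoint AS checkpoint_writes,
       b.buffers_clean      - a.buffers_clean      AS bgwriter_writes,
       b.buffers_backend    - a.buffers_backend    AS backend_writes
FROM bgwriter_snapshot a, bgwriter_snapshot b
WHERE b.snapped_at = (SELECT max(snapped_at) FROM bgwriter_snapshot)
  AND a.snapped_at = (SELECT max(snapped_at) FROM bgwriter_snapshot
                      WHERE snapped_at < b.snapped_at);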
On Fri, Oct 2, 2009 at 2:19 PM, Greg Smith <gsmith@gregsmith.com> wrote:
> On Fri, 2 Oct 2009, Gerhard Wiesinger wrote:
>
>> In my experience flushing I/O as soon as possible is the best solution.
>
> That's what everyone assumes, but detailed benchmarks of PostgreSQL don't
> actually support that view given how the database operates. We went through
> a lot of work in 8.3 related to how to optimize the database as a system
> that disproved some of the theories about what would work well here.
>
> What happens if you're really aggressive about writing blocks out as soon as
> they're dirty is that you waste a lot of I/O on things that just get dirty
> again later. Since checkpoint time is the only period where blocks *must*
> get written, the approach that worked the best for reducing checkpoint
> spikes was to spread the checkpoint writes out over a very wide period.

The session servers we have at work are a perfect match for this. By increasing checkpoint segments to 100 (or more), timeout to 60 minutes, and setting completion target lower (currently 0.25) we have reduced our I/O wait from 10-15% to nearly nothing. These are databases that update the same rows over and over with session data as the user navigates the system, so writing things out as early as possible is a REAL bad idea.

> In most cases where people think they need more I/O from the background
> writer, what you actually want is to increase checkpoint_segments,
> checkpoint_completion_target, and checkpoint_timeout in order to spread the
> checkpoint I/O out over a longer period. The stats you provided suggest
> this is working exactly as intended.

I found that lowering checkpoint completion target was what helped. Does that seem counter-intuitive to you?
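For concreteness, the settings described above would look roughly like this in postgresql.conf (values taken straight from the description, shown only as an illustration, not a general recommendation):

checkpoint_segments = 100
checkpoint_timeout = 60min
checkpoint_completion_target = 0.25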
Scott Marlowe <scott.marlowe@gmail.com> writes:
> The session servers we have at work are a perfect match for this. By
> increasing checkpoint segments to 100 (or more), timeout to 60
> minutes, and setting completion target lower (currently 0.25) we have
> reduced our I/O wait from 10-15% to nearly nothing. These are
> databases that update the same rows over and over with session data as
> the user navigates the system, so writing things out as early as
> possible is a REAL bad idea.

> I found that lowering checkpoint completion target was what helped.
> Does that seem counter-intuitive to you?

Once the checkpoint completion target time is high enough that the checkpoint-induced I/O is just background noise for you, increasing the target further won't make for any noticeable further improvement. I'm not sure I see how it would make things *worse* though. Maybe, even though the I/O wait is "nearly nothing", the I/O is still forcing enough extra seeks to slow normal disk operations? If so, getting the checkpoint out of the way sooner so that you can get back to full speed operation sooner might be better than reducing the rate of checkpoint I/Os below the nearly-noise level. I'm just guessing about that though.

What were you measuring --- the performance within checkpoint, the performance outside it, or the whole-cycle average?

			regards, tom lane
Tom Lane wrote:
> Scott Marlowe <scott.marlowe@gmail.com> writes:
> > The session servers we have at work are a perfect match for this. By
> > increasing checkpoint segments to 100 (or more), timeout to 60
> > minutes, and setting completion target lower (currently 0.25) we have
> > reduced our I/O wait from 10-15% to nearly nothing. These are
> > databases that update the same rows over and over with session data as
> > the user navigates the system, so writing things out as early as
> > possible is a REAL bad idea.
>
> > I found that lowering checkpoint completion target was what helped.
> > Does that seem counter-intuitive to you?
>
> Once the checkpoint completion target time is high enough that the
> checkpoint-induced I/O is just background noise for you, increasing the
> target further won't make for any noticeable further improvement. I'm
> not sure I see how it would make things *worse* though. Maybe, even
> though the I/O wait is "nearly nothing", the I/O is still forcing enough
> extra seeks to slow normal disk operations? If so, getting the
> checkpoint out of the way sooner so that you can get back to full speed
> operation sooner might be better than reducing the rate of checkpoint
> I/Os below the nearly-noise level. I'm just guessing about that though.
>
> What were you measuring --- the performance within checkpoint, the
> performance outside it, or the whole-cycle average?

My guess is that having a very long fuzzy checkpoint time means that when you fsync you are fsync'ing lots of data, both your checkpoint data and other writes performed by backends.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes:
> My guess is that having a very long fuzzy checkpoint time means that
> when you fsync you are fsync'ing lots of data, both your checkpoint data
> and other writes performed by backends.

Hmm, could be ... although that would imply that shared_buffers should be kicked up some more, so the backends aren't doing so many writes for themselves.

			regards, tom lane
On Fri, Oct 2, 2009 at 3:55 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Scott Marlowe <scott.marlowe@gmail.com> writes:
>> The session servers we have at work are a perfect match for this. By
>> increasing checkpoint segments to 100 (or more), timeout to 60
>> minutes, and setting completion target lower (currently 0.25) we have
>> reduced our I/O wait from 10-15% to nearly nothing. These are
>> databases that update the same rows over and over with session data as
>> the user navigates the system, so writing things out as early as
>> possible is a REAL bad idea.
>
>> I found that lowering checkpoint completion target was what helped.
>> Does that seem counter-intuitive to you?
>
> Once the checkpoint completion target time is high enough that the
> checkpoint-induced I/O is just background noise for you, increasing the
> target further won't make for any noticeable further improvement. I'm
> not sure I see how it would make things *worse* though. Maybe, even
> though the I/O wait is "nearly nothing", the I/O is still forcing enough
> extra seeks to slow normal disk operations? If so, getting the
> checkpoint out of the way sooner so that you can get back to full speed
> operation sooner might be better than reducing the rate of checkpoint
> I/Os below the nearly-noise level. I'm just guessing about that though.
> What were you measuring --- the performance within checkpoint, the
> performance outside it, or the whole-cycle average?

I was measuring it over an extended period, say a few hours. This db is small enough to fit in memory easily (1.5GB on a machine with 6GB of RAM doing nothing else), so all the I/O is basically blocks out, with none in. Since the checkpoints should have a LOT of the same records updated over and over, I'm guessing that a very low completion target lets it collect a lot of those together and just write out the last one.

The nice thing is I can benchmark one of these machines against the other, since they're basically identical twins doing the same job, and see how changes like this affect them. At first I was at something like a 0.5 completion target, and increasing the checkpoint segments did have an effect, but lowering the completion target was as much of a gain as the increased checkpoint segments.

So, I think the tuning of a small db like this that can fit in memory is a whole different ball game than one that's several times larger than memory.
On Fri, 2 Oct 2009, Scott Marlowe wrote:

> I found that lowering checkpoint completion target was what helped.
> Does that seem counter-intuitive to you?

Generally, but there are plenty of ways you can get into a state where a short but not immediate checkpoint is better. For example, consider a case where your buffer cache is filled with really random stuff. There's a sorting horizon in effect, where your OS and/or controller makes decisions about what order to write things based on the data it already has around, not really knowing what's coming in the near future.

Let's say you've got 256MB of cache in the disk controller, you have 1GB of buffer cache to write out, and there's 8GB of RAM in the server so it can cache the whole write. If you wrote it out in a big burst, the OS would elevator sort things and feed them to the controller in disk order. Very efficient, one pass over the disk to write everything out.

But if you broke that up into 256MB write pieces instead on the database side, pausing after each chunk was written, the OS would only be sorting across 256MB at a time, and would basically fill the controller cache up with that before it saw the larger picture. The disk controller can end up making seek decisions with that small of a planning window now that are not really optimal, making more passes over the disk to write the same data out. If the timing between the DB write cache and the OS is pathologically out of sync here, the result can end up being slower than had you just written out in bigger chunks instead. This is one reason I'd like to see fsync calls happen earlier and more evenly than they do now, to reduce these edge cases.

The usual approach I take in this situation is to reduce the amount of write caching the OS does, so at least things get more predictable. A giant write cache always gives the best average performance, but the worst-case behavior increases at the same time.

There was a patch floating around at one point that sorted all the checkpoint writes by block order, which would reduce how likely it is you'll end up in one of these odd cases. That turned out to be hard to nail down the benefit of though, because in a typical case the OS caching here trumps any I/O scheduling you try to do in user land, and it's hard to repeatably generate scattered data in a benchmark situation.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
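No specific knobs for shrinking the OS write cache are named above; on Linux, one common way (an assumption about the platform, shown purely as an illustration) is to lower the dirty-memory sysctls:

# /etc/sysctl.conf -- hypothetical illustration only
vm.dirty_background_ratio = 1   # kick off background writeback sooner
vm.dirty_ratio = 10             # block writers once 10% of RAM is dirty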
On Fri, 2 Oct 2009, Greg Smith wrote:

> On Fri, 2 Oct 2009, Scott Marlowe wrote:
>
>> I found that lowering checkpoint completion target was what helped.
>> Does that seem counter-intuitive to you?

I set it to 0.0 now.

> Generally, but there are plenty of ways you can get into a state where a
> short but not immediate checkpoint is better. For example, consider a case
> where your buffer cache is filled with really random stuff. There's a
> sorting horizon in effect, where your OS and/or controller makes decisions
> about what order to write things based on the data it already has around, not
> really knowing what's coming in the near future.

Ok, if the checkpoint doesn't block anything during normal operation, its duration doesn't really matter.

> Let's say you've got 256MB of cache in the disk controller, you have 1GB of
> buffer cache to write out, and there's 8GB of RAM in the server so it can
> cache the whole write. If you wrote it out in a big burst, the OS would
> elevator sort things and feed them to the controller in disk order. Very
> efficient, one pass over the disk to write everything out.
>
> But if you broke that up into 256MB write pieces instead on the database
> side, pausing after each chunk was written, the OS would only be sorting
> across 256MB at a time, and would basically fill the controller cache up with
> that before it saw the larger picture. The disk controller can end up making
> seek decisions with that small of a planning window now that are not really
> optimal, making more passes over the disk to write the same data out. If the
> timing between the DB write cache and the OS is pathologically out of sync
> here, the result can end up being slower than had you just written out in
> bigger chunks instead. This is one reason I'd like to see fsync calls happen
> earlier and more evenly than they do now, to reduce these edge cases.
>
> The usual approach I take in this situation is to reduce the amount of write
> caching the OS does, so at least things get more predictable. A giant write
> cache always gives the best average performance, but the worst-case behavior
> increases at the same time.
>
> There was a patch floating around at one point that sorted all the checkpoint
> writes by block order, which would reduce how likely it is you'll end up in
> one of these odd cases. That turned out to be hard to nail down the benefit
> of though, because in a typical case the OS caching here trumps any I/O
> scheduling you try to do in user land, and it's hard to repeatably generate
> scattered data in a benchmark situation.

Ok, with a basic insert test and a systemtap script (http://www.wiesinger.com/opensource/systemtap/postgresql-checkpoint.stp) the checkpoint is still a major I/O spike:
################################################################################
Buffers between : Sun Oct 4 18:29:50 2009, synced 55855 buffer(s), flushed 744 buffer(s) between checkpoints
Checkpoint start: Sun Oct 4 18:29:50 2009
Checkpoint end  : Sun Oct 4 18:29:56 2009, synced 12031 buffer(s), flushed 12031 buffer(s)
################################################################################
Buffers between : Sun Oct 4 18:30:20 2009, synced 79000 buffer(s), flushed 0 buffer(s) between checkpoints
Checkpoint start: Sun Oct 4 18:30:20 2009
Checkpoint end  : Sun Oct 4 18:30:26 2009, synced 10753 buffer(s), flushed 10753 buffer(s)
################################################################################
Buffers between : Sun Oct 4 18:30:50 2009, synced 51120 buffer(s), flushed 1007 buffer(s) between checkpoints
Checkpoint start: Sun Oct 4 18:30:50 2009
Checkpoint end  : Sun Oct 4 18:30:56 2009, synced 11899 buffer(s), flushed 11912 buffer(s)
################################################################################

Ok, I further had a look at the code to understand the behavior of the buffer cache and the background writer, since the behavior didn't seem logical to me. As far as I saw, the basic algorithm is:

1.) Normally (outside checkpoints) only dirty and non recently used pages (usage_count == 0) are flushed to disk. I think that's basically fine as a strategy, as indexes might update blocks more than once. It's also ok that blocks are written but not flushed (that will be done at checkpoint time).

2.) At checkpoints, write out all dirty buffers and flush all previously written and newly written ones. Spreading the I/O also seems ok to me now.

BUT: I think I've found 2 major bugs in the implementation (or I didn't understand something correctly). Codebase analyzed is 8.3.8, since I currently use it.

##############################################
Bug1: usage_count is IMHO not consistent
##############################################
I think this has been introduced with:
http://git.postgresql.org/gitweb?p=postgresql.git;a=blobdiff;f=src/backend/storage/buffer/bufmgr.c;h=6e6b862273afea40241e410e18fd5d740c2b1643;hp=97f7822077de683989a064cdc624a025f85e54ab;hb=ebf3d5b66360823edbdf5ac4f9a119506fccd4c0;hpb=98ffa4e9bd75c8124378c712933bb13d2697b694

So either the usage_count = 1 init in BufferAlloc is not correct, or SyncOneBuffer() with skip_recently_used and usage_count=1 is not correct:

	if (bufHdr->refcount == 0 && bufHdr->usage_count == 0)
		result |= BUF_REUSABLE;
	else if (skip_recently_used)
	{
		/* Caller told us not to write recently-used buffers */
		UnlockBufHdr(bufHdr);
		return result;
	}

##############################################
Bug2: Double iteration of buffers
##############################################
As you can see in the calling tree below, there is a double iteration over the buffers involved. This might be a major performance bottleneck.

// Checkpoint buffer sync
BufferSync()
  loop buffers:
    SyncOneBuffer()        // skip_recently_used=false
    CheckpointWriteDelay() // Bug here?: since BgBufferSync() is called, the buffers are iterated again!!

CheckpointWriteDelay()
  if (IsCheckpointOnSchedule()) {
    BgBufferSync()
    CheckArchiveTimeout()
    BgWriterNap()
  }

BgBufferSync()
  loop buffers:
    SyncOneBuffer()        // skip_recently_used=true, ok here since we don't want to flush recently
                           // used blocks (e.g. indexes). But improvement (e.g. aging) is IMHO necessary

##############################################

BTW: Are there some tests available showing how fast a buffer cache hit is and how fast a disk cache hit is (not in the buffer cache but in the disk cache)?
I ask because a lot of locking is involved in the code.

BTW2: Oracle buffercache and background writer strategy is also interesting.
http://download.oracle.com/docs/cd/B19306_01/server.102/b14220/process.htm#i7259
http://download.oracle.com/docs/cd/B19306_01/server.102/b14220/memory.htm#i10221

Thnx for feedback.

Ciao,
Gerhard

--
http://www.wiesinger.com/

-----------------------------------
src/backend/postmaster/bgwriter.c
-----------------------------------
BackgroundWriterMain()
  loop forever:
    timeout:
      CreateCheckPoint()  // NON_IMMEDIATE
      smgrcloseall()
    nontimeout:
      BgBufferSync()
    sleep
    // Rest is done in XLogWrite()

RequestCheckpoint()
  CreateCheckPoint() or signal through shared memory segment
  smgrcloseall()

CheckpointWriteDelay()
  if (IsCheckpointOnSchedule()) {
    BgBufferSync()
    CheckArchiveTimeout()
    BgWriterNap()
  }

-----------------------------------
src/backend/commands/dbcommands.c
-----------------------------------
createdb()
  RequestCheckpoint()  // CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT
dropdb()
  RequestCheckpoint()  // CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT

-----------------------------------
src/backend/commands/tablespace.c
-----------------------------------
DropTableSpace()
  RequestCheckpoint()

-----------------------------------
src/backend/tcop/utility.c
-----------------------------------
ProcessUtility()  // Command CHECKPOINT;
  RequestCheckpoint()  // CHECKPOINT_IMMEDIATE | CHECKPOINT_FORCE | CHECKPOINT_WAIT

-----------------------------------
src/backend/access/transam/xlog.c
-----------------------------------
CreateCheckPoint()
  CheckPointGuts()
    CheckPointCLOG()
    CheckPointSUBTRANS()
    CheckPointMultiXact()
    CheckPointBuffers()  /* performs all required fsyncs */
    CheckPointTwoPhase()

XLogWrite()
  too_much_transaction_log_consumed:
    RequestCheckpoint()  // NON_IMMEDIATE

pg_start_backup()
  RequestCheckpoint()  // CHECKPOINT_FORCE | CHECKPOINT_WAIT

XLogFlush()  // Flush transaction log

-----------------------------------
src/backend/storage/buffer/bufmgr.c
-----------------------------------
CheckPointBuffers()
  BufferSync()
  smgrsync()

// Checkpoint buffer sync
BufferSync()
  loop buffers:
    SyncOneBuffer()        // skip_recently_used=false
    CheckpointWriteDelay() // Bug here?: since BgBufferSync() is called, the buffers are iterated again!!

// Backgroundwriter buffer sync
BgBufferSync()
  loop buffers:
    SyncOneBuffer()        // skip_recently_used=true, ok here since we don't want to flush recently
                           // used blocks (e.g. indexes). But improvement (e.g. aging) is IMHO necessary

SyncOneBuffer()  // Problem with skip_recently_used and usage_count=1 (not flushed!)
  FlushBuffer()

FlushBuffer()
  XLogFlush()
  smgrwrite()

CheckPointBuffers()
  BufferSync()
  smgrsync()

BufferAlloc()  // Init with usage_count=1 is not logical => will never be flushed by the bgwriter!
  PinBuffer()

PinBuffer()
  usage_count++;

-----------------------------------
src/backend/storage/buffer/localbuf.c
-----------------------------------
LocalBufferAlloc()
  usage_count++;

-----------------------------------
src/backend/storage/smgr/md.c
-----------------------------------
smgrwrite() = mdwrite() => writes the file (not flushed immediately) but registers it with
                           register_dirty_segment for later flushing at checkpoint time
smgrsync() = mdsync()   => syncs the registered, not-yet-flushed files
On Sun, 4 Oct 2009, Gerhard Wiesinger wrote: >> On Fri, 2 Oct 2009, Scott Marlowe wrote: >> >>> I found that lowering checkpoint completion target was what helped. >>> Does that seem counter-intuitive to you? >> > > I set it to 0.0 now. If you set that to 0.0, the whole checkpoing spreading logic doesn't apply like it's supposed to. I'm not sure what the results you posted mean now. If you had it set to 0 and saw a bad spike (which is how I read your message), I'd say "yes, that's what happens when you do reduce that parameter, so don't do that". If you meant something else please clarify. Thanks for the dtrace example, I suggested we add those checkpoint probes in there and someone did, but I hadn't seen anybody use them for anything yet. > Bug1: usage_count is IHMO not consistent It's a bit hack-ish, but the changes made to support multiple buffer use strategies introduced by the "Make large sequential scans and VACUUMs work in a limited-size ring" commit are reasonable even if they're not as consistent as we'd like. Those changes were supported by benchmarks proving their utility, which always trump theoretical "that shouldn't work better!" claims when profiling performance. Also, they make sense to me, but I've spent a lot of time staring at pg_buffercache output to get a feel for what shows up in there under various circumstances. That's where I'd suggest you go if this doesn't seem right to you; run some real database tests and use pg_buffercache to see what's inside the cache when you're done. What's in there and what I expected to be in there weren't always the same thing, and it's interesting to note how that changes as shared_buffers increases. I consider some time studying that a pre-requisite to analyzing performance of this code. > Bug2: Double iteration of buffers > As you can seen in the calling tree below there is double iteration with > buffers involved. This might be a major performance bottleneck. Hmmm, this might be a real bug causing scans through the buffer cache to go twice as fast as intended. Since the part you suggest is doubled isn't very intensive or called all that often, there's no way it can be a major issue though. That's based on knowing what the code does and how much it was called, as well as some confidence that if it were really a *major* problem, it would have shown up on the extensive benchmarks done on all the code paths you're investigating. > BTW: Are there some tests available how fast a buffer cache hit is and a disk > cache hit is (not in the buffer cache but in the disk cache)? I'll asked, > because a lot of locking is involved in the code. I did some once but didn't find anything particularly interesting about the results. Since you seem to be on a research tear here, it would be helpful to have a script to test that out available, I wasn't able to release mine and something dtrace based would probably be better than the approach I used (I threw a bunch of gettimeofdata calls into the logs and post-processed them with a script). > BTW2: Oracle buffercache and background writer strategy is also interesting. As a rule, we don't post links to other database implementation details here, as those can have patented design details we'd prefer not to intentionally re-implement. Much of Oracle's design here doesn't apply here anyway, as it was done in the era when all of their writes were synchronous. 
That required them to worry about doing a good job on some things in their background writer that we shrug off and let OS writes combined with fsync handle instead.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
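As an illustration of the kind of pg_buffercache inspection suggested above (just a sketch; it assumes the contrib/pg_buffercache module is installed, and the grouping is an arbitrary choice):

-- Distribution of shared buffers by usage_count and dirty flag,
-- a quick way to see what the clock-sweep algorithm has kept around.
SELECT usagecount, isdirty, count(*) AS buffers
FROM pg_buffercache
GROUP BY usagecount, isdirty
ORDER BY usagecount, isdirty;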
On Mon, 5 Oct 2009, Greg Smith wrote:

> On Sun, 4 Oct 2009, Gerhard Wiesinger wrote:
>
>>> On Fri, 2 Oct 2009, Scott Marlowe wrote:
>>>
>>>> I found that lowering checkpoint completion target was what helped.
>>>> Does that seem counter-intuitive to you?
>>>
>> I set it to 0.0 now.
>
> If you set that to 0.0, the whole checkpoint spreading logic doesn't apply
> like it's supposed to. I'm not sure what the results you posted mean now. If
> you had it set to 0 and saw a bad spike (which is how I read your message),
> I'd say "yes, that's what happens when you do reduce that parameter, so don't
> do that". If you meant something else please clarify.

I think the problem is that it is done at checkpoint time (whether spread or not). It should already have been done by the bgwriter.

> Thanks for the dtrace example, I suggested we add those checkpoint probes in
> there and someone did, but I hadn't seen anybody use them for anything yet.

I think more probes (e.g. on the different writing conditions, like writing from the bgwriter or at a checkpoint) would be interesting here.

>> Bug1: usage_count is IMHO not consistent
>
> It's a bit hack-ish, but the changes made to support multiple buffer use
> strategies introduced by the "Make large sequential scans and VACUUMs work in
> a limited-size ring" commit are reasonable even if they're not as consistent
> as we'd like. Those changes were supported by benchmarks proving their
> utility, which always trump theoretical "that shouldn't work better!" claims
> when profiling performance.
>
> Also, they make sense to me, but I've spent a lot of time staring at
> pg_buffercache output to get a feel for what shows up in there under various
> circumstances. That's where I'd suggest you go if this doesn't seem right to
> you; run some real database tests and use pg_buffercache to see what's inside
> the cache when you're done. What's in there and what I expected to be in
> there weren't always the same thing, and it's interesting to note how that
> changes as shared_buffers increases. I consider some time studying that a
> pre-requisite to analyzing performance of this code.

I have analyzed pg_buffercache in parallel (query every second, see below) to see what happens, but in some ways I didn't see the expected results with the usage_counts. Therefore I analyzed the code and found, IMHO, the problem with usage_count and buffer reallocation. Since the code change is also fairly new (I think it was 05/2009), it might be that you tested before it ...

BTW: Is it possible to get everything in pg_class over all databases as admin?

>> Bug2: Double iteration of buffers
>> As you can see in the calling tree below, there is a double iteration over the
>> buffers involved. This might be a major performance bottleneck.
>
> Hmmm, this might be a real bug causing scans through the buffer cache to go
> twice as fast as intended.

That's not twice, O(2*n)=O(n); that's a factor of n*n (outer and inner loop iteration), which means it is O(n^2) overall, which is IMHO too much.

> Since the part you suggest is doubled isn't very
> intensive or called all that often, there's no way it can be a major issue
> though.

It is a major issue since it is O(n^2) and not O(n). E.g. with 2GB of shared buffers we have 262144 blocks and 68719476736 calls, which is far too much.

> That's based on knowing what the code does and how much it was
> called, as well as some confidence that if it were really a *major* problem,
> it would have shown up on the extensive benchmarks done on all the code paths
> you're investigating.
The problem might be hidden for the following reasons:
1.) Buffer counts are so low that even n^2 is low for today's machines
2.) The code is not often called in that way
3.) The backends write out the pages, so that the code is never executed
4.) ...

>> BTW: Are there some tests available showing how fast a buffer cache hit is and how
>> fast a disk cache hit is (not in the buffer cache but in the disk cache)? I ask
>> because a lot of locking is involved in the code.
>
> I did some once but didn't find anything particularly interesting about the
> results. Since you seem to be on a research tear here, it would be helpful
> to have a script to test that out available; I wasn't able to release mine,
> and something dtrace based would probably be better than the approach I used
> (I threw a bunch of gettimeofday calls into the logs and post-processed them
> with a script).

Do you have an idea where one should set tracepoints inside and outside PostgreSQL?

>> BTW2: Oracle buffercache and background writer strategy is also
>> interesting.
>
> As a rule, we don't post links to other database implementation details here,
> as those can have patented design details we'd prefer not to intentionally
> re-implement. Much of Oracle's design here doesn't apply here anyway, as it
> was done in the era when all of their writes were synchronous. That required
> them to worry about doing a good job on some things in their background
> writer that we shrug off and let OS writes combined with fsync handle
> instead.

Ok, no problem.

Ciao,
Gerhard

--
http://www.wiesinger.com/

SELECT
  CASE WHEN datname IS NULL THEN pg_buffercache.reldatabase::text ELSE datname END AS database,
  CASE WHEN spcname IS NULL THEN pg_buffercache.reltablespace::text ELSE spcname END AS tablespace,
  CASE WHEN relname IS NULL THEN pg_buffercache.relfilenode::text ELSE relname END AS relation,
  CASE WHEN relkind IS NULL THEN pg_buffercache.relfilenode::text ELSE relkind END AS relkind,
  usagecount > 0 AS usagecount_gt_0,
  isdirty,
  MIN(relblocknumber) AS min_blocknumber,
  MAX(relblocknumber) AS max_blocknumber,
  ROUND(AVG(relblocknumber),2) AS avg_blocknumber,
  ROUND(STDDEV(relblocknumber),2) AS stddev_blocknumber,
  COUNT(*) AS count
FROM pg_buffercache
LEFT JOIN pg_class ON pg_buffercache.relfilenode = pg_class.oid
LEFT JOIN pg_tablespace ON pg_buffercache.reltablespace = pg_tablespace.oid
LEFT JOIN pg_database ON pg_buffercache.reldatabase = pg_database.oid
WHERE isdirty = true
GROUP BY
  CASE WHEN datname IS NULL THEN pg_buffercache.reldatabase::text ELSE datname END,
  CASE WHEN spcname IS NULL THEN pg_buffercache.reltablespace::text ELSE spcname END,
  CASE WHEN relkind IS NULL THEN pg_buffercache.relfilenode::text ELSE relkind END,
  CASE WHEN relname IS NULL THEN pg_buffercache.relfilenode::text ELSE relname END,
  usagecount > 0,
  isdirty
--HAVING relkind = 'r'
ORDER BY database, tablespace, relkind DESC, relation, usagecount > 0, isdirty;
On Mon, 5 Oct 2009, Gerhard Wiesinger wrote:

> I think the problem is that it is done at checkpoint time (whether
> spread or not). It should already have been done by the bgwriter.

This is pretty simple: if you write things before checkpoint time, you'll end up re-writing a percentage of the blocks if they're re-dirtied before the checkpoint actually happens. The checkpoint itself is always the most efficient time to write something out. People think that the background writer should do more, but it can't without generating more writes than if you instead focused on spreading the checkpoints out instead. This is why the only work the BGW does try to do is writing out blocks that it's pretty sure are going to be evicted very soon (in the next 200ms, or whatever its cycle time is set to), to minimize the potential for mistakes.

The design errs a bit on the side of doing too little because it is paranoid about not doing wasted work, and that implementation always beat one where the background writer was more aggressive in benchmarks. This is hard for people to accept, but there were three of us running independent tests to improve things here by the end of 8.3 development and everybody saw similar results as far as the checkpoint spreading approach being the right one. At the time the patch was labeled "load distributed checkpoint" and if I had more time today I'd try and find the more interesting parts of that discussion to highlight them.

> BTW: Is it possible to get everything in pg_class over all databases as
> admin?

Scott's message at http://archives.postgresql.org/pgsql-general/2009-09/msg00986.php summarizes the problem nicely, and I suggested my workaround for it at http://archives.postgresql.org/pgsql-general/2009-09/msg00984.php

>>> Bug2: Double iteration of buffers
>>> As you can see in the calling tree below, there is a double iteration over the
>>> buffers involved.
>>
>> Hmmm, this might be a real bug causing scans through the buffer cache to go
>> twice as fast as intended.
>
> That's not twice, O(2*n)=O(n); that's a factor of n*n (outer and inner loop
> iteration), which means it is O(n^2) overall, which is IMHO too much.

I follow what you mean, didn't notice that. SyncOneBuffer isn't an O(n) operation; it's O(1). So I'd think that the potential bug here turns into an O(n) issue then, given it's the routine being called n times. This seems like a job for "dump things to the log file" style debugging. If I can reproduce an actual bug here it sounds like a topic for the hackers list outside of this discussion.

> The problem might be hidden for the following reasons:
> 1.) Buffer counts are so low that even n^2 is low for today's machines
> 2.) The code is not often called in that way
> 3.) The backends write out the pages, so that the code is never executed

(2) was the reason I figured it might have escaped notice. It's really not called that often in a way that would run into the problem you think is there.

> Do you have an idea where one should set tracepoints inside and outside
> PostgreSQL?

I think you'd want to instrument BufferAlloc inside bufmgr.c to measure what you're after.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD