Thread: ACM Paper relevant to our buffer algorithm
Incidentally, I found this paper in ACM SIGMETRICS 1992 covering more or less precisely the same algorithm we're using for our clock sweep. I haven't quite digested it yet myself, so I'm not sure what its conclusions about weights tell us to do with our buffer usage counter.

I put a copy up for download since, even though copying is permitted, I don't know how to find a public link. It's kind of big, so please download it and read it locally; don't try to read it from my machine:

http://stark.xeocode.com/~stark/p35-nicola.pdf.gz

Regarding copyright, the paper says:

  Permission to copy without fee all or part of this material is granted
  provided that the copies are not made or distributed for direct
  commercial advantage, the ACM copyright notice and the title of the
  publication and its date appear, and notice is given that copying is by
  permission of the Association for Computing Machinery. To copy
  otherwise, or to republish, requires a fee and/or specific permission.

-- 
Gregory Stark
EnterpriseDB
http://www.enterprisedb.com
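[For readers unfamiliar with the algorithm under discussion: a minimal sketch of a generalized-clock (GCLOCK) sweep with per-buffer usage counters, in the spirit of what the buffer manager does. All names and the counter cap are illustrative, not PostgreSQL's actual identifiers.]

```python
# GCLOCK sketch: each buffer has a usage counter that is bumped on
# access and decremented as the clock hand passes; a buffer is only
# evicted when its counter has been worn down to zero.

MAX_USAGE = 5  # saturation cap (illustrative value)

class ClockSweep:
    def __init__(self, nbuffers):
        self.usage = [0] * nbuffers
        self.hand = 0

    def touch(self, buf):
        """Called on every buffer hit: bump the usage counter (saturating)."""
        self.usage[buf] = min(self.usage[buf] + 1, MAX_USAGE)

    def evict(self):
        """Advance the hand, decrementing counters, until a zero is found."""
        while True:
            buf = self.hand
            self.hand = (self.hand + 1) % len(self.usage)
            if self.usage[buf] == 0:
                return buf          # victim buffer
            self.usage[buf] -= 1    # give it another lap around the clock
```

The "weights" question above then amounts to how much each touch() should count relative to one pass of the hand.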
Here are some more recent papers that also give good insight into research in this area:

http://www.cs.usask.ca/~wew036/comprehensive.pdf
http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-05-3.pdf

-- 
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
Greg Smith wrote:
> Here are some more recent papers that also give good insight into
> research in this area:
>
> http://www.cs.usask.ca/~wew036/comprehensive.pdf
> http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-05-3.pdf

Nice papers.

What I'd like to see is a paper on precleaning strategies. I tried to google for one but couldn't find any. That would be very relevant to the bgwriter tuning discussions.

I'm still struggling to understand why and how bgwriter increases performance. Under what circumstances, what workload?

The only benefit I can see is that it moves the write() of a page out of the critical path. But as long as the OS cache can absorb the write, it should be very cheap compared to doing real I/O. Apparently the workload that benefits most is an OLTP workload where response times are critical, on a database that doesn't fit in shared_buffers but fits in the OS cache.

Am I missing something? What kind of a test would show the most benefit?

-- 
Heikki Linnakangas
EnterpriseDB
http://www.enterprisedb.com
On Wed, Jul 04, 2007 at 11:09:19AM +0100, Heikki Linnakangas wrote:
> The only benefit I can see is that it moves the write() of a page out of
> the critical path. But as long as the OS cache can absorb the write, it
> should be very cheap compared to doing real I/O. Apparently the workload
> that benefits most is an OLTP workload where response times are
> critical, on a database that doesn't fit in share_buffers, but fits in
> OS cache.

I thought the point was to make checkpoints cheaper. Sure, the OS can probably absorb the write(), but you execute an fsync() shortly after, so you're going to block on that. The point being that by executing the writes earlier, you can get some of the writing done in the background prior to the fsync.

So it would be targeting people with lots of dirty shared buffers, where a checkpoint is going to eat your I/O bandwidth. An fsync will make the OS write everything ASAP.

Have a nice day,
-- 
Martijn van Oosterhout <kleptog@svana.org>  http://svana.org/kleptog/
> From each according to his ability. To each according to his ability
> to litigate.
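[The write()-vs-fsync() asymmetry Martijn describes is easy to see directly. A minimal sketch, not PostgreSQL code: a buffered write normally returns as soon as the data lands in the OS page cache, while fsync() blocks until the data reaches stable storage — the cost a checkpoint pays all at once, and that earlier background writing can spread out.]

```python
# Time many small buffered writes, then the single fsync that makes
# them durable. On most systems the fsync dominates unless the page
# cache has already been written back.
import os
import tempfile
import time

def measure(dirty_pages=1024, page_size=8192):
    """Write dirty_pages page_size-byte 'pages', then fsync; return durations."""
    fd, path = tempfile.mkstemp()
    try:
        data = b"x" * page_size
        t0 = time.perf_counter()
        for _ in range(dirty_pages):
            os.write(fd, data)          # usually absorbed by the page cache
        t1 = time.perf_counter()
        os.fsync(fd)                    # blocks until the data is durable
        t2 = time.perf_counter()
        return t1 - t0, t2 - t1, os.fstat(fd).st_size
    finally:
        os.close(fd)
        os.unlink(path)

write_s, fsync_s, size = measure()
print(f"write(): {write_s:.4f}s  fsync(): {fsync_s:.4f}s  ({size} bytes)")
```

The exact ratio depends heavily on the filesystem and storage, so treat the numbers as qualitative.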
Martijn van Oosterhout wrote:
> On Wed, Jul 04, 2007 at 11:09:19AM +0100, Heikki Linnakangas wrote:
>> The only benefit I can see is that it moves the write() of a page out of
>> the critical path. But as long as the OS cache can absorb the write, it
>> should be very cheap compared to doing real I/O. Apparently the workload
>> that benefits most is an OLTP workload where response times are
>> critical, on a database that doesn't fit in share_buffers, but fits in
>> OS cache.
>
> I thought the point was to make checkpoints cheaper. Sure, the OS can
> probably absorb the write() but you execute a fsync() shortly after so
> you're going to block on that. The point being that by executing the
> writes earlier you can get some of the writing done in the background
> prior to the fsync.

That was the purpose of the bgwriter "all"-scan. It was removed as part of the load distributed checkpoints patch, as it's not needed anymore.

What's the purpose of the lru-scan?

-- 
Heikki Linnakangas
EnterpriseDB
http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes:

> Greg Smith wrote:
> > Here are some more recent papers that also give good insight into
> > research in this area:
> > http://www.cs.usask.ca/~wew036/comprehensive.pdf
> > http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-05-3.pdf
>
> Nice papers.
>
> What I'd like to see is a paper on precleaning strategies. I tried to
> google for one but couldn't find any. That would be very relevant to the
> bgwriter tuning discussions.

I saw some listed, but they were for VM managers, so they may not be perfectly adaptable.

> I'm still struggling to understand why and how bgwriter increases
> performance. Under what circumstances, what workload?
>
> The only benefit I can see is that it moves the write() of a page out of
> the critical path. But as long as the OS cache can absorb the write, it
> should be very cheap compared to doing real I/O.

Well, you can't keep writing indefinitely faster than the I/O subsystem can execute the writes. Eventually the kernel will block your write until a kernel buffer becomes free, i.e., throttle your writes to the actual write bandwidth available.

I think most systems are limited by the latency of reads, though. Reads can't be reordered as well as writes can; they happen synchronously as far as an individual backend is concerned.

-- 
Gregory Stark
EnterpriseDB
http://www.enterprisedb.com
Gregory Stark <gsstark@mit.edu> writes:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> I'm still struggling to understand why and how bgwriter increases
>> performance. Under what circumstances, what workload?
>>
>> The only benefit I can see is that it moves the write() of a page out of
>> the critical path. But as long as the OS cache can absorb the write, it
>> should be very cheap compared to doing real I/O.

> Well you can't keep writing indefinitely faster than the i/o subsystem can
> execute the writes. Eventually the kernel will block your write until a
> kernel buffer becomes free. Ie, throttle your writes to the actual write
> bandwidth available.

Right. Also, a buffer write isn't "merely" a kernel call --- for instance, you might have to flush some more WAL before you can execute the write, and there are cases where you'd have to fsync the write yourself (i.e., if you can't pass it off to the bgwriter). The more of that we can take out of foreground query paths, the better.

			regards, tom lane
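[The WAL flush Tom mentions is the WAL-before-data rule: a dirty page may only be written out once the WAL is durable at least up to that page's last LSN. A toy sketch of the invariant — the names (WAL, write_buffer, etc.) are illustrative, not PostgreSQL's actual API.]

```python
# WAL-before-data in miniature: flushing WAL up to the page's LSN is
# the extra cost that lands in a foreground backend's write path when
# the bgwriter hasn't cleaned the buffer first.

class WAL:
    def __init__(self):
        self.insert_lsn = 0   # end of WAL generated so far
        self.flush_lsn = 0    # how far WAL is durable on disk

    def record(self):
        """Append a WAL record; return its LSN."""
        self.insert_lsn += 1
        return self.insert_lsn

    def flush(self, upto):
        """In reality: write + fsync the WAL segment(s) through 'upto'."""
        self.flush_lsn = max(self.flush_lsn, upto)

def write_buffer(wal, page_lsn, disk):
    """Write out a dirty data page, flushing WAL first if needed."""
    if wal.flush_lsn < page_lsn:
        wal.flush(page_lsn)   # must happen before the page hits disk
    disk.append(page_lsn)     # now the page itself may be written
```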
"Tom Lane" <tgl@sss.pgh.pa.us> writes:
> Gregory Stark <gsstark@mit.edu> writes:
>> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>>> I'm still struggling to understand why and how bgwriter increases
>>> performance. Under what circumstances, what workload?
>>>
>>> The only benefit I can see is that it moves the write() of a page out of
>>> the critical path. But as long as the OS cache can absorb the write, it
>>> should be very cheap compared to doing real I/O.
>
>> Well you can't keep writing indefinitely faster than the i/o subsystem can
>> execute the writes. Eventually the kernel will block your write until a
>> kernel buffer becomes free. Ie, throttle your writes to the actual write
>> bandwidth available.
>
> Right. Also, a buffer write isn't "merely" a kernel call --- for
> instance, you might have to flush some more WAL before you can execute
> the write, and there are cases where you'd have to fsync the write
> yourself (ie, if you can't pass it off to the bgwriter). The more of
> that we can take out of foreground query paths, the better.

So it sounds like a good place to start benchmarking something where bgwriter helps might be a setup that begins with a very large table (1-10M rows), where each transaction deletes a large number of random tuples (~1k rows), possibly waiting briefly before committing to give the dirty pages a chance to be needed before the commit flushes the WAL.

That way each transaction dirties a large number of pages, and the next transaction is likely to need a fresh page and find one which has been dirtied and not had its WAL record flushed. Deletes mean not much WAL will be generated, so WAL won't be a bottleneck and you won't get checkpoints due to checkpoint_segments being reached.

-- 
Gregory Stark
EnterpriseDB
http://www.enterprisedb.com
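[One possible shape of the benchmark sketched above, expressed as generated SQL. The table name, row counts, and sleep interval are all made-up illustrations; run the statements against a scratch database with whatever driver you like.]

```python
# Build the SQL for the delete-heavy workload: each transaction deletes
# ~1k random rows (dirtying many pages, generating little WAL) and
# sleeps briefly before committing.
import random

def setup_sql(nrows=1_000_000):
    """SQL to create and populate the (hypothetical) bench table."""
    return (
        "CREATE TABLE bench (id int PRIMARY KEY, pad text);\n"
        f"INSERT INTO bench SELECT g, repeat('x', 100) "
        f"FROM generate_series(1, {nrows}) g;"
    )

def delete_txn_sql(nrows=1_000_000, ndelete=1000, rng=random):
    """SQL for one transaction deleting ndelete random rows."""
    ids = ", ".join(str(rng.randrange(1, nrows + 1)) for _ in range(ndelete))
    # pg_sleep gives the dirtied pages a chance to be wanted by another
    # backend before this commit flushes the WAL
    return (
        "BEGIN;\n"
        f"DELETE FROM bench WHERE id IN ({ids});\n"
        "SELECT pg_sleep(0.1);\n"
        "COMMIT;"
    )
```

Comparing throughput and response-time tails with bgwriter enabled versus disabled under this load would be the actual test.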