Re: Load distributed checkpoint - Mailing list pgsql-hackers
From: Takayuki Tsunakawa
Subject: Re: Load distributed checkpoint
Msg-id: 02a301c725ac$fbd6f600$19527c0a@OPERAO
In response to: Load distributed checkpoint (ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp>)
List: pgsql-hackers
From: "Greg Smith" <gsmith@gregsmith.com>
> This is actually a question I'd been meaning to throw out myself to this
> list. How hard would it be to add an internal counter to the buffer
> management scheme that kept track of the current number of dirty pages?
> I've been looking at the bufmgr code lately trying to figure out how to
> insert one as part of building an auto-tuning bgwriter, but it's unclear
> to me how I'd lock such a resource properly and scalably. I have a
> feeling I'd be inserting a single-process locking bottleneck into that
> code with any of the naive implementations I considered.

To put it in an extreme way, how about making bgwriter count the dirty buffers by periodically scanning all the buffers?

Do you know the book "Principles of Transaction Processing"? Jim Gray was one of the reviewers of this book.

http://www.amazon.com/gp/aa.html?HMAC=&CartId=&Operation=ItemLookup&&ItemId=1558604154&ResponseGroup=Request,Large,Variations&bStyle=aaz.jpg&MerchantId=All&isdetail=true&bsi=Books&logo=foo&Marketplace=us&AssociateTag=pocketpc

In chapter 8, the author describes fuzzy checkpoints combined with the two-checkpoint approach. In his explanation, the recovery manager (which would be bgwriter in PostgreSQL) scans the buffers and records the list of dirty buffers at each checkpoint. This wouldn't need any locking in PostgreSQL, if I understand correctly. Then, the recovery manager performs the next checkpoint after writing those dirty buffers. In the two-checkpoint approach, crash recovery starts redoing from the second-to-last checkpoint. Two-checkpoint is described in Jim Gray's book, too. But they don't refer to how the recovery manager tunes the speed of writing.

> slightly different from the proposals here. What if all the database page
> writes (background writer, buffer eviction, or checkpoint scan) were
> counted and periodic fsync requests sent to the bgwriter based on that?
> For example, when I know I have a battery-backed caching controller that
> will buffer 64MB worth of data for me, if I forced a fsync after every
> 6000 8K writes, no single fsync would get stuck waiting for the disk to
> write for longer than I'd like.

That seems interesting.

> You can do sync writes with perfectly good performance on systems with a
> good battery-backed cache, but I think you'll get creamed in comparisons
> against MySQL on IDE disks if you start walking down that path; since
> right now a fair comparison with similar logging behavior is an even
> match there, that's a step backwards.

I wonder what characteristics SATA disks have compared to IDE. Recent PCs are equipped with SATA disks, aren't they? How do you feel your approach compares to MySQL on IDE disks?

> Also on the topic of sync writes to the database proper: wouldn't using
> O_DIRECT for those be potentially counter-productive? I was under the
> impression that one of the behaviors counted on by Postgres was that data
> evicted from its buffer cache, eventually intended for writing to disk,
> was still kept around for a bit in the OS buffer cache. A subsequent read
> because the data was needed again might find the data already in the OS
> buffer, therefore avoiding an actual disk read; that substantially reduces
> the typical penalty for the database engine making a bad choice on what to
> evict. I fear a move to direct writes would put more pressure on the LRU
> implementation to be very smart, and that's code that you really don't
> want to be more complicated.

I'm worried about this, too.