Re: Load distributed checkpoint - Mailing list pgsql-hackers

From: Takayuki Tsunakawa
Subject: Re: Load distributed checkpoint
Date:
Msg-id: 02a301c725ac$fbd6f600$19527c0a@OPERAO
In response to: Load distributed checkpoint (ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp>)
List: pgsql-hackers
From: "Greg Smith" <gsmith@gregsmith.com>
> This is actually a question I'd been meaning to throw out myself to this
> list.  How hard would it be to add an internal counter to the buffer
> management scheme that kept track of the current number of dirty pages?
> I've been looking at the bufmgr code lately trying to figure out how to
> insert one as part of building an auto-tuning bgwriter, but it's unclear
> to me how I'd lock such a resource properly and scalably.  I have a
> feeling I'd be inserting a single-process locking bottleneck into that
> code with any of the naive implementations I considered.

To put it in an extreme way, how about making bgwriter count the dirty
buffers by periodically scanning all the buffers?  Do you know the book
"Principles of Transaction Processing"?  Jim Gray was one of its
reviewers.


http://www.amazon.com/gp/aa.html?HMAC=&CartId=&Operation=ItemLookup&&ItemId=1558604154&ResponseGroup=Request,Large,Variations&bStyle=aaz.jpg&MerchantId=All&isdetail=true&bsi=Books&logo=foo&Marketplace=us&AssociateTag=pocketpc
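
Just to illustrate what I mean, here is a minimal sketch of such a
scan.  This is not actual PostgreSQL code; NBuffers, BufferDescriptors
and BM_DIRTY mirror the bufmgr names, but the locking is simplified
because an approximate count should be enough for tuning purposes.

static int
CountDirtyBuffers(void)
{
    int         num_dirty = 0;
    int         i;

    for (i = 0; i < NBuffers; i++)
    {
        BufferDesc *bufHdr = &BufferDescriptors[i];

        /*
         * Reading the flag word without the buffer header spinlock
         * gives only an approximate count, but no shared lock is
         * held across the scan, so there is no central bottleneck.
         */
        if (bufHdr->flags & BM_DIRTY)
            num_dirty++;
    }

    return num_dirty;
}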

In chapter 8, the author describes fuzzy checkpoints combined with the
two-checkpoint approach.  In his explanation, the recovery manager
(which would be bgwriter in PostgreSQL) scans the buffers and records
the list of dirty buffers at each checkpoint.  If I understand
correctly, this wouldn't need any locking in PostgreSQL.  The recovery
manager then performs the next checkpoint after writing out those
dirty buffers.  With the two-checkpoint approach, crash recovery
starts redoing from the second-to-last checkpoint.  The two-checkpoint
approach is described in Jim Gray's book, too, but neither book refers
to how the recovery manager should tune the speed of writing.
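
If it helps, the cycle the book describes looks roughly like the
following.  Every name here is a hypothetical placeholder of mine, not
a PostgreSQL API; the extern stubs only mark where real work would go.

typedef struct
{
    int     nbuffers;
    int    *buffer_ids;     /* ids of buffers dirty at snapshot time */
} DirtyList;

extern DirtyList *SnapshotDirtyBuffers(void);   /* scan, no locking */
extern void WriteCheckpointRecord(const DirtyList *list);
extern void WriteBuffersGradually(const DirtyList *list);
extern void WaitForNextCheckpointTime(void);

static void
CheckpointCycle(void)
{
    for (;;)
    {
        /* Record which buffers are dirty right now; a slightly stale
         * list is fine, so no locking should be required. */
        DirtyList  *list = SnapshotDirtyBuffers();

        /* Log the checkpoint record.  After a crash, redo starts from
         * the second-to-last record, so every buffer in this list must
         * reach disk before the next record is written. */
        WriteCheckpointRecord(list);

        /* Write out exactly the buffers captured above, spread over
         * the interval; how fast to write is the tuning question the
         * books leave open. */
        WriteBuffersGradually(list);

        WaitForNextCheckpointTime();
    }
}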


> slightly different from the proposals here.  What if all the database page
> writes (background writer, buffer eviction, or checkpoint scan) were
> counted and periodic fsync requests sent to the bgwriter based on that?
> For example, when I know I have a battery-backed caching controller that
> will buffer 64MB worth of data for me, if I forced an fsync after every
> 6000 8K writes, no single fsync would get stuck waiting for the disk to
> write for longer than I'd like.

That seems interesting.
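
To make the arithmetic concrete, 6000 writes x 8KB is about 47MB,
safely below the 64MB of controller cache.  A small self-contained
sketch of that pacing (write_page_paced and the constants are only my
illustration, not anything in the tree):

#include <unistd.h>
#include <sys/types.h>

#define BLCKSZ           8192
#define PAGES_PER_FSYNC  6000   /* ~47MB, under the 64MB cache */

static int  pages_since_fsync = 0;

static void
write_page_paced(int fd, const void *page, off_t offset)
{
    (void) pwrite(fd, page, BLCKSZ, offset);  /* error handling omitted */

    if (++pages_since_fsync >= PAGES_PER_FSYNC)
    {
        /* The controller cache can absorb everything written since
         * the last fsync, so no single fsync stalls for long. */
        (void) fsync(fd);
        pages_since_fsync = 0;
    }
}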

> You can do sync writes with perfectly good performance on systems with a
> good battery-backed cache, but I think you'll get creamed in comparisons
> against MySQL on IDE disks if you start walking down that path; since
> right now a fair comparison with similar logging behavior is an even
> match there, that's a step backwards.

I wonder what characteristics SATA disks have compared to IDE ones.
Recent PCs are equipped with SATA disks, aren't they?  How do you feel
your approach would compare to MySQL on IDE disks?

> Also on the topic of sync writes to the database proper:  wouldn't using
> O_DIRECT for those be potentially counter-productive?  I was under the
> impression that one of the behaviors counted on by Postgres was that data
> evicted from its buffer cache, eventually intended for writing to disk,
> was still kept around for a bit in the OS buffer cache.  A subsequent read
> because the data was needed again might find the data already in the OS
> buffer, therefore avoiding an actual disk read; that substantially reduces
> the typical penalty for the database engine making a bad choice on what to
> evict.  I fear a move to direct writes would put more pressure on the LRU
> implementation to be very smart, and that's code that you really don't
> want to be more complicated.

I'm worried about this, too.
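
For what it's worth, the difference is easy to see in a small
Linux-specific program (again only an illustration, not PostgreSQL
code).  Note that O_DIRECT also forces block-aligned buffers on the
caller:

#define _GNU_SOURCE             /* O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
    void   *buf;
    int     fd;

    /* O_DIRECT requires block-aligned memory and transfer sizes. */
    if (posix_memalign(&buf, 4096, 8192) != 0)
        return 1;

    /* Direct read: always goes to the device and bypasses the OS
     * page cache.  (Error handling omitted for brevity.) */
    fd = open("datafile", O_RDONLY | O_DIRECT);
    (void) pread(fd, buf, 8192, 0);
    close(fd);

    /* Buffered read: may be served from the OS page cache, e.g. a
     * page the database evicted only moments ago. */
    fd = open("datafile", O_RDONLY);
    (void) pread(fd, buf, 8192, 0);
    close(fd);

    free(buf);
    return 0;
}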




