Re: Load distributed checkpoint - Mailing list pgsql-hackers

From Takayuki Tsunakawa
Subject Re: Load distributed checkpoint
Date
Msg-id 011c01c7249a$4fa27980$19527c0a@OPERAO
In response to Load distributed checkpoint  (ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp>)
List pgsql-hackers
On 12/20/06, Takayuki Tsunakawa <tsunakawa.takay@jp.fujitsu.com> wrote:
> > [Conclusion]
> > I believe that the problem cannot be solved in a real sense by
> > avoiding fsync/fdatasync().  We can't ignore what commercial databases
> > have done so far.  The kernel does as much as he likes when PostgreSQL
> > requests him to fsync().
 

From: Inaam Rana
> I am new to the community and am very interested in the tests that you have done. I am also working on resolving the sudden IO spikes at checkpoint time. I agree with you that fsync() is the core issue here.
 
Thank you for understanding my bad English correctly.  Yes, what I have been insisting is that it is necessary to avoid fsync()/fdatasync() and to use O_SYNC (plus O_DIRECT if supported on the target platform) to really eliminate the big spikes.
One sentence in my earlier mail contained a small mistake:
 
"I believe that the problem cannot be solved in a real sense by avoiding fsync/fdatasync()."
 
The correct sentence is:
 
"I believe that the problem cannot be solved in a real sense without avoiding fsync/fdatasync()."
 

> Being a new member I was wondering if someone on this list has done testing with O_DIRECT and/or O_SYNC for datafiles as that seems to be the most logical way of dealing with fsync() flood at checkpoint time. If so, I'll be very interested in the results.
 
Could you look at the mail I sent on Dec 18?  The content was so long that I zipped all of it and attached it to the mail.  I just performed the same test, simply adding O_SYNC to open() in mdopen() and another function in md.c.  I could not get it running with O_DIRECT, because O_DIRECT requires the shared buffers to be aligned on a sector-size boundary.  To perform an O_DIRECT test, a little more modification is necessary in the code where the shared buffers are allocated.
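
To illustrate the alignment requirement, here is a minimal sketch (not the actual md.c or shared-buffer code).  It allocates a sector-aligned block pool with posix_memalign() and opens a file with O_SYNC.  The names alloc_aligned_blocks(), is_sector_aligned(), open_datafile_sync() and write_block_sync() are hypothetical, and the 512-byte sector size and 8192-byte block size are assumptions for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCK_SIZE   8192  /* assumed page size for this sketch */
#define SECTOR_ALIGN 512   /* assumed sector size; real devices may differ */

/* Allocate nblocks zeroed pages whose start address is sector-aligned,
 * as O_DIRECT would require.  Returns NULL on failure. */
void *alloc_aligned_blocks(size_t nblocks)
{
    void *p = NULL;

    assert(nblocks > 0);
    if (posix_memalign(&p, SECTOR_ALIGN, nblocks * BLOCK_SIZE) != 0)
        return NULL;
    memset(p, 0, nblocks * BLOCK_SIZE);
    return p;
}

/* True if the address is a multiple of the assumed sector size. */
int is_sector_aligned(const void *p)
{
    return ((uintptr_t) p % SECTOR_ALIGN) == 0;
}

/* Open a data file so that each write returns only after the data
 * has been handed to stable storage (O_SYNC), with no later fsync(). */
int open_datafile_sync(const char *path)
{
    return open(path, O_RDWR | O_CREAT | O_SYNC, 0600);
}

/* Write one page at its block offset; returns bytes written or -1. */
ssize_t write_block_sync(int fd, const void *buf, unsigned blockno)
{
    return pwrite(fd, buf, BLOCK_SIZE, (off_t) blockno * BLOCK_SIZE);
}
```

Because BLOCK_SIZE is a multiple of SECTOR_ALIGN here, every page in the pool is aligned once the first one is.  Adding O_DIRECT to the open() flags is platform-specific and is deliberately left out of this sketch.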
The result of the O_SYNC test was bad, but that is just a starting point.  We need some of the improvements that commercial databases have made.  I think some approaches we should take are:
 
(1) two-checkpoint (described in Jim Gray's textbook "Transaction Processing: Concepts and Techniques")
(2) what Oracle suggests in its manual (see my previous mails)
(3) write multiple contiguous buffers with one write() to decrease the count of write() calls
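
As a rough illustration of approach (3), the sketch below groups a sorted list of dirty block numbers into runs of consecutive blocks, so that each run could be issued as one write call instead of one call per block.  BlockRun and coalesce_blocks() are hypothetical names, and the input is assumed to be sorted and free of duplicates.

```c
#include <assert.h>
#include <stddef.h>

/* A run of consecutive blocks: [start, start + count). */
typedef struct
{
    unsigned start;
    unsigned count;
} BlockRun;

/* Group sorted, distinct block numbers into runs of consecutive blocks.
 * "out" must have room for n entries.  Returns the number of runs,
 * which is the number of write calls that would be needed. */
size_t coalesce_blocks(const unsigned *blocks, size_t n, BlockRun *out)
{
    size_t nruns = 0;

    assert(n == 0 || (blocks != NULL && out != NULL));
    for (size_t i = 0; i < n; i++)
    {
        if (nruns > 0 &&
            out[nruns - 1].start + out[nruns - 1].count == blocks[i])
        {
            /* Block directly follows the current run: extend it. */
            out[nruns - 1].count++;
        }
        else
        {
            /* Gap found: start a new run. */
            out[nruns].start = blocks[i];
            out[nruns].count = 1;
            nruns++;
        }
    }
    return nruns;
}
```

For dirty blocks 3, 4, 5, 9, 10 and 20, this yields three runs, so three write calls instead of six.  Since the pages of a run are generally not adjacent in shared buffers, a gathering call such as writev() would be one way to issue each run.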
 
> As mentioned in this thread that a single bgwriter with O_DIRECT will not be able to keep pace with cleaning effort causing backend writes. I think (i.e. IMHO) multiple bgwriters and/or AsyncIO with O_DIRECT can resolve this issue.
 
I agree with you.  Oracle provides a parameter called DB_WRITER_PROCESSES to set the number of database writer processes.  Oracle also provides asynchronous I/O to solve the problem you describe.  Please see section 10.3.9 of the following page:
 
 
> Talking of bgwriter_* parameters I think we are missing a crucial internal counter i.e. number of dirty pages. How much work bgwriter has to do at each wakeup call should be a function of total buffers and currently dirty buffers. Relying on both these values instead of just one static NBuffers should allow bgwriter to adapt more quickly to workload changes and ensure that not much work is accumulated for checkpoint.
 
I agree with you in the sense that the current bgwriter is a bit careless about the system load.  I believe that PostgreSQL should be gentler to OLTP transactions and, as a result, to the many users of the system.  I think the speed of WAL accumulation should also be taken into account.  Let's list the problems and ideas.
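
To make the idea of a load-aware bgwriter concrete, here is one possible sketch of a per-wakeup quota that depends on both the total and the dirty buffer counts.  This is not the existing bgwriter algorithm.  The function bgwriter_quota() and its min_pages/max_pages parameters are hypothetical, and the linear scaling is only an assumption for illustration.  It does not yet account for the speed of WAL accumulation.

```c
#include <assert.h>

/* Pages to write in one bgwriter round, scaled linearly between
 * min_pages and max_pages by the fraction of buffers that are dirty,
 * and never more than the number of dirty buffers. */
unsigned bgwriter_quota(unsigned total_buffers, unsigned dirty_buffers,
                        unsigned min_pages, unsigned max_pages)
{
    unsigned long long span;
    unsigned quota;

    assert(min_pages <= max_pages);
    if (total_buffers == 0)
        return 0;
    if (dirty_buffers > total_buffers)
        dirty_buffers = total_buffers;

    span = max_pages - min_pages;
    quota = min_pages + (unsigned) (span * dirty_buffers / total_buffers);

    /* There is nothing to gain from a quota above the dirty count. */
    return quota < dirty_buffers ? quota : dirty_buffers;
}
```

With 1000 buffers and a range of 5 to 105 pages per round, a half-dirty pool gives a quota of 55 pages, while a fully dirty pool gives 105.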
 
--
