Re: Improvement of checkpoint IO scheduler for stable transaction responses - Mailing list pgsql-hackers

From KONDO Mitsumasa
Subject Re: Improvement of checkpoint IO scheduler for stable transaction responses
Date
Msg-id 51E8F080.4040506@lab.ntt.co.jp
In response to Re: Improvement of checkpoint IO scheduler for stable transaction responses  (Greg Smith <greg@2ndQuadrant.com>)
Responses Re: Improvement of checkpoint IO scheduler for stable transaction responses
List pgsql-hackers
(2013/07/19 0:41), Greg Smith wrote:
> On 7/18/13 11:04 AM, Robert Haas wrote:
>> On a system where fsync is sometimes very very slow, that
>> might result in the checkpoint overrunning its time budget - but SO
>> WHAT?
>
> Checkpoints provide a boundary on recovery time.  That is their only purpose.
> You can always do better by postponing them, but you've now changed the agreement
> with the user about how long recovery might take.
Recently, users for whom system availability is important have been running
synchronous replication clusters. And, as Robert says, users who cannot build a
cluster system simply will not turn on this feature via the GUC.

When IO becomes busy during fsync(), my patch does not pile additional IO load
on top of the fsync(). In fact, this is the same structure the OS writeback
code uses. I read the kernel source, fs/fs-writeback.c in
linux-2.6.32-358.0.1.el6, which is the latest RHEL 6.4 kernel; wb_writeback()
throttles disk IO inside the OS writeback path. Please see the source code
below: when the OS thinks IO is busy, it bails out rather than issuing more
writes.

fs/fs-writeback.c @wb_writeback()
 623                 /*
 624                  * For background writeout, stop when we are below the
 625                  * background dirty threshold
 626                  */
 627                 if (work->for_background && !over_bground_thresh())
 628                         break;
 629
 630                 wbc.more_io = 0;
 631                 wbc.nr_to_write = MAX_WRITEBACK_PAGES;
 632                 wbc.pages_skipped = 0;
 633
 634                 trace_wbc_writeback_start(&wbc, wb->bdi);
 635                 if (work->sb)
 636                         __writeback_inodes_sb(work->sb, wb, &wbc);
 637                 else
 638                         writeback_inodes_wb(wb, &wbc);
 639                 trace_wbc_writeback_written(&wbc, wb->bdi);
 640                 work->nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
 641                 wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
 642
 643                 /*
 644                  * If we consumed everything, see if we have more
 645                  */
 646                 if (wbc.nr_to_write <= 0)
 647                         continue;
 648                 /*
 649                  * Didn't write everything and we don't have more IO, bail
 650                  */
 651                 if (!wbc.more_io)
 652                         break;
 653                 /*
 654                  * Did we write something? Try for more
 655                  */
 656                 if (wbc.nr_to_write < MAX_WRITEBACK_PAGES)
 657                         continue;
 658                 /*
 659                  * Nothing written. Wait for some inode to
 660                  * become available for writeback. Otherwise
 661                  * we'll just busyloop.
 662                  */
 663                 spin_lock(&inode_lock);
 664                 if (!list_empty(&wb->b_more_io)) {
 665                         inode = list_entry(wb->b_more_io.prev,
 666                                            struct inode, i_list);
 667                         trace_wbc_writeback_wait(&wbc, wb->bdi);
 668                         inode_wait_for_writeback(inode);
 669                 }
 670                 spin_unlock(&inode_lock);
 671         }
 672
 673         return wrote;
 

Please look especially at lines 631, 651, and 656. MAX_WRITEBACK_PAGES is 1024
pages (1024 * 4096 bytes = 4 MB). The OS writeback scheduler never writes more
than MAX_WRITEBACK_PAGES in one pass, because a larger batch would make IO
busy. And if it could not write anything at all, the OS judges that it must
wait for IO performance to recover. This is the same logic as my patch.
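
To make the parallel concrete, the pacing in my patch works roughly like the
simplified sketch below. This is not the actual patch code: the function names
and the 1-second threshold are only illustrative. After each fsync(), if the
call took a long time, the checkpointer treats the disk as busy and sleeps for
about as long as the fsync took, so that IO performance can recover -- the same
back-off rule as wb_writeback() above.

#include <sys/time.h>
#include <unistd.h>

#define FSYNC_BUSY_THRESHOLD_USEC 1000000L   /* illustrative: 1 second */

/* Elapsed microseconds between two gettimeofday() samples. */
static long
elapsed_usec(const struct timeval *start, const struct timeval *end)
{
    return (end->tv_sec - start->tv_sec) * 1000000L +
           (end->tv_usec - start->tv_usec);
}

/* fsync one checkpointed file, then back off if the disk looks busy. */
static void
checkpoint_fsync_with_backoff(int fd)
{
    struct timeval start, end;
    long spent;

    gettimeofday(&start, NULL);
    fsync(fd);                  /* blocks for a long time when IO is busy */
    gettimeofday(&end, NULL);

    spent = elapsed_usec(&start, &end);
    if (spent > FSYNC_BUSY_THRESHOLD_USEC)
        sleep((unsigned int) (spent / 1000000L));   /* let IO recover */
}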

In addition, you have said that setting a large checkpoint_timeout or
checkpoint_completion_target improves performance, but is that true in all
cases? Since dirty buffers that have aged 30 seconds or more are written out at
intervals of 5 seconds, the writes are split into many small bursts, and they
can degrade into inefficient random writes. In the worst case, when a 200 ms
sleep is inserted before each write, only 25 pages (25 * 8 kB = 200 kB) can be
written in 5 seconds, which I think is very inefficient writing. When
checkpoint_completion_target is actually enlarged, performance may therefore
fall in some cases; I believe this is because the last fsync becomes heavy as a
result of writing out too slowly.
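
The arithmetic behind those numbers, as a back-of-the-envelope check (assuming
the standard 8 kB PostgreSQL block size):

#include <stdio.h>

int main(void)
{
    const double interval_sec = 5.0;   /* write interval for aged buffers */
    const double sleep_sec    = 0.2;   /* 200 ms sleep inserted per write */
    const double page_kb      = 8.0;   /* standard PostgreSQL block size  */

    double writes = interval_sec / sleep_sec;   /* 25 writes per interval */
    double kb     = writes * page_kb;           /* 200 kB per interval    */

    printf("%.0f pages (%.0f kB) per %.0f-second interval\n",
           writes, kb, interval_sec);
    return 0;
}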

I would like you to give me an itemized list of tests that would serve as proof
of my patch, because DBT-2 benchmarks take a lot of time, about 3 - 4 hours per
setting. Of course, I think it is important to obtain your consent.

Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center


