Re: Improvement of checkpoint IO scheduler for stable transaction responses - Mailing list pgsql-hackers

From: KONDO Mitsumasa
Subject: Re: Improvement of checkpoint IO scheduler for stable transaction responses
Date:
Msg-id: 51CAA84C.7030901@lab.ntt.co.jp
In response to: Re: Improvement of checkpoint IO scheduler for stable transaction responses (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses: Re: Improvement of checkpoint IO scheduler for stable transaction responses
List: pgsql-hackers
Thank you for your comments!
>> On Tue, Jun 25, 2013 at 1:15 PM, Heikki Linnakangas
>>> Hmm, so the write patch doesn't do much, but the fsync patch makes the response
>>> times somewhat smoother. I'd suggest that we drop the write patch for now, and
>>> focus on the fsyncs.
The write patch is effective for TPS! I think the delay of the checkpoint writes is
caused by the long fsyncs and the heavy load in the fsync phase, because the disk is
already slow to absorb writes in the write phase. Therefore, the combination of the
write patch and the fsync patch suits better than the write patch alone. I think the
amount of WAL written at the beginning of a checkpoint can indicate the effect of the
write patch.
>>> What checkpointer_fsync_delay_ratio and checkpointer_fsync_delay_threshold
>>> settings did you use with the fsync patch? It's disabled by default.
 
I used these parameters:
  checkpointer_fsync_delay_ratio = 1
  checkpointer_fsync_delay_threshold = 1000ms
As a matter of fact, I used a long sleep for the slow fsyncs.

And the other related parameters are:
  checkpoint_completion_target = 0.7
  checkpoint_smooth_target = 0.3
  checkpoint_smooth_margin = 0.5
  checkpointer_write_delay = 200ms
 

>>> Attached is a quick patch to implement a fixed, 100ms delay between fsyncs, and the
>>> assumption that fsync phase is 10% of the total checkpoint duration. I suspect 100ms
>>> is too small to have much effect, but that happens to be what we have currently in
>>> CheckpointWriteDelay(). Could you test this patch along with yours? If you can test
>>> with different delays (e.g 100ms, 500ms and 1000ms) and different ratios between
>>> the write and fsync phase (e.g 0.5, 0.7, 0.9), to get an idea of how sensitive the
>>> test case is to those settings.
It seems an interesting algorithm! I will test it with the same settings and study the
essence of your patch.
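
Just to check that I understand what I should be testing, my reading of the fixed-delay
idea is roughly like the standalone sketch below: sleep a fixed 100ms between fsyncs,
but only while the checkpoint is still ahead of a schedule in which the fsync phase is
budgeted as the last 10% of the checkpoint target. The function names, constants, and
the static "elapsed" value are only my illustration, not code taken from your patch.

#include <stdio.h>
#include <unistd.h>

#define FSYNC_PHASE_FRACTION 0.1    /* assume fsyncs are the last 10% of the checkpoint */
#define FSYNC_DELAY_USEC     100000 /* fixed 100ms sleep between fsyncs */

/*
 * Fraction of the checkpoint interval that may have elapsed once we have
 * synced "files_synced" out of "files_total" files: the write phase owns
 * [0, 0.9] of the interval and the fsync phase owns [0.9, 1.0].
 */
static double
target_progress(int files_synced, int files_total)
{
    return (1.0 - FSYNC_PHASE_FRACTION) +
        FSYNC_PHASE_FRACTION * (double) files_synced / files_total;
}

static void
sync_files(int files_total, double elapsed_fraction)
{
    for (int i = 0; i < files_total; i++)
    {
        /* fsync() of segment file i would happen here */

        /* Sleep only while we are still ahead of the checkpoint schedule. */
        if (elapsed_fraction < target_progress(i + 1, files_total))
            usleep(FSYNC_DELAY_USEC);
    }
}

int
main(void)
{
    /* Pretend 60% of the checkpoint interval had elapsed when fsyncs started. */
    sync_files(10, 0.6);
    printf("done\n");
    return 0;
}

If that reading is right, the schedule check is the same idea as CheckpointWriteDelay(),
just with every fsync counted as equal progress.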


(2013/06/26 5:28), Heikki Linnakangas wrote:
> On 25.06.2013 23:03, Robert Haas wrote:
>> On Tue, Jun 25, 2013 at 1:15 PM, Heikki Linnakangas
>> <hlinnakangas@vmware.com>  wrote:
>>> I'm not sure it's a good idea to sleep proportionally to the time it took to
>>> complete the previous fsync. If you have a 1GB cache in the RAID controller,
>>> fsyncing a 1GB segment will fill it up. But since it fits in cache, it
>>> will return immediately. So we proceed fsyncing other files, until the cache
>>> is full and the fsync blocks. But once we fill up the cache, it's likely
>>> that we're hurting concurrent queries. ISTM it would be better to stay under
>>> that threshold, keeping the I/O system busy, but never fill up the cache
>>> completely.
>>
>> Isn't the behavior implemented by the patch a reasonable approximation
>> of just that?  When the fsyncs start to get slow, that's when we start
>> to sleep.   I'll grant that it would be better to sleep when the
>> fsyncs are *about* to get slow, rather than when they actually have
>> become slow, but we have no way to know that.
>
> Well, that's the point I was trying to make: you should sleep *before* the fsyncs
> get slow.
Actually, fsync time changes with the progress of the OS's background disk writes, and
we cannot know that progress before the fsync. I think Robert's argument is right.
Please see the following log messages.

* fsync of files whose pages had already been written out to disk:
DEBUG:  00000: checkpoint sync: number=23 file=base/16384/16413.5 time=2.546 msec
DEBUG:  00000: checkpoint sync: number=24 file=base/16384/16413.6 time=3.174 msec
DEBUG:  00000: checkpoint sync: number=25 file=base/16384/16413.7 time=2.358 msec
DEBUG:  00000: checkpoint sync: number=26 file=base/16384/16413.8 time=2.013 msec
DEBUG:  00000: checkpoint sync: number=27 file=base/16384/16413.9 time=1232.535 msec
DEBUG:  00000: checkpoint sync: number=28 file=base/16384/16413_fsm time=0.005 msec

* fsync of files whose pages had not yet been written out to disk very much:
DEBUG:  00000: checkpoint sync: number=54 file=base/16384/16419.8 time=3408.759 msec
DEBUG:  00000: checkpoint sync: number=55 file=base/16384/16419.9 time=3857.075 msec
DEBUG:  00000: checkpoint sync: number=56 file=base/16384/16419.10 time=13848.237 msec
DEBUG:  00000: checkpoint sync: number=57 file=base/16384/16419.11 time=898.836 msec
DEBUG:  00000: checkpoint sync: number=58 file=base/16384/16419_fsm time=0.004 msec
DEBUG:  00000: checkpoint sync: number=59 file=base/16384/16419_vm time=0.002 msec

I think it is wasteful to sleep after every fsync, including the short ones, and fsync
performance also varies with the hardware (the RAID card, the kind and number of disks)
and the OS, so it is difficult to choose a fixed sleep time. For example, with a sleep
proportional to the previous fsync, the 2-3 msec fsyncs in the first log lead to an
equally negligible delay, while the 13848 msec fsync in the second log is followed by a
comparably long back-off. My proposed method will be more adaptive in these cases.
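
To make it concrete, the proportional back-off can be sketched as the standalone program
below. It is simplified: the file name, the helper names, and the way the threshold is
applied are only my illustration, and presumably the real code would live in the md.c
sync path and use pg_usleep rather than a toy main(). After each fsync we measure how
long it took and, if it was slow, sleep for checkpointer_fsync_delay_ratio times that
duration, so fast fsyncs add almost nothing while slow ones give I/O bandwidth back to
the backends.

#include <fcntl.h>
#include <sys/time.h>
#include <unistd.h>

static const double fsync_delay_ratio = 1.0;         /* sleep = ratio * previous fsync time */
static const double fsync_delay_threshold_ms = 1000; /* back off only after slow fsyncs (illustrative) */

static double
now_ms(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

/* fsync one segment file, then back off if the fsync was slow */
static void
sync_one_file_with_backoff(int fd)
{
    double start = now_ms();
    double elapsed;

    fsync(fd);
    elapsed = now_ms() - start;

    if (elapsed > fsync_delay_threshold_ms)
    {
        /* The disk is struggling: yield some I/O bandwidth to the backends. */
        usleep((useconds_t) (fsync_delay_ratio * elapsed * 1000.0));
    }
    /* Fast fsyncs, like the 2-3 msec ones in the first log, cost almost nothing extra. */
}

int
main(void)
{
    int fd = open("backoff_sketch.tmp", O_CREAT | O_WRONLY, 0600);

    (void) write(fd, "x", 1);
    sync_one_file_with_backoff(fd);
    close(fd);
    unlink("backoff_sketch.tmp");
    return 0;
}

With this shape, the total delay added to a checkpoint scales with how badly the storage
is struggling, instead of being a constant per file.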

>> The only feedback we have on how bad things are is how long it took
>> the last fsync to complete, so I actually think that's a much better
>> way to go than any fixed sleep - which will often be unnecessarily
>> long on a well-behaved system, and which will often be far too short
>> on one that's having trouble. I'm inclined to think Kondo-san
>> has got it right.
>
> Quite possible, I really don't know. I'm inclined to first try the simplest thing
> possible, and only make it more complicated if that's not good enough.
> Kondo-san's patch wasn't very complicated, but nevertheless a fixed sleep between
> every fsync, unless you're behind the schedule, is even simpler. In particular,
> it's easier to tie that into the checkpoint scheduler - I'm not sure how you'd
> measure progress or determine how long to sleep unless you assume that every
> fsync is the same.
I think it is important that the fsync phase is as short as possible without freezing
I/O, keeps to the checkpoint schedule, and remains friendly to the executing
transactions. I will try to improve the patch from that point of view. By the way,
running the DBT-2 benchmark takes a long time (it may be four hours), so I hope you
don't mind my late replies very much! :-)

Best Regards,
--
Mitsumasa KONDO
NTT Open Source Software Center


