Re: Design proposal: fsync absorb linear slider - Mailing list pgsql-hackers
From: Greg Smith
Subject: Re: Design proposal: fsync absorb linear slider
Date:
Msg-id: 521C7ED6.3010108@2ndQuadrant.com
In response to: Re: Design proposal: fsync absorb linear slider (KONDO Mitsumasa <kondo.mitsumasa@lab.ntt.co.jp>)
List: pgsql-hackers
On 7/29/13 2:04 AM, KONDO Mitsumasa wrote:
> I think that it is almost same as small dirty_background_ratio or
> dirty_background_bytes.

The main difference here is that all writes pushed out this way will be to a single 1GB relation chunk. The odds are better that multiple writes will combine, and that the I/O will involve a lower than average amount of random seeking. Shrinking the size of the write cache, on the other hand, always results in more random seeking.

> The essential improvement is not dirty page size in fsync() but
> scheduling of fsync phase.
> I can't understand why postgres does not consider scheduling of fsync
> phase.

Because it cannot get the sort of latency improvements I think people want. I proved to myself it's impossible during the last 9.2 CF, when I submitted several fsync scheduling changes. By the time you get to the sync phase, on a system that's always writing heavily, there is far too much backlog to cope with. There just isn't enough time left before the checkpoint should end to write everything out. You have to force writes to actual disk to start happening earlier to keep a predictable schedule.

Basically, the longer you go without issuing an fsync, the more uncertainty there is around how long it might take to fire. My proposal lets someone keep all I/O from ever reaching the point where that uncertainty gets so high.

In the simplest case to explain, imagine that a checkpoint includes a 1GB relation segment that is completely dirty in shared_buffers. When a checkpoint hits this, it will have 1GB of I/O to push out. If you have waited this long to fsync the segment, the problem is now too big to fix by checkpoint time. Even if the 1GB of writes are themselves nicely ordered and grouped on disk, the concurrent background activity is going to chop the combination up into more random I/O than the ideal.

Regular consumer disks have a worst-case random I/O throughput of less than 2MB/s. My observed progress rates for such systems show you're lucky to get 10MB/s of writes out of them. So how long will the dirty 1GB in the segment take to write? 1GB @ 10MB/s = 102.4 *seconds*.

And that's exactly what I saw whenever I tried to play with checkpoint sync scheduling. No matter what you do there, periodically you'll hit a segment that has over a minute of dirty data accumulated, and >60 second latency pauses result. By the time you've reached the checkpoint's sync phase, you're dead when you call fsync on that relation. You *must* hit that segment with fsync more often than once per checkpoint to achieve reasonable latency.

With this "linear slider" idea, I might instead tune things so that no segment ever accumulates more than 256MB of writes before being hit with an fsync. I can't guarantee that will work usefully, but the shape of the idea seems to match the problem. (A rough sketch of the bookkeeping I have in mind is further down in this message.)

> Taken together my checkpoint proposal method,
> * write phase
>   - Almost same, but considering fsync phase schedule.
>   - Considering case of background-write in OS, sort buffer before
>     starting checkpoint write.

This cannot work, for the reasons I've outlined here. I guarantee you I will easily find a test workload where it performs worse than what's happening right now. If you want to play with this to learn more about the trade-offs involved, that's fine, but expect me to vote against accepting any change of this form. I would prefer you not submit them, because it will waste a large amount of reviewer time to reach that conclusion yet again. And I'm not going to be that reviewer.
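Just to make the shape of the idea concrete, here is a toy sketch of the per-segment bookkeeping. It is not a patch, just an illustration; the SegWriteState type and the 256MB absorb limit are made-up names standing in for whatever the real bookkeeping and setting would be:

/*
 * Toy illustration only.  Count the checkpoint writes absorbed by one
 * relation segment, and once they cross a limit, fsync that segment
 * early rather than letting everything pile up for the sync phase.
 */
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define ABSORB_LIMIT ((uint64_t) 256 * 1024 * 1024)  /* hypothetical knob */

typedef struct SegWriteState
{
    int      fd;               /* open descriptor for the 1GB segment */
    uint64_t bytes_since_sync; /* writes absorbed since the last fsync */
} SegWriteState;

/* Call after each checkpoint write that lands in this segment. */
static void
absorb_segment_write(SegWriteState *seg, uint64_t nbytes)
{
    seg->bytes_since_sync += nbytes;

    if (seg->bytes_since_sync >= ABSORB_LIMIT)
    {
        /*
         * Push the bounded amount of dirty data out now, so the
         * end-of-checkpoint fsync never faces 1GB at once.
         */
        if (fsync(seg->fd) != 0)
            perror("fsync");
        seg->bytes_since_sync = 0;
    }
}

Whether 256MB is a useful number is exactly the sort of thing the slider is there to let people tune for their workload.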
> * fsync phase
>   - Considering checkpoint schedule and write-phase schedule
>   - Executing separated sync_file_range() and sleep, in final fsync().

If you can figure out how to use sync_file_range() to fine-tune how much fsync work is happening at any one time, that would be useful on all the platforms that support it. I haven't tried it, just because that looked to me like a large job of refactoring the entire fsync absorb mechanism, and I've never had enough funding to take it on. That approach has a lot of good properties, if it could be made to work without a lot of code changes.
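To show the kind of loop I have in mind there, here is a Linux-only toy sketch. It is untested, and the 8MB chunk size and 10ms sleep are made-up pacing numbers, not anything I have benchmarked:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK_BYTES   (8 * 1024 * 1024)     /* write back 8MB at a time */
#define SEGMENT_BYTES (1024 * 1024 * 1024)  /* one 1GB relation segment */

static int
sync_segment_in_chunks(int fd)
{
    off64_t offset;

    for (offset = 0; offset < SEGMENT_BYTES; offset += CHUNK_BYTES)
    {
        /* Start writeback of this chunk; don't wait for it to finish. */
        if (sync_file_range(fd, offset, CHUNK_BYTES,
                            SYNC_FILE_RANGE_WRITE) != 0)
        {
            perror("sync_file_range");
            return -1;
        }

        /* Pause so the checkpoint doesn't saturate the device. */
        usleep(10000);
    }

    /* The final fsync should find most of the segment already on disk. */
    return fsync(fd);
}

The attraction is that most of the segment should already be under writeback by the time the final fsync runs, so its worst case is no longer proportional to a full 1GB of dirty data.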
--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.com