From: Greg Smith
Subject: Re: Design proposal: fsync absorb linear slider
Msg-id: 521C7ED6.3010108@2ndQuadrant.com
In response to: Re: Design proposal: fsync absorb linear slider (KONDO Mitsumasa <kondo.mitsumasa@lab.ntt.co.jp>)
List: pgsql-hackers
On 7/29/13 2:04 AM, KONDO Mitsumasa wrote:
> I think that it is almost same as small dirty_background_ratio or
> dirty_background_bytes.

The main difference here is that all writes pushed out this way will be 
to a single 1GB relation chunk.  The odds are better that multiple 
writes will combine, and that the I/O will involve a lower than average 
amount of random seeking, whereas shrinking the size of the write cache 
always results in more random seeking.

> The essential improvement is not dirty page size in fsync() but
> scheduling of fsync phase.
> I can't understand why postgres does not consider scheduling of fsync
> phase.

Because it cannot get the sort of latency improvements I think people 
want.  I proved to myself it's impossible during the last 9.2 CF, when 
I submitted several fsync scheduling changes.

By the time you get to the fsync sync phase, on a system that's always 
writing heavily there is way too much backlog to possibly cope with by 
then.  There just isn't enough time left before the checkpoint should 
end to write everything out.  You have to force writes to actual disk to 
start happening earlier to keep a predictable schedule.  Basically, the 
longer you go without issuing an fsync, the more uncertainty there is 
around how long it might take to fire.  My proposal lets someone keep 
all I/O from ever reaching the point where the uncertainty is that high.

In the simplest to explain case, imagine that a checkpoint includes a 
1GB relation segment that is completely dirty in shared_buffers.  When a 
checkpoint hits this, it will have 1GB of I/O to push out.

If you have waited this long to fsync the segment, the problem is now 
too big to fix by checkpoint time.  Even if the 1GB of writes are 
themselves nicely ordered and grouped on disk, the concurrent background 
write activity is going to chop the combination up into more random I/O 
than the ideal.

Regular consumer disks have a worst-case random I/O throughput of less 
than 2MB/s.  My observed progress rates for such systems show you're 
lucky to get 10MB/s of writes out of them.  So how long will the dirty 
1GB in the segment take to write?  1GB @ 10MB/s = 102.4 *seconds*.  And 
that's exactly what I saw whenever I tried to play with checkpoint sync 
scheduling.  No matter what you do there, periodically you'll hit a 
segment that has over a minute of dirty data accumulated, and >60 second 
latency pauses result.  By the time you've reached the checkpoint, 
you're dead when you call fsync on that relation.  You *must* hit that 
segment with fsync more often than once per checkpoint to achieve 
reasonable latency.
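
If you want to measure this on your own hardware, here's a minimal 
standalone sketch of the experiment (file name, block size, and total 
size are all arbitrary).  Note this is the best case, sequential writes 
to one file on an otherwise idle system; random dirty pages plus 
concurrent backend I/O only make the fsync time worse:

/* Dirty 1GB of OS page cache with buffered writes, then time fsync().
 * Best case: sequential data on an idle system, and the kernel may
 * have written some of it back already by the time fsync() runs. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    char buf[8192];
    struct timespec t0, t1;

    memset(buf, 'x', sizeof(buf));

    int fd = open("segment.test", O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) { perror("open"); return 1; }

    /* 131072 x 8KB blocks = 1GB of dirty data */
    for (long i = 0; i < 131072; i++)
        if (write(fd, buf, sizeof(buf)) != sizeof(buf))
        { perror("write"); return 1; }

    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (fsync(fd) != 0) { perror("fsync"); return 1; }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    printf("fsync of 1GB dirty data: %.1f seconds\n",
           (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);

    close(fd);
    unlink("segment.test");
    return 0;
}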

With this "linear slider" idea, I might tune such that no segment will 
ever get more than 256MB of writes before hitting an fsync instead.  I 
can't guarantee that will work usefully, but the shape of the idea seems 
to match the problem.
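
To make the shape of that concrete, here's a hypothetical standalone 
model of the bookkeeping.  None of these names exist in the backend, 
and the real version would live in the checkpoint and background 
writers; this just shows where the slider threshold would bite:

/* Hypothetical model of the "linear slider": count the bytes written
 * to each 1GB relation segment and force an early fsync once the
 * threshold is crossed, so no segment ever piles up more than
 * SLIDER_THRESHOLD of dirty data behind the final checkpoint fsync. */
#include <stdio.h>
#include <unistd.h>

#define SLIDER_THRESHOLD (256L * 1024 * 1024)   /* the 256MB example */

typedef struct SegmentWriteState
{
    int  fd;                /* open relation segment file */
    long bytes_since_sync;  /* bytes written since the last fsync */
} SegmentWriteState;

/* Write one page to the segment, absorbing an early fsync when the
 * slider threshold is reached.  Returns 0 on success, -1 on error. */
static int
segment_write(SegmentWriteState *seg, const void *page, size_t len)
{
    if (write(seg->fd, page, len) != (ssize_t) len)
        return -1;

    seg->bytes_since_sync += (long) len;
    if (seg->bytes_since_sync >= SLIDER_THRESHOLD)
    {
        if (fsync(seg->fd) != 0)
            return -1;
        seg->bytes_since_sync = 0;   /* segment is clean again */
    }
    return 0;
}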

> Taken together my checkpoint proposal method,
> * write phase
>    - Almost same, but considering fsync phase schedule.
>    - Considering case of background-write in OS, sort buffer before
> starting checkpoint write.

This cannot work for the reasons I've outlined here.  I guarantee you I 
will easily find a test workload where it performs worse than what's 
happening right now.  If you want to play with this to learn more about 
the trade-offs involved, that's fine, but expect me to vote against 
accepting any change of this form.  I would prefer you not to submit 
one, because it will waste a large amount of reviewer time to reach that 
conclusion yet again.  And I'm not going to be that reviewer.

> * fsync phase
>    - Considering checkpoint schedule and write-phase schedule
>    - Executing separated sync_file_range() and sleep, in final fsync().

If you can figure out how to use sync_file_range() to fine-tune how much 
fsync activity is happening at any time, that would be useful on all the 
platforms that support it.  I haven't tried it just because that looked 
to me like a large job refactoring the entire fsync absorb mechanism, 
and I've never had enough funding to take it on.  That approach has a 
lot of good properties, if it could be made to work without a lot of 
code changes.
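
Just to illustrate the mechanics rather than sketch a real design, here 
is roughly what paced syncing looks like with that call.  The chunk 
size and sleep interval are arbitrary knobs, and the hard part this 
skips is exactly the absorb refactoring mentioned above:

/* Push a segment's dirty pages out a chunk at a time, sleeping
 * between chunks so the flushing is paced instead of issued all at
 * once.  sync_file_range() is Linux-only and does not flush file
 * metadata, so a final fsync() is still required. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

static int
paced_segment_sync(int fd, off_t total_len)
{
    const off_t chunk = 8L * 1024 * 1024;   /* 8MB per step, arbitrary */

    for (off_t offset = 0; offset < total_len; offset += chunk)
    {
        /* Start writeback of this chunk and wait for it to finish */
        if (sync_file_range(fd, offset, chunk,
                            SYNC_FILE_RANGE_WRITE |
                            SYNC_FILE_RANGE_WAIT_AFTER) != 0)
            return -1;
        usleep(10000);                      /* 10ms pause, arbitrary */
    }
    return fsync(fd);   /* cover metadata and anything still dirty */
}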

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com


