Re: Improvement of checkpoint IO scheduler for stable transaction responses - Mailing list pgsql-hackers

From: KONDO Mitsumasa
Subject: Re: Improvement of checkpoint IO scheduler for stable transaction responses
Msg-id: 51ECF2CF.6080109@lab.ntt.co.jp
In response to: Re: Improvement of checkpoint IO scheduler for stable transaction responses (Greg Smith <greg@2ndQuadrant.com>)
List: pgsql-hackers
(2013/07/19 22:48), Greg Smith wrote:
> On 7/19/13 3:53 AM, KONDO Mitsumasa wrote:
>> Recently, users who think system availability is important use
>> synchronous replication clusters.
>
> If your argument for why it's OK to ignore bounding crash recovery on the master
> is that it's possible to failover to a standby, I don't think that is
> acceptable.  PostgreSQL users certainly won't like it.
OK, I will also test recovery time. However, I am now considering a better
approach, and I will test it with a new patch.

>> I especially want you to read lines 631, 651, and 656.
>> MAX_WRITEBACK_PAGES is 1024 pages (1024 * 4096 bytes = 4 MiB).
>
> You should read http://www.westnet.com/~gsmith/content/linux-pdflush.htm to
> realize everything you're telling me about the writeback code and its congestion
> logic I knew back in 2007.  The situation is even worse than you describe,
> because this section of Linux has gone through multiple, major revisions since
> then.  You can't just say "here is the writeback source code"; you have to
> reference each of the commonly deployed versions of the writeback feature to tell
> how this is going to play out if released.  There are four major ones I pay
> attention to.  The old kernel style as seen in RHEL5/2.6.18--that's what my 2007
> paper discussed--the similar code but with very different defaults in 2.6.22, the
> writeback method/tuning in RHEL6/Debian Squeeze/2.6.32, and then there are newer
> kernels.  (The newer ones separate out into a few branches too, I haven't mapped
> those as carefully yet)
The part of the writeback source code I quoted is almost the same as in the
community kernel (2.6.32.61). I also read Linux kernel 3.9.7, and this part is
almost the same there too. I think fs-writeback.c is easier to read than
xlog.c; it is only 1309 lines. I believe the Linux distributions differ only in
tuning parameters, and the programming logic is the same. Do you think I need
to read the Debian kernel source code as well? I will read the relevant part of
it, because it is a few dozen lines at most.

>  There are some examples of what really bad checkpoints look
> like in
> http://www.2ndquadrant.com/static/2quad/media/pdfs/talks/WriteStuff-PGCon2011.pdf
> if you want to see some of them.  That's the talk I did around the same time I
> was trying out spreading the database fsync calls out over a longer period.
Did this happen on an ext3 or ext4 file system? I think this is a bug in XFS.
If an fsync call does not return, it means the WAL cannot be written and
commits cannot be returned. That is a serious problem.

My fsync patch only sleeps after fsync has returned successfully, and the
maximum sleep time is capped at 10 seconds. So it does not make this problem
worse.
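
To make the idea concrete, here is a minimal sketch in Python of what the
patch does during the sync phase (the real patch is C inside the checkpointer;
the function and parameter names here are hypothetical):

import os
import time

MAX_SLEEP_AFTER_FSYNC = 10.0  # seconds; the cap described above

def sync_files(fds, sleep_per_sync=1.0):
    # Fsync each file in turn.  os.fsync() raises OSError on failure, so
    # the sleep below only happens after a *successful* fsync; the cap
    # keeps the whole sync phase from being stretched out indefinitely.
    for fd in fds:
        os.fsync(fd)
        time.sleep(min(sleep_per_sync, MAX_SLEEP_AFTER_FSYNC))

The point is that the sleep gives the storage idle time between fsync calls,
so one queue of dirty data does not back up behind the next.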

> When I did that, checkpoints became even less predictable, and that was a major
> reason behind why I rejected the approach.  I think your suggestion will have the
> same problem.  You just aren't generating test cases with really large write
> workloads yet to see it.  You also don't seem afraid of how exceeding the
> checkpoint timeout is a very bad thing yet.
I think the important question is why this problem occurred. We should try to
find out which program has the bug or problem.

>> In addition, you have said that setting a large value for
>> checkpoint_timeout or checkpoint_completion_target improves performance,
>> but is that true in all cases?
>
> The timeout, yes.  Throughput is always improved by increasing
> checkpoint_timeout.  Fewer checkpoints per unit of time increase efficiency.
> Fewer writes of the most heavily accessed buffers happen per transaction.  It
> is faster because you are doing less work, which on average is always faster
> than doing more work.  And doing less work usually beats doing more work done
> smarter.
>
> If you want to see how much work per transaction a test is doing, track the
> numbers of buffers written at the beginning/end of your test via
> pg_stat_bgwriter.  Tests that delay checkpoints will show a lower total number of
> writes per transaction.  That seems more efficient, but it's efficiency mainly
> gained by ignoring checkpoint_timeout.
OK, I will try that in the next test.
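
To make the measurement concrete, here is a rough sketch of such a
before/after snapshot in Python (using psycopg2 and the 9.x-era
pg_stat_bgwriter columns; the transaction count is a placeholder that the
benchmark driver would report):

import psycopg2

def snapshot(conn):
    # Grab the cumulative buffer-write counters in one row.
    with conn.cursor() as cur:
        cur.execute("SELECT buffers_checkpoint, buffers_clean,"
                    "       buffers_backend FROM pg_stat_bgwriter")
        return cur.fetchone()

conn = psycopg2.connect("dbname=postgres")
before = snapshot(conn)
# ... run the benchmark here ...
after = snapshot(conn)
txns = 100000  # placeholder: transactions committed during the run
for name, b, a in zip(("checkpoint", "clean", "backend"), before, after):
    print(f"buffers_{name}: {(a - b) / txns:.4f} writes/txn")

A test that merely delays checkpoints should show up here as a lower
buffers_checkpoint delta per transaction, which is exactly the effect
described above.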

>> When checkpoint_completion_target is actually enlarged, performance may
>> fall in some cases. I think this is because the last fsync becomes heavy
>> owing to the writes having been spread out too slowly.
>
> I think you're confusing throughput and latency here.  Increasing the
> checkpoint timeout, or to a lesser extent the completion target, on average
> increases throughput.  It results in less work, and the more/less work amount
> is much more important than worrying about scheduler details.  No matter how
> efficient a given write is, whether you've sorted it across elevator horizon
> boundary A or boundary B, it's better not to do it at all.
I think the fsync that takes the longest, or a long run of fsyncs, blocks the
other transactions. And my patch is not only an improvement in throughput; it
also realizes stable response times during the fsync phase of a checkpoint.

> By the way:  if you have a theory like "the last fsync having become heavy" for
> why something is happening, measure it.  Set log_min_messages to debug2 and
> you'll get details about every single fsync in your logs.  I did that for all my
> tests that led me to conclude fsync delaying on its own didn't help that
> problem.  I was measuring my theories as directly as possible.
OK, those are important things. I will set more detailed debug logging in this phase.
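
For analyzing those logs afterwards, something like the following rough Python
sketch could summarize the fsync latencies. It assumes the DEBUG1
"checkpoint sync: number=... file=... time=... msec" lines that md.c emits in
recent releases; the regex must be adjusted to whatever your server actually
logs:

import re
import sys

# Matches per-file sync lines such as:
#   DEBUG:  checkpoint sync: number=3 file=base/16384/16397 time=2145.312 msec
SYNC_RE = re.compile(r"checkpoint sync: number=\d+ file=\S+ time=([\d.]+) msec")

times = sorted(float(m.group(1))
               for line in sys.stdin
               for m in [SYNC_RE.search(line)] if m)

if times:
    p95 = times[min(len(times) - 1, int(0.95 * len(times)))]
    print(f"fsyncs: {len(times)}  max: {times[-1]:.1f} msec  p95: {p95:.1f} msec")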

> I'm willing to consider an optional, sloppy checkpoint approach that uses heavy
> load to adjust how often checkpoints happen.  But if we're going to do that, it
> has to be extremely clear that the reason for the gain is the checkpoint
> spacing--and there is going to be a crash recovery time penalty paid for it.  And
> this patch is not how I would do that.
That's right. We should show that the benefit is larger than the penalty.

> It's not really clear yet where the gains you're seeing are really coming from.
> If you re-ran all your tests with pg_stat_bgwriter before/after snapshots, logged
> every fsync call, and then build some tools to analyze the fsync call latency,
> then you'll have enough data to talk about this usefully.  That's what I consider
> the bare minimum evidence to consider changing something here.  I have all of
> those features in pgbench-tools with checkpoint logging turned way up, but
> they're not all in the dbt2 toolset yet as far as I know.
OK, I will also capture /proc/meminfo with each snapshot. I think the OS
background writer only writes every 5 seconds, and only pages that have been
dirty for more than 30 seconds, because in DBT-2 the dirty buffers in the OS
do not exceed dirty_background_ratio during a checkpoint. So I am considering
a new method that sorts and batches the writes in the write phase, with a
longer sleep time (5 sec) between them.
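
A minimal sketch of the /proc/meminfo sampling I have in mind (the field names
are the standard Linux ones; the one-second interval and ten-minute duration
are just examples):

import time

def sample_meminfo():
    # Return the Dirty and Writeback counters (in kB) from /proc/meminfo.
    out = {}
    with open("/proc/meminfo") as f:
        for line in f:
            parts = line.split()
            if parts[0] in ("Dirty:", "Writeback:"):
                out[parts[0].rstrip(":")] = int(parts[1])
    return out

# Poll once per second alongside the benchmark.  If Dirty stays well below
# dirty_background_ratio of RAM, background writeback is hardly triggered
# and the checkpointer's own writes dominate the I/O.
for _ in range(600):
    s = sample_meminfo()
    print(time.strftime("%H:%M:%S"),
          f"Dirty={s['Dirty']} kB  Writeback={s['Writeback']} kB")
    time.sleep(1)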

I also surveyed the ext3 file system. My system's block size is 4096 bytes,
but 8192 or more seems to be better. It would decrease the number of inodes
and give larger sequential disk extents. The inode block group would grow from
128MB to 256MB. If you have test results, please tell us.

Best regards,
--
Mitsumasa KONDO
NTT Open Software Center


