Re: Spread checkpoint sync - Mailing list pgsql-hackers

From Jeff Janes
Subject Re: Spread checkpoint sync
Msg-id AANLkTimjCdQMFnjZJ_N98mD_K2Mz-_Ue7S4tY4tMC3xu@mail.gmail.com
In response to Re: Spread checkpoint sync  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Sat, Nov 20, 2010 at 5:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Nov 20, 2010 at 6:21 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

>>> Doing all the writes and then all the fsyncs meets this requirement
>>> trivially, but I'm not so sure that's a good idea.  For example, given
>>> files F1 ... Fn with dirty pages needing checkpoint writes, we could
>>> do the following: first, do any pending fsyncs for files not among F1
>>> ... Fn; then, write all pages for F1 and fsync, write all pages for F2
>>> and fsync, write all pages for F3 and fsync, etc.  This might seem
>>> dumb because we're not really giving the OS a chance to write anything
>>> out before we fsync, but think about the ext3 case where the whole
>>> filesystem cache gets flushed anyway.  It's much better to dump the
>>> cache at the beginning of the checkpoint and then again after every
>>> file than it is to spew many GB of dirty stuff into the cache and then
>>> drop the hammer.
>>
>> But the kernel has knobs to prevent that from happening.
>> dirty_background_ratio, dirty_ratio, dirty_background_bytes (on newer
>> kernels), dirty_expire_centisecs.  Don't these knobs work?  Also, ext3
>> is supposed to do a journal commit every 5 seconds under default mount
>> conditions.
>
> I don't know in detail.  dirty_expire_centisecs sounds useful; I think
> the problem with dirty_background_ratio and dirty_ratio is that the
> default ratios are large enough that on systems with a huge pile of
> memory, they allow more dirty data to accumulate than can be flushed
> without causing an I/O storm.

True, but I think that changing these from their defaults is not
considered to be a dark art reserved for kernel hackers, i.e., they are
something that sysadmins are expected to tweak to suit their
workload, just like shmmax and such.  And for very large memory
systems, even 1% may be too much to cache (dirty*_ratio can only be
set in integer percentage points), so recent kernels introduced
dirty*_bytes parameters.  I like these better because they do what
they say.  With dirty*_ratio, I could never figure out what it was
a ratio of, and the results were unpredictable without extensive
experimentation.
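
To put rough numbers on the "even 1% may be too much" point, here is a
minimal arithmetic sketch.  It is only an illustration: the helper name
dirty_limit_bytes is hypothetical, and the base the percentage applies
to is passed in as a parameter because the kernel does not simply use
total RAM.  Treating the base as total memory is an assumption that
gives an upper-bound feel, not the kernel's actual threshold.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def dirty_limit_bytes(base_bytes, ratio_percent):
    """Bytes allowed to be dirty for an integer-percent ratio.

    base_bytes is an ASSUMPTION supplied by the caller; the kernel's
    real base is not total RAM, so this is only a ballpark figure.
    """
    return base_bytes * ratio_percent // 100


# Smallest non-zero ratio (1%) on machines of increasing size,
# taking total memory as the base purely for illustration.
for mem_gib in (4, 64, 256):
    limit = dirty_limit_bytes(mem_gib * GIB, 1)
    print(f"{mem_gib} GiB at 1% -> about {limit // MIB} MiB of dirty data")
```

The granularity problem is visible here: the step between adjacent
integer ratios grows with memory size, which is the gap the
dirty*_bytes parameters close by taking an absolute value.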

> I believe Greg Smith made a comment
> along the lines of - memory sizes are growing faster than I/O speeds;
> therefore a ratio that is OK for a low-end system with a modest amount
> of memory causes problems on a high-end system that has faster I/O but
> MUCH more memory.

Yes, but how much work do we want to put into redoing the checkpoint
logic so that the sysadmin on a particular OS and configuration and FS
can avoid having to change the kernel parameters away from their
defaults?  (Assuming of course I am correctly understanding the
problem, always a dangerous assumption.)
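
For reference, the two orderings under discussion can be sketched
roughly as below.  This is not PostgreSQL code and not anyone's actual
patch: the function names, the dict of path -> list of page buffers,
and the log argument are all hypothetical, and the sketch leaves out
the initial step of fsyncing files that are not among F1 ... Fn.

```python
import os


def checkpoint_batched(dirty, log):
    """Write all pages for all files first, then fsync every file."""
    fds = {}
    try:
        for path, pages in dirty.items():
            fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
            fds[path] = fd
            for page in pages:
                os.write(fd, page)
            log.append(("write", path))
        # Only now ask the OS to flush; lots of dirty data may be queued.
        for path, fd in fds.items():
            os.fsync(fd)
            log.append(("fsync", path))
    finally:
        for fd in fds.values():
            os.close(fd)


def checkpoint_interleaved(dirty, log):
    """Write all pages for one file, fsync it, then move to the next."""
    for path, pages in dirty.items():
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
        try:
            for page in pages:
                os.write(fd, page)
            log.append(("write", path))
            os.fsync(fd)  # flush after each file, bounding what is queued
            log.append(("fsync", path))
        finally:
            os.close(fd)
```

The log list only records the order of operations so the two variants
can be compared; which one behaves better will depend on the
filesystem and the kernel settings mentioned above.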

Some experiments I have just done show that dirty_expire_centisecs
does not seem reliable on ext3, but the dirty*_ratio and dirty*_bytes
seem reliable on ext2, ext3, and ext4.

But that may not apply to RAID; I don't have one I can test.
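
For anyone wanting to check what their own machine is currently using
before experimenting, a small sketch along these lines should work on
Linux.  The helper name read_vm_knobs is hypothetical; the file names
are the vm sysctls named in this thread plus dirty_bytes, as exposed
under /proc/sys/vm.  A knob missing on an older kernel is reported as
None rather than raising.

```python
import os

# The writeback-related sysctls discussed above.
KNOBS = (
    "dirty_background_ratio",
    "dirty_ratio",
    "dirty_background_bytes",
    "dirty_bytes",
    "dirty_expire_centisecs",
)


def read_vm_knobs(base="/proc/sys/vm"):
    """Return {knob: int or None} for each knob found under base."""
    values = {}
    for name in KNOBS:
        try:
            with open(os.path.join(base, name)) as f:
                values[name] = int(f.read().strip())
        except (OSError, ValueError):
            # Not present (older kernel, non-Linux) or not an integer.
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_vm_knobs().items():
        print(f"{name} = {value}")
```

The base directory is a parameter only so the function can be pointed
at a scratch directory for testing.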


Cheers,

Jeff

