Re: Spread checkpoint sync - Mailing list pgsql-hackers

From	Jeff Janes
Subject	Re: Spread checkpoint sync
Date	November 20, 2010 22:11:48
Msg-id	AANLkTimjCdQMFnjZJ_N98mD_K2Mz-_Ue7S4tY4tMC3xu@mail.gmail.com
In response to	Re: Spread checkpoint sync (Robert Haas <robertmhaas@gmail.com>)
Responses	Re: Spread checkpoint sync Re: Spread checkpoint sync
List	pgsql-hackers

On Sat, Nov 20, 2010 at 5:17 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Nov 20, 2010 at 6:21 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

>>> Doing all the writes and then all the fsyncs meets this requirement
>>> trivially, but I'm not so sure that's a good idea.  For example, given
>>> files F1 ... Fn with dirty pages needing checkpoint writes, we could
>>> do the following: first, do any pending fsyncs for files not among F1
>>> ... Fn; then, write all pages for F1 and fsync, write all pages for F2
>>> and fsync, write all pages for F3 and fsync, etc.  This might seem
>>> dumb because we're not really giving the OS a chance to write anything
>>> out before we fsync, but think about the ext3 case where the whole
>>> filesystem cache gets flushed anyway.  It's much better to dump the
>>> cache at the beginning of the checkpoint and then again after every
>>> file than it is to spew many GB of dirty stuff into the cache and then
>>> drop the hammer.
>>
>> But the kernel has knobs to prevent that from happening.
>> dirty_background_ratio, dirty_ratio, dirty_background_bytes (on newer
>> kernels), dirty_expire_centisecs.  Don't these knobs work?  Also, ext3
>> is supposed to do a journal commit every 5 seconds under default mount
>> conditions.
>
> I don't know in detail.  dirty_expire_centisecs sounds useful; I think
> the problem with dirty_background_ratio and dirty_ratio is that the
> default ratios are large enough that on systems with a huge pile of
> memory, they allow more dirty data to accumulate than can be flushed
> without causing an I/O storm.

True, but I think that changing these from their defaults is not
considered to be a dark art reserved for kernel hackers, i.e., they are
something that sysadmins are expected to tweak to suit their
workload, just like shmmax and such.  And for very large memory
systems, even 1% may be too much to cache (dirty*_ratio can only be
set in integer percentage points), so recent kernels introduced
dirty*_bytes parameters.  I like these better because they do what
they say.  With the dirty*_ratio, I could never figure out what it was
a ratio of, and the results were unpredictable without extensive
experimentation.
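
To put rough numbers on the ratio problem, here is a minimal sketch in
Python.  It assumes the ratio is taken against total RAM, which is only
an upper bound, since the kernel applies it to a smaller figure.  The
function names and the example machine (128 GiB of RAM, 100 MiB/s of
sustained writes) are hypothetical and chosen only for illustration.

```python
def dirty_limit_bytes(total_ram_bytes: int, ratio_percent: int) -> int:
    """Ceiling on dirty page cache allowed by an integer-percent ratio.

    Assumption: the ratio is applied to total RAM.  The kernel uses a
    smaller memory figure, so treat the result as an upper bound.
    """
    if not 0 <= ratio_percent <= 100:
        raise ValueError("ratio must be an integer percentage in 0..100")
    return total_ram_bytes * ratio_percent // 100


def seconds_to_flush(dirty_bytes: int, write_bytes_per_sec: int) -> float:
    """Time needed to drain the dirty data at a sustained write rate."""
    return dirty_bytes / write_bytes_per_sec


GiB = 1024 ** 3
MiB = 1024 ** 2

# Hypothetical machine: 128 GiB of RAM, storage sustaining 100 MiB/s.
limit = dirty_limit_bytes(128 * GiB, 1)  # 1% is the smallest nonzero ratio
print(f"dirty ceiling at 1%: {limit / GiB:.2f} GiB")
print(f"time to flush it:    {seconds_to_flush(limit, 100 * MiB):.1f} s")
```

Under these assumptions, even the smallest nonzero ratio permits more
than a gigabyte of dirty data, which is why a byte-denominated limit is
easier to reason about on large-memory machines.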

> I believe Greg Smith made a comment
> along the lines of - memory sizes are growing faster than I/O speeds;
> therefore a ratio that is OK for a low-end system with a modest amount
> of memory causes problems on a high-end system that has faster I/O but
> MUCH more memory.

Yes, but how much work do we want to put into redoing the checkpoint
logic so that the sysadmin on a particular OS and configuration and FS
can avoid having to change the kernel parameters away from their
defaults?  (Assuming of course I am correctly understanding the
problem, always a dangerous assumption.)
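
For reference, the two orderings under discussion can be sketched in
Python roughly as follows.  This is an illustration only, not
PostgreSQL's checkpoint code: the function names are hypothetical,
files are addressed by path rather than by buffer, and the preliminary
step of fsyncing files outside F1 ... Fn is omitted.

```python
import os


def _write_all(fd: int, data: bytes) -> None:
    """Write the whole buffer, retrying after partial writes."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]


def checkpoint_batched(dirty: dict[str, bytes]) -> None:
    """Write every file first, then fsync every file."""
    fds = []
    try:
        for path, data in dirty.items():
            fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
            fds.append(fd)
            _write_all(fd, data)
        # All dirty data is now in the OS cache; flush it file by file.
        for fd in fds:
            os.fsync(fd)
    finally:
        for fd in fds:
            os.close(fd)


def checkpoint_interleaved(dirty: dict[str, bytes]) -> None:
    """Write one file and fsync it before touching the next."""
    for path, data in dirty.items():
        fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)
        try:
            _write_all(fd, data)
            os.fsync(fd)  # bounds the dirty data to one file's worth
        finally:
            os.close(fd)
```

Both functions leave the same bytes on disk; they differ only in how
much dirty data can pile up in the OS cache before the first fsync.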

Some experiments I have just done show that dirty_expire_centisecs
does not seem reliable on ext3, but the dirty*_ratio and dirty*_bytes
seem reliable on ext2, ext3, and ext4.

But that may not apply to RAID; I don't have one I can test.

Cheers,

Jeff
