On Tue, Jun 25, 2013 at 1:15 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> I'm not sure it's a good idea to sleep proportionally to the time it took to
> complete the previous fsync. If you have a 1GB cache in the RAID controller,
> fsyncing a 1GB segment will fill it up. But since it fits in cache, it
> will return immediately. So we proceed fsyncing other files, until the cache
> is full and the fsync blocks. But once we fill up the cache, it's likely
> that we're hurting concurrent queries. ISTM it would be better to stay under
> that threshold, keeping the I/O system busy, but never fill up the cache
> completely.

Isn't the behavior implemented by the patch a reasonable approximation
of just that? When the fsyncs start to get slow, that's when we start
to sleep. I'll grant that it would be better to sleep when the
fsyncs are *about* to get slow, rather than when they actually have
become slow, but we have no way to know that. The only feedback we
have on how bad things are is how long it took the last fsync to
complete, so I actually think that's a much better way to go than any
fixed sleep - which will often be unnecessarily long on a well-behaved
system, and which will often be far too short on one that's having
trouble. I'm inclined to think that Kondo-san has got it right.
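
To make the feedback idea concrete, here is a minimal Python sketch of a sleep that scales with the duration of the last fsync. It is not the patch's code: the names `adaptive_sleep_s`, `fsync_files`, `sleep_ratio` and `max_sleep_s` are hypothetical, and the cap on the sleep is my own assumption.

```python
import os
import time


def adaptive_sleep_s(last_fsync_s, sleep_ratio=1.0, max_sleep_s=10.0):
    """Return a pause proportional to how long the previous fsync took.

    sleep_ratio and max_sleep_s are hypothetical tuning knobs.
    """
    return min(max(last_fsync_s, 0.0) * sleep_ratio, max_sleep_s)


def fsync_files(paths, sleep_ratio=1.0, max_sleep_s=10.0, sleep=time.sleep):
    """fsync each file, then pause in proportion to that fsync's duration.

    Returns a list of (fsync_seconds, sleep_seconds), one entry per file.
    """
    timings = []
    for path in paths:
        fd = os.open(path, os.O_RDWR)
        try:
            start = time.monotonic()
            os.fsync(fd)
            elapsed = time.monotonic() - start
        finally:
            os.close(fd)
        # A fast fsync (absorbed by cache) gives almost no pause;
        # a slow one backs off for roughly as long as it took.
        pause = adaptive_sleep_s(elapsed, sleep_ratio, max_sleep_s)
        if pause > 0:
            sleep(pause)
        timings.append((elapsed, pause))
    return timings
```

With `sleep_ratio=1.0`, an fsync that returns in a few milliseconds costs almost nothing extra, while one that takes 2 seconds is followed by a 2-second pause. A fixed sleep would charge both cases the same.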

I like your idea of putting a stake in the ground and assuming that
the fsync phase will turn out to be X% of the checkpoint, but I wonder
if we can be a bit more sophisticated, especially for cases where
checkpoint_segments is small. When checkpoint_segments is large, then
we know that some of the data will get written back to disk during the
write phase, because the OS cache is only so big. But when it's
small, the OS will essentially do nothing during the write phase, and
then it's got to write all the data out during the fsync phase. I'm
not sure we can really model that effect thoroughly, but even
something dumb would be smarter than what we have now - e.g. use 10%,
but when checkpoint_segments < 10, use 1/checkpoint_segments. Or just
assume the fsync phase will take 30 seconds. Or ... something. I'm
not really sure what the right model is here.
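
For illustration only, the two "dumb" models above could look like the following Python sketch. The function names and the `checkpoint_duration_s` parameter are hypothetical, and the clamping to 1.0 in the second function is my own assumption.

```python
def fsync_phase_fraction(checkpoint_segments, default_fraction=0.10):
    """Guess the share of the checkpoint spent in the fsync phase.

    Uses 10% normally, but 1/checkpoint_segments when
    checkpoint_segments < 10.
    """
    if checkpoint_segments < 1:
        raise ValueError("checkpoint_segments must be at least 1")
    if checkpoint_segments < 10:
        return 1.0 / checkpoint_segments
    return default_fraction


def fixed_fsync_fraction(checkpoint_duration_s, fsync_s=30.0):
    """Alternative model: assume the fsync phase always takes fsync_s seconds."""
    if checkpoint_duration_s <= 0:
        raise ValueError("checkpoint_duration_s must be positive")
    # Clamp so the fsync phase never exceeds the whole checkpoint.
    return min(fsync_s / checkpoint_duration_s, 1.0)
```

The first model reserves a larger share as checkpoint_segments shrinks (1/4 of the checkpoint at 4 segments, all of it at 1), which matches the intuition that with few segments nearly all the real I/O happens at fsync time.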

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company