Tom Lane wrote:
>Jan Wieck <JanWieck@Yahoo.com> writes:
>
>>What still needs to be addressed is the IO storm caused by checkpoints. I
>>see it much relaxed when stretching out the BufferSync() over most of
>>the time until the next one should occur. But the kernel sync at its
>>end still pushes the system hard against the wall.
>
>I have never been happy with the fact that we use sync(2) at all. Quite
>aside from the "I/O storm" issue, sync() is really an unsafe way to do a
>checkpoint, because there is no way to be certain when it is done. And
>on top of that, it does too much, because it forces syncing of files
>unrelated to Postgres.
>
>I would like to see us go over to fsync, or some other technique that
>gives more certainty about when the write has occurred. There might be
>some scope that way to allow stretching out the I/O, too.
>
>The main problem with this is knowing which files need to be fsync'd.
>The only idea I have come up with is to move all buffer write operations
>into a background writer process, which could easily keep track of
>every file it's written into since the last checkpoint. This could cause
>problems though if a backend wants to acquire a free buffer and there's
>none to be had --- do we want it to wait for the background process to
>do something? We could possibly say that backends may write dirty
>buffers for themselves, but only if they fsync them immediately. As
>long as this path is seldom taken, the extra fsyncs shouldn't be a big
>performance problem.
>
>Actually, once you build it this way, you could make all writes
>synchronous (open the files O_SYNC) so that there is never any need for
>explicit fsync at checkpoint time. The background writer process would
>be the one incurring the wait in most cases, and that's just fine. In
>this way you could directly control the rate at which writes are issued,
>and there's no I/O storm at all. (fsync could still cause an I/O storm
>if there are lots of pending writes in a single file.)
>
Or maybe fdatasync() would be slightly more efficient - do we care about
flushing metadata that much?
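
To make the idea concrete, here is a rough sketch (not actual Postgres code) of a writer that remembers which files it has written since the last checkpoint and then flushes only those. The names remember_dirty, tracked_write, checkpoint_flush, flush_data and MAX_TRACKED are all made up for illustration. It assumes a POSIX system; where fdatasync() is not advertised it falls back to fsync(). As far as I know, fdatasync() may skip metadata such as timestamps but still flushes whatever is needed to read the data back (e.g. a changed file size).

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define MAX_TRACKED 64          /* hypothetical limit, for this sketch only */

/* Use fdatasync() where the platform advertises it, else plain fsync(). */
#if defined(_POSIX_SYNCHRONIZED_IO) && _POSIX_SYNCHRONIZED_IO > 0
#define flush_data(fd) fdatasync(fd)
#else
#define flush_data(fd) fsync(fd)
#endif

/* Hypothetical bookkeeping: descriptors written since the last checkpoint. */
static int dirty_fds[MAX_TRACKED];
static int n_dirty = 0;

/* Record fd as having unflushed writes; 0 on success, -1 if the list is full. */
static int remember_dirty(int fd)
{
    int i;

    assert(fd >= 0);
    for (i = 0; i < n_dirty; i++)
        if (dirty_fds[i] == fd)
            return 0;           /* already tracked */
    if (n_dirty == MAX_TRACKED)
        return -1;
    dirty_fds[n_dirty++] = fd;
    return 0;
}

/* Record the file first, then write, so no written file goes untracked. */
static ssize_t tracked_write(int fd, const void *buf, size_t len)
{
    if (remember_dirty(fd) != 0)
        return -1;
    return write(fd, buf, len);
}

/*
 * Checkpoint: flush every tracked file, and nothing else.
 * Returns the number of files flushed, or -1 on the first failure
 * (the list is left intact in that case so the caller can retry).
 */
static int checkpoint_flush(void)
{
    int i, flushed = 0;

    for (i = 0; i < n_dirty; i++) {
        if (flush_data(dirty_fds[i]) != 0) {
            perror("flush_data");
            return -1;
        }
        flushed++;
    }
    n_dirty = 0;
    return flushed;
}
```

Unlike sync(), each flush call here returns only after the kernel reports that file's data as written, so the checkpoint knows when it is done, and files unrelated to the tracked set are left alone.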
cheers
andrew