From: Jeff Janes
Subject: Re: Improvement of checkpoint IO scheduler for stable transaction responses
Date:
Msg-id: CAMkU=1wXRCo85AxXDRDr8-kt_=kUVNMC_WAOEyNyGRfpa8rWjA@mail.gmail.com
In response to: Re: Improvement of checkpoint IO scheduler for stable transaction responses (Greg Smith <greg@2ndQuadrant.com>)
List: pgsql-hackers
On Sunday, July 14, 2013, Greg Smith wrote:
> On 7/14/13 5:28 PM, james wrote:
>> Some random seeks during sync can't be helped, but if they are done when
>> we aren't waiting for sync completion then they are in effect free.

> That happens sometimes, but if you measure you'll find this doesn't actually occur usefully in the situation everyone dislikes. In a write-heavy environment where the database doesn't fit in RAM, backends and/or the background writer are constantly writing data out to the OS. WAL is going out constantly as well, and in many cases that's competing for the same disks too.

While I think it is probably true that many systems don't separate WAL from non-WAL onto different IO controllers, is it true that many systems in need of heavy IO tuning don't do so? I thought that would be the first stop for any DBA of a highly IO-write-constrained database.

> The most popular blocks in the database get high usage counts and they never leave shared_buffers except at checkpoint time. That's easy to prove to yourself with pg_buffercache.
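
As an aside, the mechanism behind those usage counts is the clock sweep over the buffer pool. Here is a minimal sketch of the idea, not the actual bufmgr.c code (the struct and function names are invented for illustration, though the cap of 5 matches the real BM_MAX_USAGE_COUNT):

#include <stdbool.h>
#include <stdio.h>

#define MAX_USAGE_COUNT 5       /* matches PostgreSQL's BM_MAX_USAGE_COUNT */

typedef struct Buffer
{
    int  usage_count;
    bool dirty;
} Buffer;

/* called on every access to a resident buffer */
static void
pin_buffer(Buffer *buf)
{
    if (buf->usage_count < MAX_USAGE_COUNT)
        buf->usage_count++;
}

/* one visit by the eviction sweep: true means the buffer may be reclaimed */
static bool
sweep_visit(Buffer *buf)
{
    if (buf->usage_count > 0)
    {
        buf->usage_count--;     /* a popular buffer just loses one tick */
        return false;
    }
    return true;                /* found at zero: evictable */
}

int
main(void)
{
    Buffer hot = {0, true};

    /* a block touched at least once per sweep cycle never decays to zero */
    for (int cycle = 0; cycle < 100; cycle++)
    {
        pin_buffer(&hot);
        if (sweep_visit(&hot))
            printf("evicted at cycle %d\n", cycle);     /* never reached */
    }
    printf("usage_count after 100 cycles = %d (still resident, still dirty)\n",
           hot.usage_count);
    return 0;
}

Such a block stays in shared_buffers indefinitely, and if it is dirty, only the checkpointer ever writes it, which is exactly what pg_buffercache will show you.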

> And once the write cache fills, every I/O operation is now competing. There is nothing happening for free. You're stealing I/O from something else any time you force a write out. The optimal throughput path for checkpoints turns out to be delaying every single bit of I/O as long as possible, in favor of the [backend|bgwriter] writes and WAL. Whenever you delay a buffer write, you have increased the possibility that someone else will write the same block again. And the buffers being written by the checkpointer are, on average, the most popular ones in the database. Writing any of them to disk pre-emptively has high odds of writing the same block more than once per checkpoint.
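
To put toy numbers on that last point: if a buffer is re-dirtied k times during a checkpoint interval, flushing it eagerly after every dirtying costs up to k physical writes, while deferring it to the checkpoint costs exactly one. A back-of-the-envelope illustration (the re-dirty counts are invented, and the eager column is the worst case of one write per dirtying):

#include <stdio.h>

int
main(void)
{
    /* hypothetical re-dirty counts per checkpoint interval, cold to hot */
    int redirty[] = {1, 10, 100, 1000};
    int n = sizeof(redirty) / sizeof(redirty[0]);

    printf("%10s %22s %16s\n", "re-dirties", "eager writes (worst)", "deferred");
    for (int i = 0; i < n; i++)
        printf("%10d %22d %16d\n", redirty[i], redirty[i], 1);
    return 0;
}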


Should the checkpointer make multiple passes over the buffer pool, writing out the high usage_count buffers first, because no one else is going to do it, and then going back for the low usage_count buffers in the hope they were already written out? On the other hand, if the checkpointer writes out a low-usage buffer, why would anyone else need to write it again soon? If it were likely to get dirtied often, it wouldn't be low usage. If it were dirtied rarely, it wouldn't be dirty anymore once written.
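
Concretely, the sort of thing I mean, as a rough sketch only (hypothetical; the real checkpointer, BufferSync() in bufmgr.c, makes a single pass, and the threshold and pool size below are invented):

#include <stdbool.h>

#define NBUFFERS        1024    /* stand-in for shared_buffers */
#define USAGE_THRESHOLD 3       /* invented cutoff between "hot" and "cold" */

typedef struct Buffer
{
    int  usage_count;
    bool dirty;
} Buffer;

static Buffer shared_buffers[NBUFFERS];

/* stand-in for the real FlushBuffer(): write the page out, mark it clean */
static void
flush_buffer(Buffer *buf)
{
    buf->dirty = false;
}

static void
checkpoint_two_pass(void)
{
    /* Pass 1: high-usage dirty buffers first; no one else will write these */
    for (int i = 0; i < NBUFFERS; i++)
        if (shared_buffers[i].dirty &&
            shared_buffers[i].usage_count >= USAGE_THRESHOLD)
            flush_buffer(&shared_buffers[i]);

    /*
     * Pass 2: any low-usage buffers still dirty, hoping the bgwriter or
     * backends wrote many of them out in the meantime.
     */
    for (int i = 0; i < NBUFFERS; i++)
        if (shared_buffers[i].dirty)
            flush_buffer(&shared_buffers[i]);
}

int
main(void)
{
    checkpoint_two_pass();
    return 0;
}

This ignores locking, checkpoint spreading, and WAL-before-data ordering; it is only meant to make the two-pass ordering concrete.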

Cheers,

Jeff
