Amit Kapila <amit.kapila16@gmail.com> writes:
> On Wed, May 4, 2016 at 4:02 PM, Alex Ignatov <a.ignatov@postgrespro.ru>
> wrote:
>> On 03.05.2016 2:17, Tom Lane wrote:
>>> Writing a single sector ought to be atomic too.
>> pg_control is 8k long (I think that is the length of one page in the default PG
>> compile settings).
> The actual data written is always sizeof(ControlFileData) which should be
> less than one sector.
Yes. We don't care what happens to the rest of the file as long as the
first sector's worth is updated atomically. See the comments for
PG_CONTROL_SIZE and the code in ReadControlFile/WriteControlFile.
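Here is a minimal C sketch of the "only the first sector's worth matters" idea. It is not PostgreSQL's code. SketchControlData, SKETCH_CONTROL_SIZE, SKETCH_SECTOR_SIZE and the sketch_* functions are hypothetical stand-ins for ControlFileData, PG_CONTROL_SIZE and WriteControlFile/ReadControlFile. It also uses a plain CRC-32 where PostgreSQL uses CRC-32C. The struct is zero-padded out to the full file size, and the reader trusts only a checksum computed over the struct itself. Damage to the padding is therefore harmless, while damage inside the struct is detected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sizes, for illustration only. */
#define SKETCH_SECTOR_SIZE  512
#define SKETCH_CONTROL_SIZE 8192

/* Hypothetical stand-in for ControlFileData (fields ordered to avoid padding). */
typedef struct SketchControlData
{
    uint64_t system_identifier;
    uint64_t checkpoint_lsn;
    uint32_t version;
    uint32_t crc;               /* checksum over all preceding fields */
} SketchControlData;

/* The whole scheme relies on the struct fitting inside one sector. */
_Static_assert(sizeof(SketchControlData) <= SKETCH_SECTOR_SIZE,
               "control data must fit in a single sector");

/* Plain bitwise CRC-32 (reflected, polynomial 0xEDB88320). */
static uint32_t
sketch_crc32(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++)
    {
        crc ^= p[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Fill buf (SKETCH_CONTROL_SIZE bytes): struct with its CRC, then zeros. */
static void
sketch_write_control(SketchControlData data, char *buf)
{
    data.crc = sketch_crc32(&data, offsetof(SketchControlData, crc));
    memset(buf, 0, SKETCH_CONTROL_SIZE);
    memcpy(buf, &data, sizeof(data));
}

/* Read the struct back; returns 1 if its CRC matches, 0 otherwise.
 * Bytes past sizeof(SketchControlData) are never examined. */
static int
sketch_read_control(const char *buf, SketchControlData *out)
{
    memcpy(out, buf, sizeof(*out));
    return sketch_crc32(out, offsetof(SketchControlData, crc)) == out->crc;
}
```

So a write is safe as long as the first sizeof(SketchControlData) bytes land atomically. A torn write anywhere in the rest of the 8K buffer cannot affect what the reader accepts.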
We could change to a different PG_CONTROL_SIZE pretty easily, and there's
certainly room to argue that reducing it to 512 or 1024 would be more
efficient. I think the motivation for setting it at 8K was basically
"we're already assuming that 8K writes are efficient, so let's assume
it here too". But since the file is only written once per checkpoint,
efficiency is not really a key selling point anyway. If you could make
an argument that some other size would reduce the risk of failures,
it would be interesting --- but I suspect any such argument would be
very dependent on the quirks of a specific file system.
One point worth considering is that on most file systems, rewriting
a fraction of a page is *less* efficient than rewriting a full page,
because the kernel first has to read in the old contents to fill
the disk buffer it's going to partially overwrite with new data.
This motivates against trying to reduce the write size too much.
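The read-modify-write cost can be illustrated with a toy model. It assumes a hypothetical 4096-byte kernel page and that none of the affected pages are already cached. Under those assumptions, a page the write only partially covers must be read in first so its other bytes are preserved. A page that is overwritten completely needs no read. sketch_pages_needing_read and SKETCH_PAGE_SIZE are illustrative names, not kernel or PostgreSQL APIs.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096   /* hypothetical kernel page size */

/* Count the pages touched by a write of len bytes at offset that are only
 * partially covered, and so would have to be read before being modified. */
static unsigned
sketch_pages_needing_read(size_t offset, size_t len)
{
    if (len == 0)
        return 0;

    size_t first = offset / SKETCH_PAGE_SIZE;
    size_t last = (offset + len - 1) / SKETCH_PAGE_SIZE;
    unsigned reads = 0;

    for (size_t page = first; page <= last; page++)
    {
        size_t start = page * SKETCH_PAGE_SIZE;
        size_t end = start + SKETCH_PAGE_SIZE;
        int covers_whole_page = offset <= start && offset + len >= end;

        if (!covers_whole_page)
            reads++;            /* partial page: old contents needed first */
    }
    return reads;
}
```

For example, in this model a 512-byte write at offset 0 needs one read, while a full 4096-byte write at offset 0 needs none. That is why shrinking the write below a page need not save any I/O.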
regards, tom lane