Re: sorted writes for checkpoints - Mailing list pgsql-hackers

From Jeff Janes
Subject Re: sorted writes for checkpoints
Msg-id AANLkTimO1ia1=SzJncr97fJ=LpsEvJ3_KW0vXNFhsLBa@mail.gmail.com
In response to Re: sorted writes for checkpoints  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Fri, Oct 29, 2010 at 6:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Oct 29, 2010 at 2:58 AM, Itagaki Takahiro
> <itagaki.takahiro@gmail.com> wrote:
>> On Fri, Oct 29, 2010 at 3:23 PM, Heikki Linnakangas
>> <heikki.linnakangas@enterprisedb.com> wrote:
>>> Simon's argument in the thread that the todo item points to
>>> (http://archives.postgresql.org/pgsql-patches/2008-07/msg00123.php) is
>>> basically that we don't know what the best algorithm is yet and benchmarking
>>> is a lot of work, so let's just let people do whatever they feel like until
>>> we settle on the best approach. I think we need to bite the bullet and do
>>> some benchmarking, and commit one carefully vetted patch to the backend.
>>
>> When I submitted the patch, I tested it on disk-based RAID-5 machine:
>> http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php
>> But there were no additional benchmarking reports at that time. We still
>> need benchmarking before we re-examine the feature. For example, SSD and
>> SSD-RAID was not popular at that time, but now they might be considerable.
>
> There are really two separate things here:
>
> (1) trying to do all the writes to file A before you start doing
> writes to file B, and
> (2) trying to write out blocks to each file in ascending logical block
> number order
>
> I'm much more convinced of the value of #1 than I am of the value of
> #2.  If we do #1, we can then spread out the checkpoint fsyncs in a
> meaningful way without fearing that we'll need to fsync the same file
> a second time for the same checkpoint.
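
To make the two strategies concrete, here is a minimal sketch (not PostgreSQL's actual checkpoint code; file names and block layout are hypothetical): group the dirty-buffer list by file, sort each file's blocks into ascending order, write one file completely, fsync it exactly once, then move on to the next file.

```python
# Illustrative sketch only -- not the real checkpointer. Assumes a
# dirty-buffer list of (filename, block_no, data) tuples in arbitrary order.
import os
import tempfile
from collections import defaultdict

BLOCK_SIZE = 8192  # PostgreSQL's default block size

def checkpoint(dirty_buffers, data_dir):
    """Write all dirty buffers, one file at a time, one fsync per file."""
    by_file = defaultdict(list)
    for fname, block_no, data in dirty_buffers:
        by_file[fname].append((block_no, data))

    for fname in sorted(by_file):               # strategy #1: finish file A before file B
        path = os.path.join(data_dir, fname)
        mode = "r+b" if os.path.exists(path) else "w+b"
        with open(path, mode) as f:
            for block_no, data in sorted(by_file[fname]):  # strategy #2: ascending block order
                f.seek(block_no * BLOCK_SIZE)
                f.write(data)
            os.fsync(f.fileno())                # exactly one fsync per file per checkpoint

if __name__ == "__main__":
    d = tempfile.mkdtemp()
    bufs = [("t1", 3, b"c" * BLOCK_SIZE),
            ("t2", 0, b"x" * BLOCK_SIZE),
            ("t1", 1, b"a" * BLOCK_SIZE)]
    checkpoint(bufs, d)
    # t1's highest written block is 3, so the file spans 4 blocks
    assert os.path.getsize(os.path.join(d, "t1")) == 4 * BLOCK_SIZE
```

Because each file is visited once and fsync'd once, the fsyncs can then be spread out over the checkpoint interval without risk of having to revisit a file.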

If the OS/FS is behaving such that it is important to spread out
fsyncs, then wouldn't that same behavior also make it less important
to avoid fsyncing the same file a second time?

If the OS is squirreling away a preposterous amount of dirty buffers
in its cache, and then panicking to dump them all when it gets the
fsync, then I think you would need to spread out the fsyncs within a
file, and not just between files.

> We've gotten some pretty
> specific reports of problems in this area recently, so it seems likely
> that there is some value to be had there.  On the other hand, #2 is
> only a win if sorting the blocks in numerical order causes the OS to
> write them in a better order than it would otherwise have done.

Assuming the ordering is useful, the only way the OS can do as good a
job as the checkpoint code is if the OS holds an entire checkpoint's
worth of data as dirty blocks and doesn't start writing until an fsync
comes in.  That strikes me as a pathologically configured OS/FS (and
would also explain the problems with fsyncs).

> We've
> had recent reports that our block-at-a-time relation extension policy
> is leading to severe fragmentation on certain filesystems, so I'm a
> bit skeptical about the value of this (though, of course, that can be
> overturned if we can collect meaningful evidence).

Some filesystems handle that better than others.  It would probably
depend on the exact workload, and pgbench would probably favor large
contiguous extents to an unrealistic degree, so I don't know the best
way to gather that evidence.



Cheers,

Jeff

