Thread: sorted writes for checkpoints

sorted writes for checkpoints

From: Jeff Janes
One of the items on the Wiki ToDo list is sorted writes for
checkpoints.  The consensus seemed to be that this should be done by
adding hook(s) into the main code, and then a contrib module to work
with those hooks.  Is there an existing contrib module that one could
best look to for inspiration on how to go about doing this?  I have
the sorted checkpoint working under a guc, but don't know where to
start on converting it to a contrib module instead.

Cheers,

Jeff


Re: sorted writes for checkpoints

From: Alvaro Herrera
Excerpts from Jeff Janes's message of Fri Oct 29 00:00:24 -0300 2010:
> One of the items on the Wiki ToDo list is sorted writes for
> checkpoints.  The consensus seemed to be that this should be done by
> adding hook(s) into the main code, and then a contrib module to work
> with those hooks.  Is there an existing contrib module that one could
> best look to for inspiration on how to go about doing this?  I have
> the sorted checkpoint working under a guc, but don't know where to
> start on converting it to a contrib module instead.

Hmm, see contrib/auto_explain?
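
The pattern there is pretty simple: install your function into a hook
variable from _PG_init() and chain to whatever was there before.  A
minimal sketch (checkpoint_sort_hook and checkpoint_sort_hook_type are
hypothetical here, since no such hook exists in core yet):

/*
 * Sketch of the contrib hook pattern, modeled on contrib/auto_explain:
 * save the previous hook in _PG_init and chain to it.  The hook itself
 * is hypothetical.
 */
#include "postgres.h"
#include "fmgr.h"
#include "storage/buf_internals.h"  /* for BufferTag */

PG_MODULE_MAGIC;

void        _PG_init(void);

static checkpoint_sort_hook_type prev_checkpoint_sort_hook = NULL;

/* reorder the checkpoint's dirty-buffer list, then chain */
static void
my_checkpoint_sort(BufferTag *tags, int ntags)
{
    /* ... sort tags here ... */

    if (prev_checkpoint_sort_hook)
        prev_checkpoint_sort_hook(tags, ntags);
}

void
_PG_init(void)
{
    prev_checkpoint_sort_hook = checkpoint_sort_hook;
    checkpoint_sort_hook = my_checkpoint_sort;
}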

-- 
Álvaro Herrera <alvherre@commandprompt.com>
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support


Re: sorted writes for checkpoints

From: Heikki Linnakangas
On 29.10.2010 06:00, Jeff Janes wrote:
> One of the items on the Wiki ToDo list is sorted writes for
> checkpoints.  The consensus seemed to be that this should be done by
> adding hook(s) into the main code, and then a contrib module to work
> with those hooks.  Is there an existing contrib module that one could
> best look to for inspiration on how to go about doing this?  I have
> the sorted checkpoint working under a guc, but don't know where to
> start on converting it to a contrib module instead.

I don't think it's a good idea to have this as a hook. Bgwriter 
shouldn't need to load external code, and checkpoint robustness 
shouldn't depend on user-written code. IIRC Tom Lane didn't even like 
pallocing the memory for the list of dirty pages at checkpoint time, 
because that might cause an out-of-memory error. Calling a function in 
a contrib module is much, much worse.

Simon's argument in the thread that the todo item points to 
(http://archives.postgresql.org/pgsql-patches/2008-07/msg00123.php) is 
basically that we don't know what the best algorithm is yet and 
benchmarking is a lot of work, so let's just let people do whatever they 
feel like until we settle on the best approach. I think we need to bite 
the bullet and do some benchmarking, and commit one carefully vetted 
patch to the backend.

-- 
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com


Re: sorted writes for checkpoints

From: Itagaki Takahiro
On Fri, Oct 29, 2010 at 3:23 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
> Simon's argument in the thread that the todo item points to
> (http://archives.postgresql.org/pgsql-patches/2008-07/msg00123.php) is
> basically that we don't know what the best algorithm is yet and benchmarking
> is a lot of work, so let's just let people do whatever they feel like until
> we settle on the best approach. I think we need to bite the bullet and do
> some benchmarking, and commit one carefully vetted patch to the backend.

When I submitted the patch, I tested it on a disk-based RAID-5 machine:
http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php
But there were no additional benchmarking reports at that time. We still
need benchmarking before we re-examine the feature. For example, SSDs and
SSD RAID were not popular at that time, but now they are worth considering.

I think patching the core directly is enough for the first round of
testing, and we can decide the interface according to the results. If one
algorithm wins in all cases, we could just include it in the core, and
then extensibility would not be needed.

-- 
Itagaki Takahiro


Re: sorted writes for checkpoints

From: Robert Haas
On Fri, Oct 29, 2010 at 2:58 AM, Itagaki Takahiro
<itagaki.takahiro@gmail.com> wrote:
> On Fri, Oct 29, 2010 at 3:23 PM, Heikki Linnakangas
> <heikki.linnakangas@enterprisedb.com> wrote:
>> Simon's argument in the thread that the todo item points to
>> (http://archives.postgresql.org/pgsql-patches/2008-07/msg00123.php) is
>> basically that we don't know what the best algorithm is yet and benchmarking
>> is a lot of work, so let's just let people do whatever they feel like until
>> we settle on the best approach. I think we need to bite the bullet and do
>> some benchmarking, and commit one carefully vetted patch to the backend.
>
> When I submitted the patch, I tested it on a disk-based RAID-5 machine:
> http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php
> But there were no additional benchmarking reports at that time. We still
> need benchmarking before we re-examine the feature. For example, SSDs and
> SSD RAID were not popular at that time, but now they are worth considering.

There are really two separate things here:

(1) trying to do all the writes to file A before you start doing
writes to file B, and
(2) trying to write out blocks to each file in ascending logical block
number order

I'm much more convinced of the value of #1 than I am of the value of
#2.  If we do #1, we can then spread out the checkpoint fsyncs in a
meaningful way without fearing that we'll need to fsync the same file
a second time for the same checkpoint.  We've gotten some pretty
specific reports of problems in this area recently, so it seems likely
that there is some value to be had there.  On the other hand, #2 is
only a win if sorting the blocks in numerical order causes the OS to
write them in a better order than it would otherwise have done.  We've
had recent reports that our block-at-a-time relation extension policy
is leading to severe fragmentation on certain filesystems, so I'm a
bit skeptical about the value of this (though, of course, that can be
overturned if we can collect meaningful evidence).
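
For concreteness, a single sort with a two-level comparator would give
both orderings at once; the struct and comparator below are just an
illustration I'm assuming for discussion, not anything in the tree:

/* sort key for one dirty buffer; CkptSortItem is an assumed struct */
typedef struct CkptSortItem
{
    RelFileNode rnode;          /* which relation file */
    ForkNumber  forkNum;
    BlockNumber blockNum;
    int         buf_id;
} CkptSortItem;

static int
ckpt_buforder_cmp(const void *a, const void *b)
{
    const CkptSortItem *x = (const CkptSortItem *) a;
    const CkptSortItem *y = (const CkptSortItem *) b;

    /* level one, giving #1: keep all writes for one file together */
    if (x->rnode.spcNode != y->rnode.spcNode)
        return (x->rnode.spcNode < y->rnode.spcNode) ? -1 : 1;
    if (x->rnode.dbNode != y->rnode.dbNode)
        return (x->rnode.dbNode < y->rnode.dbNode) ? -1 : 1;
    if (x->rnode.relNode != y->rnode.relNode)
        return (x->rnode.relNode < y->rnode.relNode) ? -1 : 1;
    if (x->forkNum != y->forkNum)
        return (x->forkNum < y->forkNum) ? -1 : 1;

    /* level two, giving #2: ascending block number within the file */
    if (x->blockNum != y->blockNum)
        return (x->blockNum < y->blockNum) ? -1 : 1;
    return 0;
}

Running the dirty-buffer list through qsort() with that comparator
before the write loop would implement both; dropping the blockNum test
would give #1 alone.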

> I think patching the core directly is enough for the first round of
> testing, and we can decide the interface according to the results. If one
> algorithm wins in all cases, we could just include it in the core, and
> then extensibility would not be needed.

I agree with this, and with Heikki's remarks also.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: sorted writes for checkpoints

From: Tom Lane
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> Simon's argument in the thread that the todo item points to 
> (http://archives.postgresql.org/pgsql-patches/2008-07/msg00123.php) is 
> basically that we don't know what the best algorithm is yet and 
> benchmarking is a lot of work, so let's just let people do whatever they 
> feel like until we settle on the best approach. I think we need to bite 
> the bullet and do some benchmarking, and commit one carefully vetted 
> patch to the backend.

Yeah, I tend to agree.  We've used hooks in the past to allow people to
add on non-critical functionality.  Fooling with the behavior of
checkpoints is far from noncritical.  Furthermore, it's really hard to
see what a sane hook API would even look like.  As Robert comments,
part of any win here would likely come from controlling the timing of
fsyncs, not just writes.  Controlling all that at arm's length from
the code that actually does it seems likely to be messy and inefficient.

Another point is that I don't see any groundswell of demand out there
for custom checkpoint algorithms.  If we did manage to create a hook
API, how likely is it there would ever be more than one plugin?
        regards, tom lane


Re: sorted writes for checkpoints

From: Greg Smith
Itagaki Takahiro wrote:
> When I submitted the patch, I tested it on a disk-based RAID-5 machine:
> http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php
> But there were no additional benchmarking reports at that time. We still
> need benchmarking before we re-examine the feature. For example, SSDs and
> SSD RAID were not popular at that time, but now they are worth considering.
>   

I did multiple rounds of benchmarking of that patch; none of it showed 
any improvement, so I didn't bother reporting the results in detail.  I 
have recently figured out why the performance testing I did of that 
earlier patch probably failed to produce useful results on my system 
back then.  It relates to trivia around how ext3 handles fsync that is 
well understood now (the whole cache flushes out when one fsync comes 
in), but wasn't yet at the time.

We have a working set of patches here that both rewrite the checkpoint 
logic, to avoid several larger problems with how it works now, and add 
instrumentation that makes it possible to directly measure and graph 
whether methods such as sorting writes provide any improvement to the 
process.  My hope is to have those all ready for initial submission as 
part of CommitFest 2010-11, as the main feature addition from myself 
toward improving 9.1.

I have a bunch of background information about this that I'm presenting 
at PGWest next week, after which I'll start populating the wiki with 
more details and begin packaging the code too.  I had hoped to revisit 
the checkpoint sorting details after that.  You or Jeff are welcome to 
try your own tests in that area; I could use the help.  But I think my 
measurement patches will help you with that considerably once I release 
them in another couple of weeks.  Seeing a graph of sync latencies for 
each file is very informative for figuring out whether a change did 
something useful, more so than just staring at total TPS results.  Such 
latency graphs are what I've recently started to produce here, with some 
server-side changes that then feed into gnuplot.
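
If you want to try something similar before my patches land, the core
of it is just timing each sync call with the instr_time macros; a rough
sketch (the log format is illustrative, not what my patches emit):

/* time one fsync and log it, using portability/instr_time.h */
instr_time  start,
            elapsed;

INSTR_TIME_SET_CURRENT(start);
pg_fsync(fd);
INSTR_TIME_SET_CURRENT(elapsed);
INSTR_TIME_SUBTRACT(elapsed, start);
elog(DEBUG1, "checkpoint sync: file \"%s\" time=%.3f msec",
     path, INSTR_TIME_GET_MILLISEC(elapsed));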

The idea of making something like the sorting logic into a pluggable 
hook seems like a waste of time to me, particularly given that the 
earlier implementation really needed to be allocated a dedicated block 
of shared memory to work well IMHO (and I believe that's still the 
case).  That area isn't where the real problems are here anyway, 
especially on large memory systems.  How the sync logic works is the 
increasingly troublesome part of the checkpoint code, because the 
problem it has to deal with grows in proportion to the size of the 
write cache on the system.  Typical production servers I deal with have 
about 8X as much RAM now as they did in 2007, when I last investigated 
write sorting.  Regular hard drives sure haven't gotten 8X faster since 
then, and battery-backed caches (which used to have enough memory to 
absorb a large portion of a checkpoint burst) have at best doubled in 
size.

-- 
Greg Smith, 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services and Support  www.2ndQuadrant.us


Re: sorted writes for checkpoints

From: Jeff Janes
On Fri, Oct 29, 2010 at 6:17 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Oct 29, 2010 at 2:58 AM, Itagaki Takahiro
> <itagaki.takahiro@gmail.com> wrote:
>> On Fri, Oct 29, 2010 at 3:23 PM, Heikki Linnakangas
>> <heikki.linnakangas@enterprisedb.com> wrote:
>>> Simon's argument in the thread that the todo item points to
>>> (http://archives.postgresql.org/pgsql-patches/2008-07/msg00123.php) is
>>> basically that we don't know what the best algorithm is yet and benchmarking
>>> is a lot of work, so let's just let people do whatever they feel like until
>>> we settle on the best approach. I think we need to bite the bullet and do
>>> some benchmarking, and commit one carefully vetted patch to the backend.
>>
>> When I submitted the patch, I tested it on a disk-based RAID-5 machine:
>> http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php
>> But there were no additional benchmarking reports at that time. We still
>> need benchmarking before we re-examine the feature. For example, SSDs and
>> SSD RAID were not popular at that time, but now they are worth considering.
>
> There are really two separate things here:
>
> (1) trying to do all the writes to file A before you start doing
> writes to file B, and
> (2) trying to write out blocks to each file in ascending logical block
> number order
>
> I'm much more convinced of the value of #1 than I am of the value of
> #2.  If we do #1, we can then spread out the checkpoint fsyncs in a
> meaningful way without fearing that we'll need to fsync the same file
> a second time for the same checkpoint.

If the OS/FS is behaving such that it is important to spread out
fsyncs, then wouldn't that same behavior also make it less important
to avoid fsyncing the same file a second time?

If the OS is squirreling away a preposterous number of dirty buffers
in its cache, and then panicking to dump them all when it gets the
fsync, then I think you would need to spread out the fsyncs within a
file, and not just between files.

> We've gotten some pretty
> specific reports of problems in this area recently, so it seems likely
> that there is some value to be had there.  On the other hand, #2 is
> only a win if sorting the blocks in numerical order causes the OS to
> write them in a better order than it would otherwise have done.

Assuming the ordering is useful, the only way the OS can do as good a
job as the checkpoint code can is if the OS stores the entire
checkpoint's worth of data as dirty blocks and doesn't start writing
until an fsync comes in.  This strikes me as a pathologically
configured OS/FS.  (And it would explain the problems with fsyncs.)

> We've
> had recent reports that our block-at-a-time relation extension policy
> is leading to severe fragmentation on certain filesystems, so I'm a
> bit skeptical about the value of this (though, of course, that can be
> overturned if we can collect meaningful evidence).

Some filesystems are better about that than others.  It would probably
depend on the exact workload, and pgbench would probably favor large
contiguous extents to an unrealistic degree.  So I don't know the best
way to gather that evidence.



Cheers,

Jeff


Re: sorted writes for checkpoints

From: Robert Haas
On Sat, Nov 6, 2010 at 7:25 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>> There are really two separate things here:
>>
>> (1) trying to do all the writes to file A before you start doing
>> writes to file B, and
>> (2) trying to write out blocks to each file in ascending logical block
>> number order
>>
>> I'm much more convinced of the value of #1 than I am of the value of
>> #2.  If we do #1, we can then spread out the checkpoint fsyncs in a
>> meaningful way without fearing that we'll need to fsync the same file
>> a second time for the same checkpoint.
>
> If the OS/FS is behaving such that it is important to spread out
> fsyncs, then wouldn't that same behavior also make it less important
> to avoid fsyncing the same file a second time?
>
> If the OS is squirreling away a preposterous number of dirty buffers
> in its cache, and then panicking to dump them all when it gets the
> fsync, then I think you would need to spread out the fsyncs within a
> file, and not just between files.

Well, presumably, there's some amount of dirty data that the OS can
write out in one shot without causing a perceptible stall (otherwise
we're hosed no matter what).  The question is where that threshold is.
With one fsync per file, we'll try to write at most 1 GB at once, and
often less.  As you say, there's a possibility that that's still too
much, but right now we could be trying to dump 30+ GB to disk if the
OS has a lot of dirty pages in the buffer cache, so getting down to 1
GB or less at a time should be a big improvement even if it doesn't
solve the problem completely.

>> We've gotten some pretty
>> specific reports of problems in this area recently, so it seems likely
>> that there is some value to be had there.  On the other hand, #2 is
>> only a win if sorting the blocks in numerical order causes the OS to
>> write them in a better order than it would otherwise have done.
>
> Assuming the ordering is useful, the only way the OS can do as good a
> job as the checkpoint code can is if the OS stores the entire
> checkpoint's worth of data as dirty blocks and doesn't start writing
> until an fsync comes in.  This strikes me as a pathologically
> configured OS/FS.  (And it would explain the problems with fsyncs.)

The OS would only need to store and reorder one file's worth of
blocks, if we wrote the data for one file and called fsync, wrote the
data for another file and called fsync, etc.
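
In pseudo-C, with the helper names just illustrative (WriteDirtyBuffer
and SyncRelationFile don't exist), and assuming the dirty list is
already sorted so each file's blocks are contiguous:

/* write each file completely, then sync it, before the next file */
int         i = 0;

while (i < num_dirty)
{
    RelFileNode cur = dirty[i].rnode;

    /* write out every dirty block belonging to this one file */
    while (i < num_dirty && RelFileNodeEquals(dirty[i].rnode, cur))
        WriteDirtyBuffer(&dirty[i++]);

    /* sync it before starting the next file; a sleep could go here
     * to spread the fsyncs across the checkpoint interval */
    SyncRelationFile(cur);
}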

>> We've
>> had recent reports that our block-at-a-time relation extension policy
>> is leading to severe fragmentation on certain filesystems, so I'm a
>> bit skeptical about the value of this (though, of course, that can be
>> overturned if we can collect meaningful evidence).
>
> Some filesystems are better about that than others.  It would probably
> depend on the exact workload, and pgbench would probably favor large
> contiguous extents to an unrealistic degree.  So I don't know the best
> way to gather that evidence.

Well, the basic idea would be to try some different workloads on
different filesystems and try to get some feeling for how often
sorting by block number wins and how often it loses.  But as I say, I
think the biggest problem is that we're often trying to write too much
dirty data to disk at once.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: sorted writes for checkpoints

From: Greg Smith
Jeff Janes wrote:

> Assuming the ordering is useful, the only way the OS can do as good a
> job as the checkpoint code can is if the OS stores the entire
> checkpoint's worth of data as dirty blocks and doesn't start writing
> until an fsync comes in.  This strikes me as a pathologically
> configured OS/FS.  (And it would explain the problems with fsyncs.)
>   

This can be exactly the situation with ext3 on Linux, which I believe is 
one reason the write sorting patch didn't go anywhere last time it came 
up--that's certainly what I tested it on.  The slides for my talk 
"Righting Your Writes" are now up at 
http://projects.2ndquadrant.com/talks and an example showing this is on 
page 9.  I'm hoping to get the 3 patches shown in action or described in 
that talk submitted to the list before the next CommitFest.  You really 
need timing of individual sync calls to figure out what's going on here, 
and what happens is completely dependent on the filesystem.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


Re: sorted writes for checkpoints

From: Jeff Janes
On Sun, Nov 7, 2010 at 4:13 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Jeff Janes wrote:
>
>> Assuming the ordering is useful, the only way the OS can do as good a
>> job as the checkpoint code can is if the OS stores the entire
>> checkpoint's worth of data as dirty blocks and doesn't start writing
>> until an fsync comes in.  This strikes me as a pathologically
>> configured OS/FS.  (And it would explain the problems with fsyncs.)
>>
>
> This can be exactly the situation with ext3 on Linux, which I believe is one
> reason the write sorting patch didn't go anywhere last time it came
> up--that's certainly what I tested it on.

Interesting.  I think the default mount options for ext3 cause a
journal commit at least every 5 seconds, which should also flush out
dirty OS buffers, preventing them from building up to such an extent.
Is that default being changed here, or does it simply not work the way
I think it does?

Thanks for the link to the slides.

Cheers,

Jeff