Thread: Controlling Load Distributed Checkpoints
I'm again looking at the way the GUC variables work in the load distributed checkpoints patch. We've discussed them a lot already, but I don't think they're quite right yet.

Write phase
-----------

I like the way the write phase is controlled in general. Writes are throttled so that we spend the specified percentage of the checkpoint interval doing the writes. But we always write at a specified minimum rate to avoid spreading out the writes unnecessarily when there's little work to do.

The original patch uses bgwriter_all_max_pages to set the minimum rate. I think we should have a separate variable, checkpoint_write_min_rate, in KB/s, instead.

Nap phase
---------

This is trickier. The purpose of the sleep between writes and fsyncs is to give the OS a chance to flush the pages to disk at its own pace, hopefully limiting the effect on concurrent activity. The sleep shouldn't last too long, because any concurrent activity can be dirtying and writing more pages, and we might end up fsyncing more than necessary, which is bad for performance. The optimal delay depends on many factors, but I believe it's somewhere between 0-30 seconds on any reasonable system.

In the current patch, the duration of the sleep between the write and sync phases is controlled as a percentage of the checkpoint interval. Given that the optimal delay is in the range of seconds, and checkpoint_timeout can be up to 60 minutes, the useful values of that percentage would be very small, like 0.5% or even less. Furthermore, the optimal value doesn't depend that much on the checkpoint interval; it depends more on your OS and memory configuration. We should therefore give the delay as a number of seconds instead of as a percentage of the checkpoint interval.

Sync phase
----------

This is also tricky. As with the nap phase, we don't want to spend too much time fsyncing, because concurrent activity will write more dirty pages and we might just end up doing more work. And we don't know how much work an fsync performs. The patch uses the file size as a measure of that, but as we discussed, that doesn't necessarily have anything to do with reality: fsyncing a 1GB file with one dirty block isn't any more expensive than fsyncing a file with a single block.

Another problem is the granularity of an fsync. If we fsync a 1GB file that's full of dirty pages, we can't limit the effect on other activity. The best we can do is to sleep between fsyncs, but sleeping more than a few seconds is hardly going to be useful, no matter how bad an I/O storm each fsync causes.

Because of the above, I'm thinking we should ditch the checkpoint_sync_percentage variable, in favor of:

checkpoint_fsync_period   # duration of the fsync phase, in seconds
checkpoint_fsync_delay    # max. sleep between fsyncs, in milliseconds

In all phases, the normal bgwriter activities are performed: lru-cleaning and switching xlog segments if archive_timeout expires. If a new checkpoint request arrives while the previous one is still in progress, we skip all the delays and finish the previous checkpoint as soon as possible.

GUC summary and suggested default values
----------------------------------------

checkpoint_write_percent = 50     # % of checkpoint interval to spread out writes
checkpoint_write_min_rate = 1000  # minimum I/O rate to write dirty buffers at checkpoint (KB/s)
checkpoint_nap_duration = 2       # delay between write and sync phase, in seconds
checkpoint_fsync_period = 30      # duration of the sync phase, in seconds
checkpoint_fsync_delay = 500      # max. delay between fsyncs, in milliseconds

I don't like adding that many GUC variables, but I don't really see a way to tune them automatically. Maybe we could just hard-code the last one, it doesn't seem that critical, but that still leaves us 4 variables.

Thoughts?

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
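[As a rough illustration of the write-phase throttling described above, here is a minimal sketch; the write_next_dirty_buffer() helper and the GUC accesses are made-up names, not identifiers from the actual patch. It picks the higher of the rate needed to finish within the write budget and checkpoint_write_min_rate, and sleeps between page writes to hold that rate.]

    /* Sketch only: write_next_dirty_buffer() and the GUC variables are
     * hypothetical names, not code from the patch. */
    static void
    checkpoint_write_phase(int dirty_pages, double interval_secs)
    {
        double  budget_secs = interval_secs * checkpoint_write_percent / 100.0;
        double  needed_kbs  = dirty_pages * (BLCKSZ / 1024.0) / budget_secs;
        double  rate_kbs    = Max(needed_kbs, checkpoint_write_min_rate);
        long    delay_usec  = (long) ((BLCKSZ / 1024.0) / rate_kbs * 1000000.0);
        int     i;

        for (i = 0; i < dirty_pages; i++)
        {
            write_next_dirty_buffer();  /* write one dirty page */
            pg_usleep(delay_usec);      /* throttle to the chosen I/O rate */
        }
    }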
"Heikki Linnakangas" <heikki@enterprisedb.com> writes: > GUC summary and suggested default values > ---------------------------------------- > checkpoint_write_percent = 50 # % of checkpoint interval to spread out writes > checkpoint_write_min_rate = 1000 # minimum I/O rate to write dirty > buffers at checkpoint (KB/s) I don't understand why this is a min_rate rather than a max_rate. > checkpoint_nap_duration = 2 # delay between write and sync phase, in seconds Not a comment on the choice of guc parameters, but don't we expect useful values of this to be much closer to 30 than 0? I understand it might not be exactly 30. Actually, it's not so much whether there's any write traffic to the data files during the nap that matters, it's whether there's more traffic during the nap than during the 30s or so prior to the nap. As long as it's a steady-state condition it shouldn't matter how long we wait, should it? > checkpoint_fsync_period = 30 # duration of the sync phase, in seconds > checkpoint_fsync_delay = 500 # max. delay between fsyncs -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes:

> GUC summary and suggested default values
> ----------------------------------------
> checkpoint_write_percent = 50     # % of checkpoint interval to spread out writes
> checkpoint_write_min_rate = 1000  # minimum I/O rate to write dirty buffers at checkpoint (KB/s)
> checkpoint_nap_duration = 2       # delay between write and sync phase, in seconds
> checkpoint_fsync_period = 30      # duration of the sync phase, in seconds
> checkpoint_fsync_delay = 500      # max. delay between fsyncs

> I don't like adding that many GUC variables, but I don't really see a
> way to tune them automatically.

If we don't know how to tune them, how will the users know? Having to add that many variables to control one feature says to me that we don't understand the feature. Perhaps what we need is to think about how it can auto-tune itself.

			regards, tom lane
On Wed, 6 Jun 2007, Tom Lane wrote:

> If we don't know how to tune them, how will the users know?

I can tell you a good starting set for them to use on a Linux system, but you first have to let me know how much memory is in the OS buffer cache, the typical I/O rate the disks can support, how many buffers are expected to be written out by BGW/other backends at heaviest load, and the current setting for /proc/sys/vm/dirty_background_ratio. It's not a coincidence that there are patches applied to 8.3 or in the queue to measure all of the Postgres internals involved in that computation; I've been picking away at the edges of this problem. Getting this sort of tuning right takes that level of information about the underlying system.

If there's a way to internally auto-tune the values this patch operates on (which I haven't found despite months of trying), it would be in the form of some sort of measurement/feedback loop based on how fast data is being written out. There really are way too many things involved to try and tune it based on anything else; the underlying OS/hardware mechanisms that determine how this will go are complicated enough that it might as well be a black box for most people.

One of the things I've been fiddling with the design of is a testing program that simulates database activity at checkpoint time under load. I think running some tests like that is the most straightforward way to generate useful values for these tunables; it's much harder to try and determine them from within the backends because there's so much going on to keep track of.

I view the LDC mechanism as being in the same state right now as the background writer: there are a lot of complicated knobs to tweak, they all do *something* useful for someone, and eliminating them will require a data-collection process across a much wider sample of data than can be collected quickly. If I had to guess how this will end up, I'd expect there to be more knobs in LDC than everyone would like for the 8.3 release, along with fairly verbose logging of what is happening at checkpoint time (that's why I've been nudging development in that area, along with making logs easier to aggregate). Collect up enough of that information, and then you're in a position to talk about useful automatic tuning--right around the 8.4 timeframe, I suspect.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Wed, 6 Jun 2007, Heikki Linnakangas wrote:

> The original patch uses bgwriter_all_max_pages to set the minimum rate. I
> think we should have a separate variable, checkpoint_write_min_rate, in KB/s,
> instead.

Completely agreed. There shouldn't be any coupling with the background writer parameters, which may be set for a completely different set of priorities than the checkpoint has. I have to look at this code again to see why it's a min_rate instead of a max; that seems a little weird.

> Nap phase: We should therefore give the delay as a number of seconds
> instead of as a percentage of checkpoint interval.

Again, the setting here should be completely decoupled from another GUC like the interval. My main complaint with the original form of this patch was how much it tried to synchronize the process with the interval; since I don't even have a system where that value is set to something, because it's all segment based instead, that whole idea was incompatible. The original patch tried to spread the load out as evenly as possible over the time available. I much prefer thinking in terms of getting it done as quickly as possible while trying to bound the I/O storm.

> And we don't know how much work an fsync performs. The patch uses the file
> size as a measure of that, but as we discussed that doesn't necessarily have
> anything to do with reality. fsyncing a 1GB file with one dirty block isn't
> any more expensive than fsyncing a file with a single block.

On top of that, if you have a system with a write cache, the time an fsync takes can greatly depend on how full it is at the time, which there is no way to measure or even model easily.

Is there any way to track how many dirty blocks went into each file during the checkpoint write? That's your best bet for guessing how long the fsync will take.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
Greg Smith wrote:
> On Wed, 6 Jun 2007, Heikki Linnakangas wrote:
>
>> The original patch uses bgwriter_all_max_pages to set the minimum
>> rate. I think we should have a separate variable,
>> checkpoint_write_min_rate, in KB/s, instead.
>
> Completely agreed. There shouldn't be any coupling with the background
> writer parameters, which may be set for a completely different set of
> priorities than the checkpoint has. I have to look at this code again
> to see why it's a min_rate instead of a max; that seems a little weird.

It's a min rate because it never writes slower than that, and it can write faster if the next checkpoint is due so soon that we wouldn't otherwise finish before it's time to start the next one. (Or to be precise, before the next checkpoint is closer than 100-(checkpoint_write_percent)% of the checkpoint interval.)

>> Nap phase: We should therefore give the delay as a number of seconds
>> instead of as a percentage of checkpoint interval.
>
> Again, the setting here should be completely decoupled from another GUC
> like the interval. My main complaint with the original form of this
> patch was how much it tried to synchronize the process with the interval;
> since I don't even have a system where that value is set to something,
> because it's all segment based instead, that whole idea was incompatible.

checkpoint_segments is taken into account as well as checkpoint_timeout. I used the term "checkpoint interval" to mean the real interval at which the checkpoints occur, whether it's because of segments or timeout.

> The original patch tried to spread the load out as evenly as possible
> over the time available. I much prefer thinking in terms of getting it
> done as quickly as possible while trying to bound the I/O storm.

Yeah, the checkpoint_min_rate allows you to do that. So there's two extreme ways you can use LDC:

1. Finish the checkpoint as soon as possible, without disturbing other activity too much. Set checkpoint_write_percent to a high number, and set checkpoint_min_rate to define "too much".

2. Disturb other activity as little as possible, as long as the checkpoint finishes in a reasonable time. Set checkpoint_min_rate to a low number, and checkpoint_write_percent to define "reasonable time".

Are both interesting use cases, or is it enough to cater for just one of them? I think 2 is easier to tune. Defining the min_rate properly can be difficult and depends a lot on your hardware and application, but a default value of say 50% for checkpoint_write_percent to tune for use case 2 should work pretty well for most people.

In any case, the checkpoint had better finish before it's time to start another one. Or would you rather delay the next checkpoint, and let the checkpoint take as long as it takes to finish at the min_rate?

>> And we don't know how much work an fsync performs. The patch uses the
>> file size as a measure of that, but as we discussed that doesn't
>> necessarily have anything to do with reality. fsyncing a 1GB file with
>> one dirty block isn't any more expensive than fsyncing a file with a
>> single block.
>
> On top of that, if you have a system with a write cache, the time an
> fsync takes can greatly depend on how full it is at the time, which
> there is no way to measure or even model easily.
>
> Is there any way to track how many dirty blocks went into each file
> during the checkpoint write? That's your best bet for guessing how long
> the fsync will take.

I suppose it's possible, but the OS has hopefully started flushing them to disk almost as soon as we started the writes, so even that isn't a very good measure.

On a Linux system, one way to model it is that the OS flushes dirty buffers to disk at the same rate as we write them, but delayed by dirty_expire_centisecs. That should hold if the writes are spread out enough. Then the amount of dirty buffers in the OS cache at the end of the write phase is roughly constant, as long as the write phase lasts longer than dirty_expire_centisecs. If we take a nap of dirty_expire_centisecs after the write phase, the fsyncs should be effectively no-ops, except that they will flush any other writes the bgwriter lru-sweep and other backends performed during the nap.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On a fine day, Wed, 2007-06-06 at 11:03, Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
> > GUC summary and suggested default values
> > ----------------------------------------
> > checkpoint_write_percent = 50     # % of checkpoint interval to spread out writes
> > checkpoint_write_min_rate = 1000  # minimum I/O rate to write dirty buffers at checkpoint (KB/s)
> > checkpoint_nap_duration = 2       # delay between write and sync phase, in seconds
> > checkpoint_fsync_period = 30      # duration of the sync phase, in seconds
> > checkpoint_fsync_delay = 500      # max. delay between fsyncs
>
> > I don't like adding that many GUC variables, but I don't really see a
> > way to tune them automatically.
>
> If we don't know how to tune them, how will the users know?

He talked about doing it _automatically_. If the knobs are available, it will be possible to determine "good" values even by brute-force performance testing, given enough time and manpower.

> Having to
> add that many variables to control one feature says to me that we don't
> understand the feature.

The feature has lots of complex dependencies on things outside postgres, so learning to understand it takes time. Having the knobs available helps, as more people are willing to do turn-the-knobs-and-test vs. recompile-and-test.

> Perhaps what we need is to think about how it can auto-tune itself.

Sure.

-------------------
Hannu Krosing
Thinking about this whole idea a bit more, it occurred to me that the current approach of write all, then fsync all, is really a historical artifact of the fact that we used to use the system-wide sync call instead of fsyncs to flush the pages to disk. That might not be the best way to do things in the new load-distributed-checkpoint world.

How about interleaving the writes with the fsyncs?

1. Scan all shared buffers, and build a list of all files with dirty pages, and the buffers belonging to them

2. foreach(file in list)
   {
       foreach(buffer belonging to file)
       {
           write();
           sleep();    /* to throttle the I/O rate */
       }
       sleep();        /* to give the OS a chance to flush the writes at its own pace */
       fsync();
   }

This would spread out the fsyncs in a natural way, making the knob to control the duration of the sync phase unnecessary.

At some point we'll also need to fsync all files that have been modified since the last checkpoint, but don't have any dirty buffers in the buffer cache. I think it's a reasonable assumption that fsyncing those files doesn't generate a lot of I/O. Since the writes were made some time ago, the OS has likely already flushed them to disk.

Doing the 1st phase of just scanning the buffers to see which ones are dirty also effectively implements the optimization of not writing buffers that were dirtied after the checkpoint start. And grouping the writes per file gives the OS a better chance to group the physical writes.

One problem is that currently the segmentation of relations into 1GB files is handled at a low level inside md.c, and we don't really have any visibility into that in the buffer manager. ISTM that some changes to the smgr interfaces would be needed for this to work well, though just doing it on a relation per relation basis would also be better than the current approach.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
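[To make the per-file grouping above a little more concrete, here is a sketch of the loop in C; the FileWrites structure, the write_buffer() helper, and the delay variables are all assumed for illustration and are not taken from any posted patch.]

    /* Sketch only: one entry per file that has dirty buffers. */
    typedef struct
    {
        int     fd;         /* open descriptor for this 1GB segment */
        int     nbuffers;   /* number of dirty buffers in this file */
        int    *buf_ids;    /* the dirty buffers themselves */
    } FileWrites;

    static void
    checkpoint_interleaved(FileWrites *files, int nfiles,
                           long write_delay_usec, long nap_usec)
    {
        int     f, i;

        for (f = 0; f < nfiles; f++)
        {
            for (i = 0; i < files[f].nbuffers; i++)
            {
                write_buffer(files[f].buf_ids[i]);  /* hypothetical helper */
                pg_usleep(write_delay_usec);        /* throttle the I/O rate */
            }
            pg_usleep(nap_usec);    /* let the OS flush at its own pace */
            fsync(files[f].fd);     /* sync just this file before moving on */
        }
    }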
Heikki Linnakangas <heikki@enterprisedb.com> writes: > Thinking about this whole idea a bit more, it occured to me that the > current approach to write all, then fsync all is really a historical > artifact of the fact that we used to use the system-wide sync call > instead of fsyncs to flush the pages to disk. That might not be the best > way to do things in the new load-distributed-checkpoint world. > How about interleaving the writes with the fsyncs? I don't think it's a historical artifact at all: it's a valid reflection of the fact that we don't know enough about disk layout to do low-level I/O scheduling. Issuing more fsyncs than necessary will do little except guarantee a less-than-optimal scheduling of the writes. regards, tom lane
Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: >> Thinking about this whole idea a bit more, it occured to me that the >> current approach to write all, then fsync all is really a historical >> artifact of the fact that we used to use the system-wide sync call >> instead of fsyncs to flush the pages to disk. That might not be the best >> way to do things in the new load-distributed-checkpoint world. > >> How about interleaving the writes with the fsyncs? > > I don't think it's a historical artifact at all: it's a valid reflection > of the fact that we don't know enough about disk layout to do low-level > I/O scheduling. Issuing more fsyncs than necessary will do little > except guarantee a less-than-optimal scheduling of the writes. I'm not proposing to issue any more fsyncs. I'm proposing to change the ordering so that instead of first writing all dirty buffers and then fsyncing all files, we'd write all buffers belonging to a file, fsync that file only, then write all buffers belonging to next file, fsync, and so forth. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> writes: > Tom Lane wrote: >> I don't think it's a historical artifact at all: it's a valid reflection >> of the fact that we don't know enough about disk layout to do low-level >> I/O scheduling. Issuing more fsyncs than necessary will do little >> except guarantee a less-than-optimal scheduling of the writes. > I'm not proposing to issue any more fsyncs. I'm proposing to change the > ordering so that instead of first writing all dirty buffers and then > fsyncing all files, we'd write all buffers belonging to a file, fsync > that file only, then write all buffers belonging to next file, fsync, > and so forth. But that means that the I/O to different files cannot be overlapped by the kernel, even if it would be more efficient to do so. regards, tom lane
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> Tom Lane wrote:
>>> I don't think it's a historical artifact at all: it's a valid reflection
>>> of the fact that we don't know enough about disk layout to do low-level
>>> I/O scheduling. Issuing more fsyncs than necessary will do little
>>> except guarantee a less-than-optimal scheduling of the writes.
>
>> I'm not proposing to issue any more fsyncs. I'm proposing to change the
>> ordering so that instead of first writing all dirty buffers and then
>> fsyncing all files, we'd write all buffers belonging to a file, fsync
>> that file only, then write all buffers belonging to the next file, fsync,
>> and so forth.
>
> But that means that the I/O to different files cannot be overlapped by
> the kernel, even if it would be more efficient to do so.

True. On the other hand, if we issue writes in essentially random order, we might fill the kernel buffers with random blocks and the kernel needs to flush them to disk as almost random I/O. If we did the writes in groups, the kernel has a better chance at coalescing them.

I tend to agree that if the goal is to finish the checkpoint as quickly as possible, the current approach is better. In the context of load distributed checkpoints, however, it's unlikely the kernel can do any significant overlapping since we're trickling the writes anyway.

Do we need both strategies?

I'm starting to feel we should give up on smoothing the fsyncs and distribute the writes only, for 8.3. As we get more experience with that and its shortcomings, we can enhance our checkpoints further in 8.4.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Thu, 7 Jun 2007, Heikki Linnakangas wrote:

> So there's two extreme ways you can use LDC:
> 1. Finish the checkpoint as soon as possible, without disturbing other
> activity too much
> 2. Disturb other activity as little as possible, as long as the
> checkpoint finishes in a reasonable time.
> Are both interesting use cases, or is it enough to cater for just one of
> them? I think 2 is easier to tune.

The motivation for the (1) case is that you've got a system that's dirtying the buffer cache very fast in normal use, where even the background writer is hard pressed to keep the buffer pool clean. The checkpoint is the most powerful and efficient way to clean up many dirty buffers out of such a buffer cache in a short period of time so that you're back to having room to work in again. In that situation, since there are many buffers to write out, you'll also be suffering greatly from fsync pauses. Being able to synchronize writes a little better with the underlying OS to smooth those out is a huge help.

I'm completely biased because of the workloads I've been dealing with recently, but I consider (2) so much easier to tune for that it's barely worth worrying about. If your system is so underloaded that you can let the checkpoints take their own sweet time, I'd ask if you have enough going on that you're suffering very much from checkpoint performance issues anyway. I'm used to being in a situation where if you don't push out checkpoint data as fast as physically possible, you end up fighting with the client backends for write bandwidth once the LRU point moves past where the checkpoint has written out to already. I'm not sure how much always running the LRU background writer will improve that situation.

> On a Linux system, one way to model it is that the OS flushes dirty buffers
> to disk at the same rate as we write them, but delayed by
> dirty_expire_centisecs. That should hold if the writes are spread out enough.

If they're really spread out, sure. There is congestion avoidance code inside the Linux kernel that makes dirty_expire_centisecs not quite work the way it is described under load. All you can say in the general case is that when dirty_expire_centisecs has passed, the kernel badly wants to write the buffers out as quickly as possible; that could still be many seconds after the expiration time on a busy system, or on one with slow I/O.

On every system I've ever played with Postgres write performance on, I discovered that the memory-based parameters like dirty_background_ratio were really driving write behavior, and I almost ignore the expire timeout now. Plotting the "Dirty:" value in /proc/meminfo as you're running tests is extremely informative for figuring out what Linux is really doing underneath the database writes.

The influence of the congestion code is why I made the comment about watching how long writes are taking to gauge how fast you can dump data onto the disks. When you're suffering from one of the congestion mechanisms, the initial writes start blocking, even before the fsync. That behavior is almost undocumented outside of the relevant kernel source code.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
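[For anyone who wants to follow that /proc/meminfo suggestion, a trivial stand-alone monitor like the following is enough to log the Dirty: line once a second while a benchmark runs. This is my own sketch, not a tool posted in this thread; Linux only.]

    /* dirtymon.c -- print the kernel's Dirty: counter once per second. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        char    line[128];

        for (;;)
        {
            FILE   *f = fopen("/proc/meminfo", "r");

            if (f == NULL)
                return 1;
            while (fgets(line, sizeof(line), f) != NULL)
            {
                if (strncmp(line, "Dirty:", 6) == 0)
                    fputs(line, stdout);    /* e.g. "Dirty:   123456 kB" */
            }
            fclose(f);
            fflush(stdout);
            sleep(1);
        }
    }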
"Greg Smith" <gsmith@gregsmith.com> writes: > I'm completely biased because of the workloads I've been dealing with recently, > but I consider (2) so much easier to tune for that it's barely worth worrying > about. If your system is so underloaded that you can let the checkpoints take > their own sweet time, I'd ask if you have enough going on that you're suffering > very much from checkpoint performance issues anyway. I'm used to being in a > situation where if you don't push out checkpoint data as fast as physically > possible, you end up fighting with the client backends for write bandwidth once > the LRU point moves past where the checkpoint has written out to already. I'm > not sure how much always running the LRU background writer will improve that > situation. I think you're working from a faulty premise. There's no relationship between the volume of writes and how important the speed of checkpoint is. In either scenario you should assume a system that is close to the max i/o bandwidth. The only question is which task the admin would prefer take the hit for maxing out the bandwidth, the transactions or the checkpoint. You seem to have imagined that letting the checkpoint take longer will slow down transactions. In fact that's precisely the effect we're trying to avoid. Right now we're seeing tests where Postgres stops handling *any* transactions for up to a minute. In virtually any real world scenario that would simply be unacceptable. That one-minute outage is a direct consequence of trying to finish the checkpoint as quick as possible. If we spread it out then it might increase the average i/o load if you sum it up over time, but then you just need a faster i/o controller. The only scenario where you would prefer the absolute lowest i/o rate summed over time would be if you were close to maxing out your i/o bandwidth, couldn't buy a faster controller, and response time was not a factor, only sheer volume of transactions processed mattered. That's a much less common scenario than caring about the response time. The flip side of having to worry about response time buying a faster controller doesn't even help. It would shorten the duration of the checkpoint but not eliminate it. A 30-second outage every half hour is just as unacceptable as a 1-minute outage every half hour. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On Thu, 7 Jun 2007, Gregory Stark wrote:

> You seem to have imagined that letting the checkpoint take longer will slow
> down transactions.

And you seem to have imagined that I have so much spare time that I'm just making stuff up to entertain myself and sow confusion. I observed some situations where delaying checkpoints too long ends up slowing down both transaction rate and response time, using earlier variants of the LDC patch and code with similar principles I wrote. I'm trying to keep the approach used here out of the worst of the corner cases I ran into, or at least to make it possible for people in those situations to have some ability to tune out of the bad spots. I am unfortunately not free to disclose all those test results, and since that project is over I can't see how the current LDC compares to what I tested at the time.

I plainly stated I had a bias here, one that's not even close to the average case. My concern here was that Heikki would end up optimizing in a direction where a really wide spread across the active checkpoint interval was strongly preferred. I wanted to offer some suggestions on the type of situation where that might not be true, but where a different tuning of LDC would still be an improvement over the current behavior. There are some tuning knobs there that I don't want to see go away until there's been a wider range of tests to prove they aren't effective.

> Right now we're seeing tests where Postgres stops handling *any* transactions
> for up to a minute. In virtually any real world scenario that would simply be
> unacceptable.

No doubt; I've seen things get close to that bad myself, both on the high and low end. I collided with the issue in a situation of "maxing out your i/o bandwidth, couldn't buy a faster controller" at one point, which is what kicked off my working in this area. It turned out there were still some software tunables left that pulled the worst case down to the 2-5 second range instead. With more checkpoint_segments to decrease the frequency, that was just enough to make the problem annoying rather than crippling. But after that, I could easily imagine a different application scenario where the behavior you describe is the best case.

This is really a serious issue with the current design of the database, one that merely changes instead of going away completely if you throw more hardware at it. I'm perversely glad to hear this is torturing more people than just me, as it improves the odds the situation will improve.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
> This is really a serious issue with the current design of the database,
> one that merely changes instead of going away completely if you throw
> more hardware at it. I'm perversely glad to hear this is torturing more
> people than just me as it improves the odds the situation will improve.

It tortures pretty much any high velocity postgresql db, of which there are more and more every day.

Joshua D. Drake

--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/
All, This brings up another point. With the increased number of .conf options, the file is getting hard to read again. I'd like to do another reorganization, but I don't really want to break people's diff scripts. Should I worry about that? --Josh
Thread: Re: .conf File Organization WAS: Controlling Load Distributed Checkpoints
From: "Joshua D. Drake"
Josh Berkus wrote:
> All,
>
> This brings up another point. With the increased number of .conf
> options, the file is getting hard to read again. I'd like to do another
> reorganization, but I don't really want to break people's diff scripts.
> Should I worry about that?

As a point of feedback, autovacuum and vacuum should be together.

Joshua D. Drake

--
=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/
Josh Berkus <josh@agliodbs.com> writes: > This brings up another point. With the increased number of .conf > options, the file is getting hard to read again. I'd like to do another > reorganization, but I don't really want to break people's diff scripts. Do you have a better organizing principle than what's there now? regards, tom lane
Greg Smith wrote:
> On Thu, 7 Jun 2007, Heikki Linnakangas wrote:
>
>> So there's two extreme ways you can use LDC:
>> 1. Finish the checkpoint as soon as possible, without disturbing other
>> activity too much
>> 2. Disturb other activity as little as possible, as long as the
>> checkpoint finishes in a reasonable time.
>> Are both interesting use cases, or is it enough to cater for just one
>> of them? I think 2 is easier to tune.
>
> The motivation for the (1) case is that you've got a system that's
> dirtying the buffer cache very fast in normal use, where even the
> background writer is hard pressed to keep the buffer pool clean. The
> checkpoint is the most powerful and efficient way to clean up many dirty
> buffers out of such a buffer cache in a short period of time so that
> you're back to having room to work in again. In that situation, since
> there are many buffers to write out, you'll also be suffering greatly
> from fsync pauses. Being able to synchronize writes a little better
> with the underlying OS to smooth those out is a huge help.

ISTM the bgwriter just isn't working hard enough in that scenario. Assuming we get the lru autotuning patch in 8.3, do you think there's still merit in using the checkpoints that way?

> I'm completely biased because of the workloads I've been dealing with
> recently, but I consider (2) so much easier to tune for that it's barely
> worth worrying about. If your system is so underloaded that you can let
> the checkpoints take their own sweet time, I'd ask if you have enough
> going on that you're suffering very much from checkpoint performance
> issues anyway. I'm used to being in a situation where if you don't push
> out checkpoint data as fast as physically possible, you end up fighting
> with the client backends for write bandwidth once the LRU point moves
> past where the checkpoint has written out to already. I'm not sure how
> much always running the LRU background writer will improve that situation.

I'd think it eliminates the problem. Assuming we keep the LRU cleaning running as usual, I don't see how writing faster during checkpoints could ever be beneficial for concurrent activity. The more you write, the less bandwidth there's available for others. Doing the checkpoint as quickly as possible might be slightly better for average throughput, but that's a different matter.

> On every system I've ever played with Postgres write performance on, I
> discovered that the memory-based parameters like dirty_background_ratio
> were really driving write behavior, and I almost ignore the expire
> timeout now. Plotting the "Dirty:" value in /proc/meminfo as you're
> running tests is extremely informative for figuring out what Linux is
> really doing underneath the database writes.

Interesting. I haven't touched any of the kernel parameters yet in my tests. It seems we need to try different parameters and see how the dynamics change. But we must also keep in mind that the average DBA doesn't change any settings, and might not even be able or allowed to. That means the defaults should work reasonably well without tweaking the OS settings.

> The influence of the congestion code is why I made the comment about
> watching how long writes are taking to gauge how fast you can dump data
> onto the disks. When you're suffering from one of the congestion
> mechanisms, the initial writes start blocking, even before the fsync.
> That behavior is almost undocumented outside of the relevant kernel
> source code.

Yeah, that's controlled by dirty_ratio, if I've understood the parameters correctly. If we spread out the writes enough, we shouldn't hit that limit or congestion. That's the point of the patch.

Do you have time / resources to do testing? You've clearly spent a lot of time on this, and I'd be very interested to see some actual numbers from your tests with various settings.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Fri, Jun 08, 2007 at 09:50:49AM +0100, Heikki Linnakangas wrote: > dynamics change. But we must also keep in mind that average DBA doesn't > change any settings, and might not even be able or allowed to. That > means the defaults should work reasonably well without tweaking the OS > settings. Do you mean "change the OS settings" or something else? (I'm not sure it's true in any case, because shared memory kernel settings have to be fiddled with in many instances, but I thought I'd ask for clarification.) A -- Andrew Sullivan | ajs@crankycanuck.ca Users never remark, "Wow, this software may be buggy and hard to use, but at least there is a lot of code underneath." --Damien Katz
Andrew Sullivan wrote: > On Fri, Jun 08, 2007 at 09:50:49AM +0100, Heikki Linnakangas wrote: > >> dynamics change. But we must also keep in mind that average DBA doesn't >> change any settings, and might not even be able or allowed to. That >> means the defaults should work reasonably well without tweaking the OS >> settings. > > Do you mean "change the OS settings" or something else? (I'm not > sure it's true in any case, because shared memory kernel settings > have to be fiddled with in many instances, but I thought I'd ask for > clarification.) Yes, that's what I meant. An average DBA is not likely to change OS settings. You're right on the shmmax setting, though. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
On Fri, 8 Jun 2007, Andrew Sullivan wrote: > Do you mean "change the OS settings" or something else? (I'm not > sure it's true in any case, because shared memory kernel settings > have to be fiddled with in many instances, but I thought I'd ask for > clarification.) In a situation where a hosting provider of some sort is providing PostgreSQL, they should know that parameters like SHMMAX need to be increased before customers can create a larger installation. You'd expect they'd take care of that as part of routine server setup. What wouldn't be reasonable is to expect them to tune obscure parts of the kernel just for your application. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Fri, Jun 08, 2007 at 10:33:50AM -0400, Greg Smith wrote: > they'd take care of that as part of routine server setup. What wouldn't > be reasonable is to expect them to tune obscure parts of the kernel just > for your application. Well, I suppose it'd depend on what kind of hosting environment you're in (if I'm paying for dedicated hosting, you better believe I'm going to insist they tune the kernel the way I want), but you're right that in shared hosting for $25/mo, it's not going to happen. A -- Andrew Sullivan | ajs@crankycanuck.ca "The year's penultimate month" is not in truth a good way of saying November. --H.W. Fowler
Andrew Sullivan wrote: > On Fri, Jun 08, 2007 at 10:33:50AM -0400, Greg Smith wrote: > > they'd take care of that as part of routine server setup. What wouldn't > > be reasonable is to expect them to tune obscure parts of the kernel just > > for your application. > > Well, I suppose it'd depend on what kind of hosting environment > you're in (if I'm paying for dedicated hosting, you better believe > I'm going to insist they tune the kernel the way I want), but you're > right that in shared hosting for $25/mo, it's not going to happen. And consider other operating systems that don't have the same knobs. We should tune as best we can first without kernel knobs. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://www.enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Thu, Jun 07, 2007 at 10:16:25AM -0400, Tom Lane wrote: > Heikki Linnakangas <heikki@enterprisedb.com> writes: > > Thinking about this whole idea a bit more, it occured to me that the > > current approach to write all, then fsync all is really a historical > > artifact of the fact that we used to use the system-wide sync call > > instead of fsyncs to flush the pages to disk. That might not be the best > > way to do things in the new load-distributed-checkpoint world. > > > How about interleaving the writes with the fsyncs? > > I don't think it's a historical artifact at all: it's a valid reflection > of the fact that we don't know enough about disk layout to do low-level > I/O scheduling. Issuing more fsyncs than necessary will do little > except guarantee a less-than-optimal scheduling of the writes. If we extended relations by more than 8k at a time, we would know a lot more about disk layout, at least on filesystems with a decent amount of free space. -- Jim Nasby decibel@decibel.org EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Jim C. Nasby wrote: > On Thu, Jun 07, 2007 at 10:16:25AM -0400, Tom Lane wrote: >> Heikki Linnakangas <heikki@enterprisedb.com> writes: >>> Thinking about this whole idea a bit more, it occured to me that the >>> current approach to write all, then fsync all is really a historical >>> artifact of the fact that we used to use the system-wide sync call >>> instead of fsyncs to flush the pages to disk. That might not be the best >>> way to do things in the new load-distributed-checkpoint world. >>> How about interleaving the writes with the fsyncs? >> I don't think it's a historical artifact at all: it's a valid reflection >> of the fact that we don't know enough about disk layout to do low-level >> I/O scheduling. Issuing more fsyncs than necessary will do little >> except guarantee a less-than-optimal scheduling of the writes. > > If we extended relations by more than 8k at a time, we would know a lot > more about disk layout, at least on filesystems with a decent amount of > free space. I doubt it makes that much difference. If there was a significant amount of fragmentation, we'd hear more complaints about seq scan performance. The issue here is that we don't know which relations are on which drives and controllers, how they're striped, mirrored etc. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki@enterprisedb.com> wrote:

> True. On the other hand, if we issue writes in essentially random order,
> we might fill the kernel buffers with random blocks and the kernel needs
> to flush them to disk as almost random I/O. If we did the writes in
> groups, the kernel has better chance at coalescing them.

If the kernel can treat sequential writes better than random writes, is it worth sorting dirty buffers in block order per file at the start of checkpoints? Here is the pseudo code:

    buffers_to_be_written =
        SELECT buf_id, tag FROM BufferDescriptors
         WHERE (flags & BM_DIRTY) != 0
         ORDER BY tag.rnode, tag.blockNum;

    for { buf_id, tag } in buffers_to_be_written:
        if BufferDescriptors[buf_id].tag == tag:
            FlushBuffer(&BufferDescriptors[buf_id])

We can also avoid writing buffers newly dirtied after the checkpoint was started with this method.

> I tend to agree that if the goal is to finish the checkpoint as quickly
> as possible, the current approach is better. In the context of load
> distributed checkpoints, however, it's unlikely the kernel can do any
> significant overlapping since we're trickling the writes anyway.

Some kernels or storage subsystems treat all I/Os too fairly, so that user transactions waiting for reads are blocked by checkpoint writes. It is unavoidable behavior though, but we can split writes in small batches.

> I'm starting to feel we should give up on smoothing the fsyncs and
> distribute the writes only, for 8.3. As we get more experience with that
> and its shortcomings, we can enhance our checkpoints further in 8.4.

I agree with the writes-only distribution for 8.3. The new parameters introduced by it (checkpoint_write_percent and checkpoint_write_min_rate) will continue to be alive without major changes in the future, but other parameters seem to be volatile.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
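[Inside the bgwriter this could amount to little more than snapshotting the dirty buffer tags and running qsort() on them. The sketch below only illustrates that sorting step; the ToWrite struct and the surrounding collection code are assumptions for the example, not taken from any posted patch.]

    /* Sort a snapshot of dirty buffers by (rnode, blockNum). */
    typedef struct
    {
        int         buf_id;
        BufferTag   tag;        /* copied while holding the buffer header lock */
    } ToWrite;

    static int
    towrite_cmp(const void *a, const void *b)
    {
        const ToWrite *x = (const ToWrite *) a;
        const ToWrite *y = (const ToWrite *) b;
        int     c = memcmp(&x->tag.rnode, &y->tag.rnode, sizeof(RelFileNode));

        if (c != 0)
            return c;
        if (x->tag.blockNum < y->tag.blockNum)
            return -1;
        if (x->tag.blockNum > y->tag.blockNum)
            return 1;
        return 0;
    }

    /* ... after filling to_write[0..n-1] from BufferDescriptors: */
    qsort(to_write, n, sizeof(ToWrite), towrite_cmp);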
On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:

> If the kernel can treat sequential writes better than random writes, is
> it worth sorting dirty buffers in block order per file at the start of
> checkpoints?

I think it has the potential to improve things. There are three obvious and one subtle argument against it I can think of:

1) Extra complexity for something that may not help. This would need some good, robust benchmarking improvements to justify its use.

2) Block number ordering may not reflect actual order on disk. While true, it's got to be better correlated with it than writing at random.

3) The OS disk elevator should be dealing with this issue, particularly because it may really know the actual disk ordering.

Here's the subtle thing: by writing in the same order the LRU scan occurs in, you are writing dirty buffers in the optimal fashion to eliminate client backend writes during BufferAlloc. This makes the checkpoint a really effective LRU clearing mechanism. Writing in block order will change that.

I spent some time trying to optimize the elevator part of this operation, since I knew that on the system I was using block order was actual order. I found that under Linux, the behavior of the pdflush daemon that manages dirty memory had a more serious impact on writing behavior at checkpoint time than playing with the elevator scheduling method did. The way pdflush works actually has several interesting implications for how to optimize this patch. For example, how writes get blocked when the dirty memory reaches certain thresholds means that you may not get the full benefit of the disk elevator at checkpoint time the way most would expect.

Since much of that was basically undocumented, I had to write my own analysis of the actual workings, which is now available at http://www.westnet.com/~gsmith/content/linux-pdflush.htm I hope that anyone who wants more information about how Linux kernel parameters like dirty_background_ratio actually work, and how they impact the writing strategy, will find that article uniquely helpful.

> Some kernels or storage subsystems treat all I/Os too fairly so that
> user transactions waiting for reads are blocked by checkpoint writes.

In addition to that (which I've seen happen quite a bit), in the Linux case another fairness issue is that the code that handles writes allows a single process writing a lot of data to block writes for everyone else. That means that in addition to being blocked on actual reads, if a client backend starts a write in order to complete a buffer allocation to hold new information, that can grind to a halt because of the checkpoint process as well.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
ITAGAKI Takahiro wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> wrote:
>
>> True. On the other hand, if we issue writes in essentially random order,
>> we might fill the kernel buffers with random blocks and the kernel needs
>> to flush them to disk as almost random I/O. If we did the writes in
>> groups, the kernel has better chance at coalescing them.
>
> If the kernel can treat sequential writes better than random writes,
> is it worth sorting dirty buffers in block order per file at the start
> of checkpoints? Here is the pseudo code:
>
>     buffers_to_be_written =
>         SELECT buf_id, tag FROM BufferDescriptors
>          WHERE (flags & BM_DIRTY) != 0
>          ORDER BY tag.rnode, tag.blockNum;
>
>     for { buf_id, tag } in buffers_to_be_written:
>         if BufferDescriptors[buf_id].tag == tag:
>             FlushBuffer(&BufferDescriptors[buf_id])
>
> We can also avoid writing buffers newly dirtied after the checkpoint was
> started with this method.

That's worth testing, IMO. Probably won't happen for 8.3, though.

>> I tend to agree that if the goal is to finish the checkpoint as quickly
>> as possible, the current approach is better. In the context of load
>> distributed checkpoints, however, it's unlikely the kernel can do any
>> significant overlapping since we're trickling the writes anyway.
>
> Some kernels or storage subsystems treat all I/Os too fairly so that user
> transactions waiting for reads are blocked by checkpoint writes. It is
> unavoidable behavior though, but we can split writes in small batches.

That's really the heart of our problems. If the kernel had support for prioritizing the normal backend activity and LRU cleaning over the checkpoint I/O, we wouldn't need to throttle the I/O ourselves. The kernel has the best knowledge of what it can and can't do, and how busy the I/O subsystems are. Recent Linux kernels have some support for read I/O priorities, but not for writes. I believe the best long term solution is to add that support to the kernel, but it's going to take a long time until that's universally available, and we have a lot of platforms to support.

>> I'm starting to feel we should give up on smoothing the fsyncs and
>> distribute the writes only, for 8.3. As we get more experience with that
>> and its shortcomings, we can enhance our checkpoints further in 8.4.
>
> I agree with the writes-only distribution for 8.3. The new parameters
> introduced by it (checkpoint_write_percent and checkpoint_write_min_rate)
> will continue to be alive without major changes in the future, but other
> parameters seem to be volatile.

I'm going to start testing with just distributing the writes. Let's see how far that gets us.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
Tom,

> Do you have a better organizing principle than what's there now?

It's mostly detail stuff: putting VACUUM and Autovac together, breaking up some subsections that now have too many options in them into smaller groups. Client Connection Defaults has somehow become a catchall section for *any* USERSET variable, regardless of purpose. I'd like to trim it back down and assign some of those variables to appropriate sections.

On a more hypothetical basis, I was thinking of adding a section at the top with the 7-9 most common options that people *need* to set; this would make postgresql.conf much more accessible, but it would result in duplicate options, which might cause some issues.

--
Josh Berkus
PostgreSQL @ Sun
San Francisco
Josh Berkus <josh@agliodbs.com> writes: > On the more hypothetical basis I was thinking of adding a section at the top > with the 7-9 most common options that people *need* to set; this would make > PostgreSQL.conf much more accessable but would result in duplicate options > which might cause some issues. Doesn't sound like a good idea, but maybe there's a case for a comment there saying "these are the most important ones to look at"? regards, tom lane
Tom, > Doesn't sound like a good idea, but maybe there's a case for a comment > there saying "these are the most important ones to look at"? Yeah, probably need to do that. Seems user-unfriendly, but loading a foot gun by having some options appear twice in the file seems much worse. I'll also add some notes on how to set these values. -- Josh Berkus PostgreSQL @ Sun San Francisco
On Sun, Jun 10, 2007 at 08:49:24PM +0100, Heikki Linnakangas wrote:
> Jim C. Nasby wrote:
>> On Thu, Jun 07, 2007 at 10:16:25AM -0400, Tom Lane wrote:
>>> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>>>> Thinking about this whole idea a bit more, it occurred to me that the
>>>> current approach to write all, then fsync all is really a historical
>>>> artifact of the fact that we used to use the system-wide sync call
>>>> instead of fsyncs to flush the pages to disk. That might not be the best
>>>> way to do things in the new load-distributed-checkpoint world.
>>>> How about interleaving the writes with the fsyncs?
>>> I don't think it's a historical artifact at all: it's a valid reflection
>>> of the fact that we don't know enough about disk layout to do low-level
>>> I/O scheduling. Issuing more fsyncs than necessary will do little
>>> except guarantee a less-than-optimal scheduling of the writes.
>>
>> If we extended relations by more than 8k at a time, we would know a lot
>> more about disk layout, at least on filesystems with a decent amount of
>> free space.
>
> I doubt it makes that much difference. If there was a significant amount
> of fragmentation, we'd hear more complaints about seq scan performance.
>
> The issue here is that we don't know which relations are on which drives
> and controllers, how they're striped, mirrored etc.

Actually, isn't pre-allocation one of the tricks that Greenplum uses to get its seqscan performance?

--
Jim Nasby decibel@decibel.org
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)
Heikki Linnakangas wrote:
> Jim C. Nasby wrote:
>> On Thu, Jun 07, 2007 at 10:16:25AM -0400, Tom Lane wrote:
>>> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>>>> Thinking about this whole idea a bit more, it occurred to me that the
>>>> current approach to write all, then fsync all is really a historical
>>>> artifact of the fact that we used to use the system-wide sync call
>>>> instead of fsyncs to flush the pages to disk. That might not be the
>>>> best way to do things in the new load-distributed-checkpoint world.
>>>> How about interleaving the writes with the fsyncs?
>>> I don't think it's a historical artifact at all: it's a valid reflection
>>> of the fact that we don't know enough about disk layout to do low-level
>>> I/O scheduling. Issuing more fsyncs than necessary will do little
>>> except guarantee a less-than-optimal scheduling of the writes.
>>
>> If we extended relations by more than 8k at a time, we would know a lot
>> more about disk layout, at least on filesystems with a decent amount of
>> free space.
>
> I doubt it makes that much difference. If there was a significant amount
> of fragmentation, we'd hear more complaints about seq scan performance.

OTOH, extending a relation that uses N pages by something like min(ceil(N/1024), 1024) pages might help some filesystems to avoid fragmentation, and hardly introduce any waste (about 0.1% in the worst case). So if it's not too hard to do, it might be worthwhile, even if it turns out that most filesystems deal well with the current allocation pattern.

greetings, Florian Pflug
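[Florian's growth rule is simple to express in code; the tiny sketch below is my own illustration of it, with a minimum of one added page thrown in as an extra assumption for the empty-relation case.]

    /* Pages to add when a relation currently has n pages:
     * min(ceil(n/1024), 1024), i.e. grow by roughly 0.1% of the current
     * size, capped at 1024 pages (8 MB with 8 kB blocks). */
    static unsigned int
    extension_pages(unsigned int n)
    {
        unsigned int grow = (n + 1023) / 1024;  /* ceil(n / 1024) */

        if (grow < 1)
            grow = 1;       /* assumption: always add at least one page */
        if (grow > 1024)
            grow = 1024;
        return grow;
    }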
>>> If we extended relations by more than 8k at a time, we would know a lot
>>> more about disk layout, at least on filesystems with a decent amount of
>>> free space.
>>
>> I doubt it makes that much difference. If there was a significant amount
>> of fragmentation, we'd hear more complaints about seq scan performance.
>>
>> The issue here is that we don't know which relations are on which drives
>> and controllers, how they're striped, mirrored etc.
>
> Actually, isn't pre-allocation one of the tricks that Greenplum uses to
> get its seqscan performance?

My tests here show that, at least on reiserfs, after a few hours of benchmark torture (this represents several million write queries), table files become significantly fragmented. I believe the table and index files get extended more or less simultaneously and end up somehow a bit mixed up on disk. Seq scan perf suffers. reiserfs doesn't have an excellent fragmentation behaviour... NTFS is worse than hell in this respect. So, pre-alloc could be a good idea. Brutal defrag (cp /var/lib/postgresql to somewhere and back) gets seq scan perf back to disk throughput.

Also, by the way, InnoDB uses a BTree-organized table. The advantage is that data is always clustered on the primary key (which means you have to use something as your primary key that isn't necessarily "natural", you have to choose it to get good clustering, and you can't always do it right, so it somehow, in the end, sucks rather badly). Anyway, seq scan on InnoDB is very slow because, as the btree grows (just like postgres indexes), pages are split and scanning the pages in btree order becomes a mess of seeks. So, seq scan in InnoDB is very very slow unless periodic OPTIMIZE TABLE is applied. (A caveat to the postgres TODO item "implement automatic table clustering"...)
Greg Smith <gsmith@gregsmith.com> wrote:
> On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:
>> If the kernel can treat sequential writes better than random writes, is
>> it worth sorting dirty buffers in block order per file at the start of
>> checkpoints?

I wrote and tested the attached sorted-writes patch based on Heikki's ldc-justwrites-1.patch. There was an obvious performance win on the OLTP workload.

    tests                      | pgbench | DBT-2 response time (avg/90%/max)
   ----------------------------+---------+-----------------------------------
    LDC only                   | 181 tps | 1.12 / 4.38 / 12.13 s
    + BM_CHECKPOINT_NEEDED(*)  | 187 tps | 0.83 / 2.68 / 9.26 s
    + Sorted writes            | 224 tps | 0.36 / 0.80 / 8.11 s

    (*) Don't write buffers that were dirtied after starting the checkpoint.

    machine : 2GB-ram, SCSI*4 RAID-5
    pgbench : -s400 -t40000 -c10 (about 5GB of database)
    DBT-2   : 60WH (about 6GB of database)

> I think it has the potential to improve things. There are three obvious
> and one subtle argument against it I can think of:
>
> 1) Extra complexity for something that may not help. This would need some
> good, robust benchmarking improvements to justify its use.

Exactly. I think we need a discussion board for I/O performance issues. Can I use the Developers Wiki for this purpose? Since performance graphs and result tables are important for the discussion, it might be better than the mailing lists, which are text-based.

> 2) Block number ordering may not reflect actual order on disk. While
> true, it's got to be better correlated with it than writing at random.
> 3) The OS disk elevator should be dealing with this issue, particularly
> because it may really know the actual disk ordering.

Yes, both are true. However, I think there is a pretty high correlation between those orderings. In addition, we should use the filesystem to assure those orderings correspond to each other. For example, pre-allocation of files might help us, as has often been discussed.

> Here's the subtle thing: by writing in the same order the LRU scan occurs
> in, you are writing dirty buffers in the optimal fashion to eliminate
> client backend writes during BufferAlloc. This makes the checkpoint a
> really effective LRU clearing mechanism. Writing in block order will
> change that.

The issue will probably go away after we have LDC, because it writes LRU buffers during checkpoints.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
"PFC" <lists@peufeu.com> writes: > Anyway, seq-scan on InnoDB is very slow because, as the btree grows (just > like postgres indexes) pages are split and scanning the pages in btree order > becomes a mess of seeks. So, seq scan in InnoDB is very very slow unless > periodic OPTIMIZE TABLE is applied. (caveat to the postgres TODO item > "implement automatic table clustering"...) Heikki already posted a patch which goes a long way towards implementing what I think this patch refers to: trying to maintaining the cluster ordering on updates and inserts. It does it without changing the basic table structure at all. On updates and inserts it consults the indexam of the clustered index to ask if for a suggested block. If the index's suggested block has enough free space then the tuple is put there. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
"ITAGAKI Takahiro" <itagaki.takahiro@oss.ntt.co.jp> writes: > Exactly. I think we need a discussion board for I/O performance issues. > Can I use Developers Wiki for this purpose? Since performance graphs and > result tables are important for the discussion, so it might be better > than mailing lists, that are text-based. I would suggest keeping the discussion on mail and including links to refer to charts and tables in the wiki. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
ITAGAKI Takahiro wrote:
> Greg Smith <gsmith@gregsmith.com> wrote:
>> On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:
>>> If the kernel can treat sequential writes better than random writes, is
>>> it worth sorting dirty buffers in block order per file at the start of
>>> checkpoints?
>
> I wrote and tested the attached sorted-writes patch based on Heikki's
> ldc-justwrites-1.patch. There was an obvious performance win on the OLTP
> workload.
>
>  tests                     | pgbench | DBT-2 response time (avg/90%/max)
> ---------------------------+---------+-----------------------------------
>  LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
>  + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 /  9.26 s
>  + Sorted writes           | 224 tps | 0.36 / 0.80 /  8.11 s
>
> (*) Don't write buffers that were dirtied after starting the checkpoint.
>
> machine : 2GB-ram, SCSI*4 RAID-5
> pgbench : -s400 -t40000 -c10 (about 5GB of database)
> DBT-2   : 60WH (about 6GB of database)

Wow, I didn't expect that much gain from the sorted writes. How was LDC
configured?

>> 3) The OS disk elevator should be dealing with this issue, particularly
>> because it may really know the actual disk ordering.

Yeah, but we don't give the OS that much chance to coalesce writes when we
spread them out.

>> Here's the subtle thing: by writing in the same order the LRU scan occurs
>> in, you are writing dirty buffers in the optimal fashion to eliminate
>> client backend writes during BufferAlloc. This makes the checkpoint a
>> really effective LRU clearing mechanism. Writing in block order will
>> change that.
>
> The issue will probably go away after we have LDC, because it writes LRU
> buffers during checkpoints.

I think so too.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Thu, 14 Jun 2007, ITAGAKI Takahiro wrote:
> I think we need a discussion board for I/O performance issues. Can I use
> the Developers Wiki for this purpose? Performance graphs and result tables
> are important for the discussion, so it might work better than the
> text-based mailing lists.

I started pushing some of my stuff over there recently to make it easier to
edit, so other people can expand it with their expertise.
http://developer.postgresql.org/index.php/Buffer_Cache%2C_Checkpoints%2C_and_the_BGW
is what I've done so far on this particular topic.

What I would like to see on the Wiki first are pages devoted to how to run
the common benchmarks people use for useful performance testing. A recent
thread on one of the lists reminded me how easy it is to get worthless
results out of DBT2 if you don't have any guidance on that. I've already got
a stack of documentation about how to wrestle with pgbench and am generating
more.

The problem with using the Wiki as the main focus is that when you get to the
point that you want to upload detailed test results, that interface really
isn't appropriate for it. For example, in the last day I've collected data
from about 400 short test runs that generated 800 graphs. It's all organized
as HTML so you can drill down into the specific tests that executed oddly.
Heikki's DBT2 results are similar; not as many files, because he's running
longer tests, but the navigation is even more complicated. There is no way to
easily put that type and level of information into the Wiki page. You really
just need a web server to copy the results onto. Then the main problem you
have to be concerned about is a repeat of the OSDL situation, where all the
results just disappear if their hosting sponsor goes away.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
On Thu, 2007-06-14 at 16:39 +0900, ITAGAKI Takahiro wrote:
> Greg Smith <gsmith@gregsmith.com> wrote:
>
> > On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:
> > > If the kernel can treat sequential writes better than random writes, is
> > > it worth sorting dirty buffers in block order per file at the start of
> > > checkpoints?
>
> I wrote and tested the attached sorted-writes patch based on Heikki's
> ldc-justwrites-1.patch. There was an obvious performance win on the OLTP
> workload.
>
>  tests                     | pgbench | DBT-2 response time (avg/90%/max)
> ---------------------------+---------+-----------------------------------
>  LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
>  + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 /  9.26 s
>  + Sorted writes           | 224 tps | 0.36 / 0.80 /  8.11 s
>
> (*) Don't write buffers that were dirtied after starting the checkpoint.
>
> machine : 2GB-ram, SCSI*4 RAID-5
> pgbench : -s400 -t40000 -c10 (about 5GB of database)
> DBT-2   : 60WH (about 6GB of database)

I'm very surprised by the BM_CHECKPOINT_NEEDED results. What percentage of
writes has been saved by doing that? We would expect only a small percentage
of blocks, so that shouldn't make a significant difference. I thought we
discussed this before, about a year ago. It would be easy to get that wrong
and avoid writing a block that had been re-dirtied after the start of the
checkpoint but was already dirty beforehand. How long was the write phase of
the checkpoint, and how long between checkpoints?

I can see the sorted writes having an effect because the OS may not receive
blocks within a sufficient time window to fully optimise them. That effect
would grow with increasing sizes of shared_buffers and decrease with the size
of the controller cache. How big was the shared_buffers setting? What OS
scheduler are you using? The effect would be greatest when using Deadline.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
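[For readers puzzling over the BM_CHECKPOINT_NEEDED line in the table, a minimal sketch of the idea, not the actual bufmgr code; locking and the real BufferDesc layout are omitted. At checkpoint start, every buffer that is dirty right then gets the flag, and the write loop only touches flagged buffers.]

```c
#include <stddef.h>

#define BM_DIRTY                (1 << 0)
#define BM_CHECKPOINT_NEEDED    (1 << 1)

typedef struct BufferDesc
{
    unsigned    flags;
    /* ... buffer tag, locks, etc. omitted ... */
} BufferDesc;

/*
 * Phase 1: at the start of the checkpoint, remember which buffers are dirty
 * *right now*.  (The real code does this under the buffer header lock.)
 */
static void
mark_buffers_for_checkpoint(BufferDesc *bufs, size_t nbufs)
{
    size_t      i;

    for (i = 0; i < nbufs; i++)
    {
        if (bufs[i].flags & BM_DIRTY)
            bufs[i].flags |= BM_CHECKPOINT_NEEDED;
    }
}

/*
 * Phase 2: the write loop only flushes buffers carrying the flag.  A page
 * that first became dirty after phase 1 has BM_DIRTY but not
 * BM_CHECKPOINT_NEEDED, so it is skipped and left for the next checkpoint.
 */
static int
buffer_needs_checkpoint_write(const BufferDesc *buf)
{
    return (buf->flags & (BM_DIRTY | BM_CHECKPOINT_NEEDED)) ==
           (BM_DIRTY | BM_CHECKPOINT_NEEDED);
}
```

Note that a buffer which was already dirty at checkpoint start and is re-dirtied later still carries the flag and is still written; only pages that first become dirty after the mark pass are deferred, which is the distinction Simon is asking to have confirmed.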
On 6/14/07, Simon Riggs <simon@2ndquadrant.com> wrote:
> On Thu, 2007-06-14 at 16:39 +0900, ITAGAKI Takahiro wrote:
> > Greg Smith <gsmith@gregsmith.com> wrote:
> >
> > > On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:
> > > > If the kernel can treat sequential writes better than random writes, is
> > > > it worth sorting dirty buffers in block order per file at the start of
> > > > checkpoints?
> >
> > I wrote and tested the attached sorted-writes patch based on Heikki's
> > ldc-justwrites-1.patch. There was an obvious performance win on the OLTP
> > workload.
> >
> >  tests                     | pgbench | DBT-2 response time (avg/90%/max)
> > ---------------------------+---------+-----------------------------------
> >  LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
> >  + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 /  9.26 s
> >  + Sorted writes           | 224 tps | 0.36 / 0.80 /  8.11 s
> >
> > (*) Don't write buffers that were dirtied after starting the checkpoint.
> >
> > machine : 2GB-ram, SCSI*4 RAID-5
> > pgbench : -s400 -t40000 -c10 (about 5GB of database)
> > DBT-2   : 60WH (about 6GB of database)
>
> I'm very surprised by the BM_CHECKPOINT_NEEDED results. What percentage of
> writes has been saved by doing that? We would expect only a small percentage
> of blocks, so that shouldn't make a significant difference. I thought we
> discussed this before, about a year ago. It would be easy to get that wrong
> and avoid writing a block that had been re-dirtied after the start of the
> checkpoint but was already dirty beforehand. How long was the write phase of
> the checkpoint, and how long between checkpoints?
>
> I can see the sorted writes having an effect because the OS may not receive
> blocks within a sufficient time window to fully optimise them. That effect
> would grow with increasing sizes of shared_buffers and decrease with the
> size of the controller cache. How big was the shared_buffers setting? What
> OS scheduler are you using? The effect would be greatest when using Deadline.

Linux has some instrumentation that might be useful for this testing:

echo 1 > /proc/sys/vm/block_dump

This will have the kernel log all physical I/O (disable syslog writing to
disk before turning it on if you don't want the system to blow up).

Certainly the OS elevator should be working well enough not to see that much
of an improvement. Perhaps frequent fsync behavior is having an unintended
interaction with the elevator? ... It might be worthwhile to contact some
Linux kernel developers and see if there is some misunderstanding.
On Thu, 14 Jun 2007, Gregory Maxwell wrote:
> Linux has some instrumentation that might be useful for this testing:
> echo 1 > /proc/sys/vm/block_dump

That bit was developed for tracking down what was spinning the hard drive up
out of power-saving mode, and I was under the impression that such a rough
facility isn't all that useful here. I just tried to track down again where
I got that impression from, and I think it was this thread:
http://linux.slashdot.org/comments.pl?sid=231817&cid=18832379

It mentions general issues with figuring out who was responsible for a write
and specifically mentions how you'll have to reconcile two different paths if
fsync is mixed in. Not saying it won't work; it's just obvious that using the
block_dump output isn't a simple job.

(For anyone who would like an intro to this feature, try
http://www.linuxjournal.com/node/7539/print and
http://toadstool.se/journal/2006/05/27/monitoring-filesystem-activity-under-linux-with-block_dump )

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
> >  tests                     | pgbench | DBT-2 response time (avg/90%/max)
> > ---------------------------+---------+-----------------------------------
> >  LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
> >  + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 /  9.26 s
> >  + Sorted writes           | 224 tps | 0.36 / 0.80 /  8.11 s
> >
> > (*) Don't write buffers that were dirtied after starting the checkpoint.
> >
> > machine : 2GB-ram, SCSI*4 RAID-5
> > pgbench : -s400 -t40000 -c10 (about 5GB of database)
> > DBT-2   : 60WH (about 6GB of database)
>
> I'm very surprised by the BM_CHECKPOINT_NEEDED results. What
> percentage of writes has been saved by doing that? We would
> expect only a small percentage of blocks, so that
> shouldn't make a significant difference. I thought we

Wouldn't pages that are dirtied during the checkpoint also usually be rather
hot? If we lock one of those for writing, the chances are high that a client
needs to wait for the lock. An OS write call should usually be very fast, but
when the I/O gets bottlenecked it might easily become slower.

Probably the recent result, that it saves ~53% of the writes, is sufficient
explanation though.

Very nice results :-) Looks like we want all of it, including the sort.

Andreas
"Simon Riggs" <simon@2ndquadrant.com> wrote: > > tests | pgbench | DBT-2 response time (avg/90%/max) > > ---------------------------+---------+----------------------------------- > > LDC only | 181 tps | 1.12 / 4.38 / 12.13 s > > + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 / 9.26 s > > + Sorted writes | 224 tps | 0.36 / 0.80 / 8.11 s > > I'm very surprised by the BM_CHECKPOINT_NEEDED results. What percentage > of writes has been saved by doing that? > How long was the write phase of the checkpoint, how long > between checkpoints? > > I can see the sorted writes having an effect because the OS may not > receive blocks within a sufficient time window to fully optimise them. > That effect would grow with increasing sizes of shared_buffers and > decrease with size of controller cache. How big was the shared buffers > setting? What OS scheduler are you using? The effect would be greatest > when using Deadline. I didn't tune OS parameters, used default values. In terms of cache amounts, postgres buffers were larger than kernel write pool and controller cache. that's why the OS could not optimise writes enough in checkpoint, I think. - 200MB <- RAM * dirty_background_ratio - 128MB <- Controller cache - 2GB <- postgres shared_buffers I forget to gather detail I/O information in the tests. I'll retry it and report later. RAM 2GB Controller cache 128MB shared_buffers 1GB checkpoint_timeout = 15min checkpoint_write_percent = 50.0 RHEL4 (Linux 2.6.9-42.0.2.EL) vm.dirty_background_ratio = 10 vm.dirty_ratio = 40 vm.dirty_expire_centisecs = 3000 vm.dirty_writeback_centisecs = 500 Using cfq io scheduler Regards, --- ITAGAKI Takahiro NTT Open Source Software Center
On Fri, 2007-06-15 at 18:33 +0900, ITAGAKI Takahiro wrote:
> I didn't tune OS parameters; I used default values. In terms of cache
> amounts, postgres buffers were larger than the kernel write pool and the
> controller cache. That's why the OS could not optimise writes enough during
> the checkpoint, I think.
>
> - 200MB <- RAM * dirty_background_ratio
> - 128MB <- controller cache
> - 2GB   <- postgres shared_buffers
>
> I forgot to gather detailed I/O information in the tests.
> I'll retry it and report later.
>
> RAM                      2GB
> Controller cache         128MB
> shared_buffers           1GB
> checkpoint_timeout       = 15min
> checkpoint_write_percent = 50.0
>
> RHEL4 (Linux 2.6.9-42.0.2.EL)
> vm.dirty_background_ratio    = 10
> vm.dirty_ratio               = 40
> vm.dirty_expire_centisecs    = 3000
> vm.dirty_writeback_centisecs = 500
> Using the cfq I/O scheduler

Sounds like sorting the buffers before checkpoint is going to be a win once
we go above roughly ~128MB. We can do a simple test on NBuffers, rather than
have a sort_blocks_at_checkpoint (!) GUC. But it does seem there is a win for
larger settings of shared_buffers.

Does performance go up in the non-sorted case if we make shared_buffers
smaller? Sounds like it might. We should check that first.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com
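[A simple test on NBuffers along those lines could be as small as the sketch below; the 128MB cutoff is only the controller-cache figure mentioned in this thread, not a tuned value.]

```c
#define BLCKSZ          8192
#define SORT_THRESHOLD  (128 * 1024 * 1024)     /* ~128MB, illustrative cutoff */

/*
 * nbuffers is the shared_buffers setting expressed in BLCKSZ blocks.
 * Below roughly the size of a typical controller cache the OS elevator can
 * probably reorder the writes itself; above it, sort in the server.
 */
static int
should_sort_checkpoint_writes(long nbuffers)
{
    return (long long) nbuffers * BLCKSZ > SORT_THRESHOLD;
}
```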
Added to TODO:

* Consider sorting writes during checkpoint

  http://archives.postgresql.org/pgsql-hackers/2007-06/msg00541.php

---------------------------------------------------------------------------

ITAGAKI Takahiro wrote:
> Greg Smith <gsmith@gregsmith.com> wrote:
>
> > On Mon, 11 Jun 2007, ITAGAKI Takahiro wrote:
> > > If the kernel can treat sequential writes better than random writes, is
> > > it worth sorting dirty buffers in block order per file at the start of
> > > checkpoints?
>
> I wrote and tested the attached sorted-writes patch based on Heikki's
> ldc-justwrites-1.patch. There was an obvious performance win on the OLTP
> workload.
>
>  tests                     | pgbench | DBT-2 response time (avg/90%/max)
> ---------------------------+---------+-----------------------------------
>  LDC only                  | 181 tps | 1.12 / 4.38 / 12.13 s
>  + BM_CHECKPOINT_NEEDED(*) | 187 tps | 0.83 / 2.68 /  9.26 s
>  + Sorted writes           | 224 tps | 0.36 / 0.80 /  8.11 s
>
> (*) Don't write buffers that were dirtied after starting the checkpoint.
>
> machine : 2GB-ram, SCSI*4 RAID-5
> pgbench : -s400 -t40000 -c10 (about 5GB of database)
> DBT-2   : 60WH (about 6GB of database)
>
> > I think it has the potential to improve things. There are three obvious
> > and one subtle argument against it I can think of:
> >
> > 1) Extra complexity for something that may not help. This would need some
> > good, robust benchmarking improvements to justify its use.
>
> Exactly. I think we need a discussion board for I/O performance issues.
> Can I use the Developers Wiki for this purpose? Performance graphs and
> result tables are important for the discussion, so it might work better
> than the text-based mailing lists.
>
> > 2) Block number ordering may not reflect actual order on disk. While
> > true, it's got to be better correlated with it than writing at random.
> > 3) The OS disk elevator should be dealing with this issue, particularly
> > because it may really know the actual disk ordering.
>
> Yes, both are true. However, I think there is a pretty high correlation
> between those orderings. In addition, we should use the filesystem to ensure
> those orderings correspond to each other. For example, pre-allocation of
> files might help us, as has often been discussed.
>
> > Here's the subtle thing: by writing in the same order the LRU scan occurs
> > in, you are writing dirty buffers in the optimal fashion to eliminate
> > client backend writes during BufferAlloc. This makes the checkpoint a
> > really effective LRU clearing mechanism. Writing in block order will
> > change that.
>
> The issue will probably go away after we have LDC, because it writes LRU
> buffers during checkpoints.
>
> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center
>
> [ Attachment, skipping... ]
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster

--
Bruce Momjian <bruce@momjian.us>  http://momjian.us
EnterpriseDB  http://postgres.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +