Home > mailing lists

Thread: Load distributed checkpoint

Load distributed checkpoint

From

ITAGAKI Takahiro

Date:

07 December 2006, 02:33:50

This is a proposal for load distributed checkpoint.
(It is presented on postgresql anniversary summit in last summer.)

We offen encounters performance gap during checkpoint. The reason is write
bursts. Storage devices are too overworked in checkpoint, so they can not
supply usual transaction processing.

Checkpoint consists of the following four steps, and the major performance
problem is 2nd step. All dirty buffers are written without interval in it.
1. Query information (REDO pointer, next XID etc.)2. Write dirty pages in buffer pool3. Flush all modified files4.
Updatecontrol file

I suggested to write pages with sleeping in 2nd step, using normal activity
of the background writer. It is something like cost-based vacuum delay.
Background writer has two pointers, 'ALL' and 'LRU', indicating where to
write out in buffer pool. We can wait for the ALL clock-hand going around
to guarantee all pages to be written.

Here is pseudo-code for the proposed method. The internal loop is just the
same as bgwriter's activity.
PrepareCheckPoint(); -- do step 1 Reset num_of_scanned_pages by ALL activity; do { BgBufferSync(); -- do a part
ofstep 2 sleep(bgwriter_delay); } while (num_of_scanned_pages < shared_buffers); CreateCheckPoint(); -- do step 3
and4

We may accelerate background writer to reduce works at checkpoint instead of
the method, but it introduces another performance problem; Extra pressure
is always put on the storage devices to keep the number of dirty pages low.

I'm working about adjusting the progress of checkpoint to checkpoint timeout
and wal segments limitation automatically to avoid overlap of two checkpoints.
I'll post a patch sometime soon.

Comments and suggestions welcome.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Re: Load distributed checkpoint

From

"Takayuki Tsunakawa"

Date:

07 December 2006, 08:12:06

Hello, Itagaki-san


> This is a proposal for load distributed checkpoint.
> We offen encounters performance gap during checkpoint. The reason is
write
> bursts. Storage devices are too overworked in checkpoint, so they
can not
> supply usual transaction processing.

Good! You are focusing on a very important problem. System designers
don't like unsteady performance -- sudden slowdown. Commercial
database systems have made efforts to provide steady performance. I've
seen somewhere the report that Oracle provides stable throughput even
during checkpoint. I wonder how it is implemented.

> I'm working about adjusting the progress of checkpoint to checkpoint
timeout
> and wal segments limitation automatically to avoid overlap of two
checkpoints.

Have you already tried your patch? What's your first impression about
the improvement? I'm very interested. On my machine, pgbench shows 210
tps at first, however, it drops to 70 tps during checkpoint.


> Checkpoint consists of the following four steps, and the major
performance
> problem is 2nd step. All dirty buffers are written without interval
in it.
> 1. Query information (REDO pointer, next XID etc.)
> 2. Write dirty pages in buffer pool
> 3. Flush all modified files
> 4. Update control file

Hmm. Isn't it possible that step 3 affects the performance greatly?
I'm sorry if you have already identified step 2 as disturbing
backends.

As you know, PostgreSQL does not transfer the data to disk when
write()ing. Actual transfer occurs when fsync()ing at checkpoints,
unless the filesystem cache runs short. So, disk is overworked at
fsync()s.

What processing of bgwriter (shared locking of buffers, write(),
fsync(), flushing of log for WAL, etc.) do you consider (or did you
detect) as disturbing what processing of backends (exclusive locking
of buffers, putting log records onto WAL buffers, flushing log at
commit, writing dirty buffers when shared buffers run short)?

Re: Load distributed checkpoint

From

"Kevin Grittner"

Date:

07 December 2006, 12:03:18

>>> On Thu, Dec 7, 2006 at 12:05 AM, in message
<20061207144843.6269.ITAGAKI.TAKAHIRO@oss.ntt.co.jp>, ITAGAKI Takahiro
<itagaki.takahiro@oss.ntt.co.jp> wrote:
>
> We offen encounters performance gap during checkpoint. The reason is write
> bursts. Storage devices are too overworked in checkpoint, so they can not
> supply usual transaction processing.
When we first switched our web site to PostgreSQL, this was one of our biggest problems.  Queries which normally run in
afew milliseconds were hitting the 20 second limit we impose in our web application.  These were happening in bursts
whichsuggested that they were caused by checkpoints.  We adjusted the background writer configuration and nearly
eliminatedthe problem.bgwriter_all_maxpages           | 600bgwriter_all_percent            | 10bgwriter_delay
      | 200bgwriter_lru_maxpages           | 200bgwriter_lru_percent            | 20 
Between the xfs caching and the batter backed cache in the RAID controller, the disk writes seemed to settle out pretty
well.
> Checkpoint consists of the following four steps, and the major performance
> problem is 2nd step. All dirty buffers are written without interval in it.
>
>  1. Query information (REDO pointer, next XID etc.)
>  2. Write dirty pages in buffer pool
>  3. Flush all modified files
>  4. Update control file
>
> I suggested to write pages with sleeping in 2nd step, using normal activity
> of the background writer. It is something like cost- based vacuum delay.
> Background writer has two pointers, 'ALL' and 'LRU', indicating where to
> write out in buffer pool. We can wait for the ALL clock- hand going around
> to guarantee all pages to be written.
>
> Here is pseudo- code for the proposed method. The internal loop is just the
> same as bgwriter's activity.
>
>   PrepareCheckPoint();  --  do step 1
>   Reset num_of_scanned_pages by ALL activity;
>   do {
>       BgBufferSync();   --  do a part of step 2
>       sleep(bgwriter_delay);
>   } while (num_of_scanned_pages < shared_buffers);
>   CreateCheckPoint();   --  do step 3 and 4
Would the background writer be disabled during this extended checkpoint?  How is it better to concentrate step 2 in an
extendedcheckpoint periodically rather than consistently in the background writer? 
> We may accelerate background writer to reduce works at checkpoint instead of
> the method, but it introduces another performance problem; Extra pressure
> is always put on the storage devices to keep the number of dirty pages low.
Doesn't the file system caching logic combined with a battery backed cache in the controller cover this, or is your
patchto help out those who don't have battery backed controller cache?  What would the impact of your patch be on
environmentslike ours?  Will there be any affect on PITR techniques, in terms of how current the copied WAL files would
be?
> I'm working about adjusting the progress of checkpoint to checkpoint timeout
> and wal segments limitation automatically to avoid overlap of two
> checkpoints.
> I'll post a patch sometime soon.
>
> Comments and suggestions welcome.
>
> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center

Re: Load distributed checkpoint

From

Ron Mayer

Date:

07 December 2006, 15:02:32

Takayuki Tsunakawa wrote:
> Hello, Itagaki-san
>> Checkpoint consists of the following four steps, and the major
>> performance
>> problem is 2nd step. All dirty buffers are written without interval
>> in it.
>> 1. Query information (REDO pointer, next XID etc.)
>> 2. Write dirty pages in buffer pool
>> 3. Flush all modified files
>> 4. Update control file
> 
> Hmm. Isn't it possible that step 3 affects the performance greatly?
> I'm sorry if you have already identified step 2 as disturbing
> backends.
> 
> As you know, PostgreSQL does not transfer the data to disk when
> write()ing. Actual transfer occurs when fsync()ing at checkpoints,
> unless the filesystem cache runs short. So, disk is overworked at
> fsync()s.

It seems to me that virtual memory settings of the OS will determine
if step 2 or step 3 causes much of the actual disk I/O.

In particular, on Linux, things like /proc/sys/vm/dirty_expire_centisecs
and dirty_writeback_centisecs and possibly dirty_background_ratio
would affect this.  If those numbers are high, ISTM most write()s
from step 2 would wait for the flush in step 3.  If I understand
correctly, if the dirty_expire_centisecs number is low, most write()s
from step 2 would happen before step 3 because of the pdflush daemons.
I expect other OS's would have different but similar knobs to tune this.

It seems to me that the most portable way postgresql could force
the I/O to be balanced would be to insert otherwise unnecessary
fsync()s into step 2; but that it might (not sure why) be better
to handle this through OS-specific tuning outside of postgres.

Re: Load distributed checkpoint

From

"Takayuki Tsunakawa"

Date:

07 December 2006, 22:21:41

Hello,

As Mr. Mayer points out, which of step 2 or 3 actually causes I/O
depends on the VM settings, and the amount of RAM available for file
system cache.

"Ron Mayer" <rm_pg@cheapcomplexdevices.com> wrote in message
news:45786549.2000602@cheapcomplexdevices.com...
> It seems to me that the most portable way postgresql could force
> the I/O to be balanced would be to insert otherwise unnecessary
> fsync()s into step 2; but that it might (not sure why) be better
> to handle this through OS-specific tuning outside of postgres.

I'm afraid it is difficult for system designers to expect steady
throughput/response time, as long as PostgreSQL depends on the
flushing of file system cache. How does Oracle provide stable
performance?
Though I'm not sure, isn't it the key to use O_SYNC so that write()s
transfer data to disk? That is, PostgreSQL completely controls the
timing of data transfer. Moreover, if possible, it's better to bypass
the file system cache, using such as O_DIRECT flag for open() on UNIX
and FILE_FLAG_NO_BUFFERING flag for CreateFile() on Windows. As far as
I know, SQL Server and Oracle does this. I think commercial DBMSs do
the same thing to control and anticipate the I/O activity without
being influenced by VM policy.
If PostgreSQL is to use these, writing of dirty buffers has to be
improved. To decrease the count of I/O, pages adjacent on disk that
are also adjacent on memory must be written with one write().

From: "Kevin Grittner" <Kevin.Grittner@wicourts.gov>
> Would the background writer be disabled during this extended
checkpoint?  How is it better to concentrate step 2 in an extended
checkpoint periodically rather than consistently in the background
writer?
> Will there be any affect on PITR techniques, in terms of how current
the copied WAL files would be?

Extending the checkpoint can also cause extended downtime, to put it
in an extreme way. I understand that checkpoints occur during crash
recovery and PITR, so time for those operations would get longer. The
checkpoint also occurs at server shutdown. However, distinction among
these might be made, and undesirable extension could be avoided.

Re: Load distributed checkpoint

From

Greg Smith

Date:

07 December 2006, 23:33:19

On Thu, 7 Dec 2006, Kevin Grittner wrote:

> Between the xfs caching and the batter backed cache in the RAID...

Mmmmm, battered cache.  You can deep fry anything nowadays.

> Would the background writer be disabled during this extended checkpoint?

The background writer is the same process that does the full buffer sweep 
at checkpoint time.  You wouldn't have to disable it because it would be 
busy doing this extended checkpoint instead of its normal job.

> How is it better to concentrate step 2 in an extended checkpoint 
> periodically rather than consistently in the background writer?

Right now, when the checkpoint flush is occuring, there is no background 
writer active--that process is handling the checkpoint.  Itagaki's 
suggestion is basically to take the current checkpoint code, which runs 
all in one burst, and spread it out over time.  I like the concept, as 
I've seen the behavior he's describing (even after tuning the background 
writer like you suggest and doing Linux disk tuning as Ron describes), but 
I think solving the problem is a little harder than suggested.

I have two concerns with the logic behind this approach.  The first is 
that if the background writer isn't keeping up with writing out all the 
dirty pages, what makes you think that running the checkpoint with a 
similar level of activity is going to?  If your checkpoint is taking a 
long time, it's because the background writer has an overwhelming load and 
needs to be bailed out.  Slowing down the writes with a lazier checkpoint 
process introduces the possibility that you'll hit a second checkpoint 
request before you're even finished cleaning up the first one, and then 
you're really in trouble.

Second, the assumption here is that it's writing the dirty buffers out 
that is the primary cause of the ugly slowdown.  I too believe it could 
just as easily be the fsync when it's done that killing you, and slowing 
down the writes isn't necessarily going to make that faster.

> Doesn't the file system caching logic combined with a battery backed 
> cache in the controller cover this, or is your patch to help out those 
> who don't have battery backed controller cache?

Unless your shared buffer pool is so small that you can write it all out 
onto the cache, that won't help much with this problem.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Load distributed checkpoint

From

Greg Smith

Date:

07 December 2006, 23:59:39

On Fri, 8 Dec 2006, Takayuki Tsunakawa wrote:

> Though I'm not sure, isn't it the key to use O_SYNC so that write()s
> transfer data to disk?

If disk writes near checkpoint time aren't happening fast enough now, I 
doubt forcing a sync after every write will make that better.

> If PostgreSQL is to use these, writing of dirty buffers has to be
> improved. To decrease the count of I/O, pages adjacent on disk that
> are also adjacent on memory must be written with one write().

Sorting out which pages are next to one another on disk is one of the jobs 
the file system cache does; bypassing it will then make all that 
complicated sorting logic the job of the database engine.  And unlike the 
operating system, the engine doesn't even really know anything about the 
filesystem or physical disks involved, so what are the odds it's going to 
do a better job?  That's the concern with assuming direct writes are the 
solution here--you have to be smarter than the OS is for that to be an 
improvement, and you have a lot less information to make your decisions 
with than it does.

I like Itagaki's idea of some automatic tuning of the checkpoint timeout 
and wal segment parameters to help out with checkpoint behavior; that's an 
idea that would help a lot of people.  I'm not so sure that trying to make 
PostgreSQL take over operations that it's relying on the OS to handle 
intelligently right now will be as helpful, and it's a big programming job 
to even try.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Load distributed checkpoint

From

ITAGAKI Takahiro

Date:

08 December 2006, 00:40:42

Ron Mayer <rm_pg@cheapcomplexdevices.com> wrote:

> >> 1. Query information (REDO pointer, next XID etc.)
> >> 2. Write dirty pages in buffer pool
> >> 3. Flush all modified files
> >> 4. Update control file
> > 
> > Hmm. Isn't it possible that step 3 affects the performance greatly?
> > I'm sorry if you have already identified step 2 as disturbing
> > backends.
> 
> It seems to me that virtual memory settings of the OS will determine
> if step 2 or step 3 causes much of the actual disk I/O.
>
> if the dirty_expire_centisecs number is low, most write()s
> from step 2 would happen before step 3 because of the pdflush daemons.

Exactly. It depends on OSes, kernel settings, and filesystems.
I tested the patch on Linux kernel 2.6.9-39, default settings, and ext3fs.
Maybe pdflush daemons were strong enough to write dirty buffers in kernel,
so step 2 was a main part and 3 was not.

There are technical issues to distribute step 3. We can write buffers
on a page basis, that is granular enough. However, fsync() is on a file
basis (1GB), so we can only control granularity of fsync roughly.
sync_file_range (http://lwn.net/Articles/178199/) or some special APIs
would be a help, but there are portability issues...

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Re: Load distributed checkpoint

From

Tom Lane

Date:

08 December 2006, 01:31:03

Greg Smith <gsmith@gregsmith.com> writes:
> On Fri, 8 Dec 2006, Takayuki Tsunakawa wrote:
>> Though I'm not sure, isn't it the key to use O_SYNC so that write()s
>> transfer data to disk?

> If disk writes near checkpoint time aren't happening fast enough now, I 
> doubt forcing a sync after every write will make that better.

I think the idea would be to force the writes to actually occur, rather
than just being scheduled (and then forced en-masse by an fsync at
checkpoint time).  Since the point of the bgwriter is to try to force
writes to occur *outside* checkpoint times, this seems to make sense.
I share your doubts about the value of slowing down checkpoints --- but
to the extent that bgwriter-issued writes are delayed by the kernel
until the next checkpoint, we are certainly not getting the desired
effect of leveling the write load.

>> To decrease the count of I/O, pages adjacent on disk that
>> are also adjacent on memory must be written with one write().

> Sorting out which pages are next to one another on disk is one of the jobs 
> the file system cache does; bypassing it will then make all that 
> complicated sorting logic the job of the database engine.

Indeed --- the knowledge that we don't know the physical layout has
always been the strongest argument against using O_SYNC in this way.
But I don't think anyone's made any serious tests.  A properly tuned
bgwriter should be eating only a "background" level of I/O effort
between checkpoints, so maybe it doesn't matter too much if it's not
optimally scheduled.
        regards, tom lane

Re: Load distributed checkpoint

From

ITAGAKI Takahiro

Date:

08 December 2006, 01:31:04

"Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> wrote:

> I'm afraid it is difficult for system designers to expect steady
> throughput/response time, as long as PostgreSQL depends on the
> flushing of file system cache. How does Oracle provide stable
> performance?
> Though I'm not sure, isn't it the key to use O_SYNC so that write()s
> transfer data to disk?

AFAIK, other databases use write() and fsync() in combination. They call
fsync() immediately after they write buffers in some small batches. Otherwise,
they uses asynchronous and direct I/O options. Therefore, dirty pages in
kernel buffers are keeped to be low at any time.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Re: Load distributed checkpoint

From

ITAGAKI Takahiro

Date:

08 December 2006, 01:49:22

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:

> We adjusted the background writer configuration and nearly eliminated
> the problem.
>  
>  bgwriter_all_maxpages           | 600
>  bgwriter_all_percent            | 10
>  bgwriter_delay                  | 200
>  bgwriter_lru_maxpages           | 200
>  bgwriter_lru_percent            | 20
>  
> Between the xfs caching and the batter backed cache in the RAID
> controller, the disk writes seemed to settle out pretty well.

Yes, higher bgwriter_all_maxpages is better for stability. I also do so
up to now. However, if some processes makes lots of dirty buffers in the
shortest time, ex.VACUUM, bgwriter begins to write many pages and affects
responce time.

We will be able to set bgwriter_all_maxpages to lower value with load
distributed checkpoint. It expands the range of tuning of bgwriter.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Re: Load distributed checkpoint

From

"Jim C. Nasby"

Date:

08 December 2006, 03:13:12

On Thu, Dec 07, 2006 at 10:03:05AM -0600, Kevin Grittner wrote:
> When we first switched our web site to PostgreSQL, this was one of our biggest problems.  Queries which normally run
ina few milliseconds were hitting the 20 second limit we impose in our web application.  These were happening in bursts
whichsuggested that they were caused by checkpoints.  We adjusted the background writer configuration and nearly
eliminatedthe problem.
 
>  
>  bgwriter_all_maxpages           | 600
>  bgwriter_all_percent            | 10
>  bgwriter_delay                  | 200
>  bgwriter_lru_maxpages           | 200
>  bgwriter_lru_percent            | 20

Bear in mind that bgwriter settings should be considered in conjunction
with shared_buffer and checkpoint_timeout settings. For example, if you
have 60,000 shared buffers and a 300 second checkpoint interval, those
settings are going to be pretty aggressive.

Generally, I try and configure the all* settings so that you'll get 1
clock-sweep per checkpoint_timeout. It's worked pretty well, but I don't
have any actual tests to back that methodology up.
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: Load distributed checkpoint

From

Martijn van Oosterhout

Date:

08 December 2006, 06:44:38

On Fri, Dec 08, 2006 at 02:22:14PM +0900, ITAGAKI Takahiro wrote:
> AFAIK, other databases use write() and fsync() in combination. They call
> fsync() immediately after they write buffers in some small batches. Otherwise,
> they uses asynchronous and direct I/O options. Therefore, dirty pages in
> kernel buffers are keeped to be low at any time.

The "easy" solution I can think of is, when a session/backend is
exiting cleanly (client sent quit command), execute fsync() on some of
the descriptors before actually closing. At this point the user isn't
waiting anymore, so it can take its time.

The problem with fsync() remains that it can cause a write spike,
althoguh the more often you do it the less of an effect it would have.

A longer term solution maybe be create a daemon with system specific
information that monitors the load and tweaks parameters in response.
Not just postgresql parameters, but also system parameters. Even if it
never becomes part of postgresql, it will provide a way to test all
these "hunches" people have about optimising the system.

BTW, has anyone ever considered having the bgwriter do a NOTIFY
whenever it starts/ends a checkpoint, so client coulds monitor the
activity without reading the logs?

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: Load distributed checkpoint

From

"Inaam Rana"

Date:

08 December 2006, 08:17:44

On 12/7/06, Ron Mayer <rm_pg@cheapcomplexdevices.com> wrote:

Takayuki Tsunakawa wrote:
> Hello, Itagaki-san
>> Checkpoint consists of the following four steps, and the major
>> performance
>> problem is 2nd step. All dirty buffers are written without interval
>> in it.
>> 1. Query information (REDO pointer, next XID etc.)
>> 2. Write dirty pages in buffer pool
>> 3. Flush all modified files
>> 4. Update control file
>
> Hmm. Isn't it possible that step 3 affects the performance greatly?
> I'm sorry if you have already identified step 2 as disturbing
> backends.
>
> As you know, PostgreSQL does not transfer the data to disk when
> write()ing. Actual transfer occurs when fsync()ing at checkpoints,
> unless the filesystem cache runs short. So, disk is overworked at
> fsync()s.

It seems to me that virtual memory settings of the OS will determine
if step 2 or step 3 causes much of the actual disk I/O.

In particular, on Linux, things like /proc/sys/vm/dirty_expire_centisecs

dirty_expire_centisecs will have little, if any, effect on a box with consistent workload. Under uniform load bgwriter will keep pushing the buffers to fs cache which will result in eviction/flushing of pages to disk. That the pages will age quickly can lower the cap of dirty pages but it won't/can't handle sudden spike at checkpoint time.

and dirty_writeback_centisecs

Again on a system that encounters IO chokes on checkpoints pdflush is presumably working like crazy at that time. Reducing the gap between its wakeup calls will have probably very little impact on the checkpoint performance.

and possibly dirty_background_ratio

I have seen this to put a real cap on number of dirty pages during normal running. As regards checkpoints, this again seems to have little effect.

The problem while dealing with checkpoints is that we are dealing with two starkly different type of IO loads. The larger the number of shared_buffers the greater the spike in IO activity at checkpoint. AFAICS no specific vm tunables can smooth out checkpoint spikes by itself. There has to be some intelligence in the bgwriter to even the load out.

would affect this. If those numbers are high, ISTM most write()s
from step 2 would wait for the flush in step 3. If I understand
correctly, if the dirty_expire_centisecs number is low, most write()s
from step 2 would happen before step 3 because of the pdflush daemons.
I expect other OS's would have different but similar knobs to tune this.

It seems to me that the most portable way postgresql could force
the I/O to be balanced would be to insert otherwise unnecessary
fsync()s into step 2; but that it might (not sure why) be better
to handle this through OS-specific tuning outside of postgres.

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster

Re: Load distributed checkpoint

From

"Kevin Grittner"

Date:

08 December 2006, 11:26:41

>>> On Fri, Dec 8, 2006 at  1:13 AM, in message
<20061208071305.GG44124@nasby.net>,
"Jim C. Nasby" <jim@nasby.net> wrote: 
> On Thu, Dec 07, 2006 at 10:03:05AM - 0600, Kevin Grittner wrote:
>> We adjusted the background writer configuration
>> and nearly eliminated the  problem.
>>  
>>  bgwriter_all_maxpages           | 600
>>  bgwriter_all_percent            | 10
>>  bgwriter_delay                  | 200
>>  bgwriter_lru_maxpages           | 200
>>  bgwriter_lru_percent            | 20
> 
> Bear in mind that bgwriter settings should be considered in
conjunction
> with shared_buffer and checkpoint_timeout settings. For example, if
you
> have 60,000 shared buffers and a 300 second checkpoint interval,
those
> settings are going to be pretty aggressive.
> 
> Generally, I try and configure the all* settings so that you'll get
1
> clock- sweep per checkpoint_timeout. It's worked pretty well, but I
don't
> have any actual tests to back that methodology up.
We have 20,000 shared buffers and a 300 second checkpoint interval.
We got to these numbers somewhat scientifically.  I studied I/O
patterns under production load and figured we should be able to handle
about 800 writes in per 200 ms without causing problems.  I have to
admit that I based the percentages and the ratio between "all" and "lru"
on gut feel after musing over the documentation.
Since my values were such a dramatic change from the default, I boosted
the production settings a little bit each day and looked for feedback
from our web team.  Things improved with each incremental increase. 
When I got to my calculated values (above) they reported that these
timeouts had dropped to an acceptable level -- a few per day on a
website with 2 million hits per day.  We may benefit from further
adjustments, but since the problem is negligible with these settings,
there are bigger fish to fry at the moment.
By the way, if I remember correctly, these boxes have 256 MB battery
backed cache, while 20,000 buffers is 156.25 MB.
-Kevin

Re: Load distributed checkpoint

From

Tom Lane

Date:

08 December 2006, 12:43:43

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> "Jim C. Nasby" <jim@nasby.net> wrote: 
>> Generally, I try and configure the all* settings so that you'll get 1
>> clock-sweep per checkpoint_timeout. It's worked pretty well, but I don't
>> have any actual tests to back that methodology up.

> We got to these numbers somewhat scientifically.  I studied I/O
> patterns under production load and figured we should be able to handle
> about 800 writes in per 200 ms without causing problems.  I have to
> admit that I based the percentages and the ratio between "all" and "lru"
> on gut feel after musing over the documentation.

I like Kevin's settings better than what Jim suggests.  If the bgwriter
only makes one sweep between checkpoints then it's hardly going to make
any impact at all on the number of dirty buffers the checkpoint will
have to write.  The point of the bgwriter is to reduce the checkpoint
I/O spike by doing writes between checkpoints, and to have any
meaningful impact on that, you'll need it to make the cycle several times.

Another point here is that you want checkpoints to be pretty far apart
to minimize the WAL load from full-page images.  So again, a bgwriter
that's only making one loop per checkpoint is not gonna be doing much.

I wonder whether it would be feasible to teach the bgwriter to get more
aggressive as the time for the next checkpoint approaches?  Writes
issued early in the interval have a much higher probability of being
wasted (because the page gets re-dirtied later).  But maybe that just
reduces to what Takahiro-san already suggested, namely that
checkpoint-time writes should be done with the same kind of scheduling
the bgwriter uses outside checkpoints.  We still have the problem that
the real I/O storm is triggered by fsync() not write(), and we don't
have a way to spread out the consequences of fsync().
        regards, tom lane

Re: Load distributed checkpoint

From

"Kevin Grittner"

Date:

08 December 2006, 13:02:02

>>> On Fri, Dec 8, 2006 at 10:43 AM, in message
<6439.1165596207@sss.pgh.pa.us>,
Tom Lane <tgl@sss.pgh.pa.us> wrote: 
> 
> I wonder whether it would be feasible to teach the bgwriter to get
more
> aggressive as the time for the next checkpoint approaches?  Writes
> issued early in the interval have a much higher probability of being
> wasted (because the page gets re- dirtied later).
But wouldn't the ones written earlier stand a much better chance of
being scheduled for a physical write before the fsync?  They aren't ALL
re-dirtied.  In our environment, we seem to be getting a lot written
before the fsync with our current settings.
> But maybe that just
> reduces to what Takahiro- san already suggested, namely that
> checkpoint- time writes should be done with the same kind of
scheduling
> the bgwriter uses outside checkpoints.  We still have the problem
that
> the real I/O storm is triggered by fsync() not write(), and we don't
> have a way to spread out the consequences of fsync().
We don't have a way to force writes before the fsync, but early writes
to the file system encourages it.  After reading this thread, I'm
tempted to nudge our settings a little higher -- especially the
percentages.  How much overhead is there in checking whether buffer
pages are dirty?
-Kevin

Re: Load distributed checkpoint

From

"Kevin Grittner"

Date:

08 December 2006, 13:55:32

>>> On Fri, Dec 8, 2006 at 11:01 AM, in message
<4579461A.EE98.0025.0@wicourts.gov>, "Kevin Grittner"
<Kevin.Grittner@wicourts.gov> wrote: 
I'm going to correct a previous statement, based on having just chatted
with a member of the web team.
The remaining dribble of these was only occurring on the Windows
machines.  (We were running two servers from Windows and two from Linux
for a while to see how things compared.)  The settings I previously
posted have completely eliminated the problem on Linux, and we are
moving all servers to Linux, so this has become a total non-issue for
us.  I really encourage people to try some much more aggressive
background writer settings, and possibly try tuning the OS dirty write
settings.
Unless PostgreSQL gets much closer to the hardware than the community
consensus seems to support, I don't understand what you could do in the
checkpoint phase that would improve on that.  (That, of course, doesn't
mean I'm not missing something, just that the arguments made so far
haven't shown me that the suggested  changes would do anything but move
the problem around a little bit.)
-Kevin

Re: Load distributed checkpoint

From

Brad Nicholson

Date:

08 December 2006, 14:19:26

On Fri, 2006-12-08 at 11:55 -0600, Kevin Grittner wrote:
> >>> On Fri, Dec 8, 2006 at 11:01 AM, in message
> <4579461A.EE98.0025.0@wicourts.gov>, "Kevin Grittner"
> <Kevin.Grittner@wicourts.gov> wrote: 
>  
> I'm going to correct a previous statement, based on having just chatted
> with a member of the web team.
>  
> The remaining dribble of these was only occurring on the Windows
> machines.  (We were running two servers from Windows and two from Linux
> for a while to see how things compared.)  The settings I previously
> posted have completely eliminated the problem on Linux, and we are
> moving all servers to Linux, so this has become a total non-issue for
> us.  I really encourage people to try some much more aggressive
> background writer settings, and possibly try tuning the OS dirty write
> settings.

How much increased I/O usage have you seen in regular operation with
those settings?


-- 
Brad Nicholson  416-673-4106
Database Administrator, Afilias Canada Corp.

Re: Load distributed checkpoint

From

"Kevin Grittner"

Date:

08 December 2006, 14:30:24


>>> On Fri, Dec 8, 2006 at 12:18 PM, in message
<1165601938.10248.56.camel@dba5.int.libertyrms.com>, Brad Nicholson
<bnichols@ca.afilias.info> wrote: 
> 
> How much increased I/O usage have you seen in regular operation with
> those settings?

We have not experience any increase in I/O, just a smoothing.  Keep in
mind that the file system cache will collapse repeated writes to the
same location until things settle, and the controller's cache also has a
chance of doing so.  If we just push dirty pages out to the OS as soon
as possible, and let the file system do its job, I think we're in better
shape than if we try to micro-manage it within our buffer pages.
You mileage may vary of course, but I'm curious whether any real world
production examples exist where this approach is a loser.
-Kevin

Re: Load distributed checkpoint

From

"Inaam Rana"

Date:

10 December 2006, 10:49:11

Unless PostgreSQL gets much closer to the hardware than the community
consensus seems to support, I don't understand what you could do in the
checkpoint phase that would improve on that. (That, of course, doesn't
mean I'm not missing something, just that the arguments made so far
haven't shown me that the suggested changes would do anything but move
the problem around a little bit.)

If we can try out things like O_SYNC and/or O_DIRECT that will give us exact information about dirty pages that need be written to the disk at any given point in time. We can then use this information to design light weight checkpoints.

Re: Load distributed checkpoint

From

"Simon Riggs"

Date:

11 December 2006, 05:31:35

On Fri, 2006-12-08 at 11:05 +0900, Takayuki Tsunakawa wrote:
> I understand that checkpoints occur during crash
> recovery and PITR, so time for those operations would get longer. 

A restorepoint happens during recovery, not a checkpoint. The recovery
is merely repeating the work of the checkpoint that occurred in the
original WAL stream. Elongating the checkpoint would not have any effect
on a restorepoint: we only record the checkpoint when it is complete and
we only create a restorepoint when we see the checkpoint record.

Crash recovery and PITR use almost exactly the same code path (by
design), so there isn't anything special to say about PITR either.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com

Re: Load distributed checkpoint

From

ITAGAKI Takahiro

Date:

11 December 2006, 06:38:19

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:

> We have not experience any increase in I/O, just a smoothing.  Keep in
> mind that the file system cache will collapse repeated writes to the
> same location until things settle, and the controller's cache also has a
> chance of doing so.  If we just push dirty pages out to the OS as soon
> as possible, and let the file system do its job, I think we're in better
> shape than if we try to micro-manage it within our buffer pages.

Maybe we have two entirely different tuning approaches: 1. Retain dirty buffers in database, and keep OS buffers clean.
2.Keep database clean, and entrust OS to manage dirty buffers.

I suggested the 1st one, and you did the 2nd. Bottle-neck in checkpoints
vary in the approaches; write() will be worse in 1st, fsync() in 2nd.

Distributed write() is easier than distributed fsync(), because we can
use write() on a page basis, but fsync() only on a file basis.
Also, database has own access-frequency information for its buffers,
so I think 1st approach behaves better in handling re-dirty of buffers.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Re: Load distributed checkpoint

From

"Takayuki Tsunakawa"

Date:

11 December 2006, 08:11:10

Hello,

From: "ITAGAKI Takahiro" <itagaki.takahiro@oss.ntt.co.jp>
"Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> wrote:
>> I'm afraid it is difficult for system designers to expect steady
>> throughput/response time, as long as PostgreSQL depends on the
>> flushing of file system cache. How does Oracle provide stable
>> performance?
>> Though I'm not sure, isn't it the key to use O_SYNC so that
write()s
>> transfer data to disk?
>
> AFAIK, other databases use write() and fsync() in combination. They
call
> fsync() immediately after they write buffers in some small batches.
Otherwise,
> they uses asynchronous and direct I/O options. Therefore, dirty
pages in
> kernel buffers are keeped to be low at any time.


Oracle seems to use O_SYNC. I've found a related page in the Oracle
manuals.

--------------------------------------------------
http://download-west.oracle.com/docs/cd/B19306_01/win.102/b15688/ap_unix.htm
Direct Writes to Disk
On both UNIX and Windows platforms, bypassing the file system buffer
cache ensures data is written to disk.
On UNIX, Oracle Database uses the O_SYNC flag to bypass the file
system buffer cache. The flag name depends on the UNIX port.
On Windows, Oracle Database bypasses the file system buffer cache
completely.
--------------------------------------------------

As Itagaki-san says, asynchronous+direct I/O provides best
performance, I believe. Oracle supplies the combination as follows:

--------------------------------------------------
http://download-west.oracle.com/docs/cd/B19306_01/server.102/b15658/appc_linux.htm#sthref870
Oracle Database supports kernel asynchronous I/O. This feature is
disabled by default.
By default, the DISK_ASYNCH_IO initialization parameter in the
parameter file is set to TRUE to enable asynchronous I/O on raw
devices. To enable asynchronous I/O on file system files:
Ensure that all Oracle Database files are located on file systems that
support asynchronous I/O.
Set the FILESYSTEMIO_OPTIONS initialization parameter in the parameter
file to one of the following values:

Linux Distribution Recommended Value
SUSE Linux Enterprise Server 9 SETALL
Other distributions ASYNCH
--------------------------------------------------

I believe SQL Server also uses direct+asynchronous I/O, because
Microsoft recommends in MSDN that the combination plus appropriate
multi-threading provides best performance.

I tested Oracle9i on RHEL 2.1. I straced DBWn (database writer, which
is like the bgwriter of PostgreSQL) while creating tables, indexes,
etc. and shutting down the database server. Oracle surely uses the
O_SYNC as follows, but it doesn't use fsync().

24462 open("/work4/ora/tuna/users01.dbf", O_RDWR|O_SYNC|O_LARGEFILE) =
16

I wonder how the other big DBMS, IBM DB2, handles this. Is Itagaki-san
referring to DB2?

Re: Load distributed checkpoint

From

"Takayuki Tsunakawa"

Date:

11 December 2006, 08:15:40

Mr. Riggs,

Thank you for teaching me the following. I seem to have misunderstood.
I'll learn more.

From: "Simon Riggs" <simon@2ndquadrant.com>
> On Fri, 2006-12-08 at 11:05 +0900, Takayuki Tsunakawa wrote:
>> I understand that checkpoints occur during crash
>> recovery and PITR, so time for those operations would get longer.
>
> A restorepoint happens during recovery, not a checkpoint. The
recovery
> is merely repeating the work of the checkpoint that occurred in the
> original WAL stream. Elongating the checkpoint would not have any
effect
> on a restorepoint: we only record the checkpoint when it is complete
and
> we only create a restorepoint when we see the checkpoint record.


Regards,

----- Original Message ----- 
From: "Simon Riggs" <simon@2ndquadrant.com>
To: "Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com>
Cc: "ITAGAKI Takahiro" <itagaki.takahiro@oss.ntt.co.jp>;
<pgsql-hackers@postgresql.org>
Sent: Monday, December 11, 2006 6:30 PM
Subject: Re: [HACKERS] Load distributed checkpoint


> On Fri, 2006-12-08 at 11:05 +0900, Takayuki Tsunakawa wrote:
>> I understand that checkpoints occur during crash
>> recovery and PITR, so time for those operations would get longer.
>
> A restorepoint happens during recovery, not a checkpoint. The
recovery
> is merely repeating the work of the checkpoint that occurred in the
> original WAL stream. Elongating the checkpoint would not have any
effect
> on a restorepoint: we only record the checkpoint when it is complete
and
> we only create a restorepoint when we see the checkpoint record.
>
> Crash recovery and PITR use almost exactly the same code path (by
> design), so there isn't anything special to say about PITR either.
>
> -- 
>  Simon Riggs
>  EnterpriseDB   http://www.enterprisedb.com
>
>
>

Re: Load distributed checkpoint

From

"Inaam Rana"

Date:

11 December 2006, 08:48:34

I wonder how the other big DBMS, IBM DB2, handles this. Is Itagaki-san
referring to DB2?

DB2 would also open data files with O_SYNC option and page_cleaners (counterparts of bgwriter) would exploit AIO if available on the system.

Inaam Rana
EnterpriseDB http://www.enterprisedb.com

Re: Load distributed checkpoint

From

Ron Mayer

Date:

11 December 2006, 17:32:13

ITAGAKI Takahiro wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote:
> 
>> ...the file system cache will collapse repeated writes to the
>> same location until things settle ...
>> If we just push dirty pages out to the OS as soon as possible, 
>> and let the file system do its job, I think we're in better
>> shape...
> 
> Maybe we have two entirely different tuning approaches:
>   1. Retain dirty buffers in database, and keep OS buffers clean.
>   2. Keep database clean, and entrust OS to manage dirty buffers.
> 
> I suggested the 1st one, and you did the 2nd. Bottle-neck in checkpoints
> vary in the approaches; write() will be worse in 1st, fsync() in 2nd.

The fsync() won't necessarily be worse in the second approach.  OS's have
quite a few tunable parameters that can encourage the system to physically
write the pending write()s before the fsync() - either in the background
or by the process doing the write() itself when there are too many
dirty pages.

For checkpoints, I think the main question is whether postgresql's
background writer is smarter or less smart than pdflush or the
equivalent on your system for database workloads.

> Also, database has own access-frequency information for its buffers,
> so I think 1st approach behaves better in handling re-dirty of buffers.

I'm curious what access-frequency info the OS and the database has.

One thing I do worry about is if both postgresql and the OS
are both delaying write()s in the hopes of collapsing them
at the same time.  If so, this would cause both to be experience
bigger delays than expected, and make checkpoints worse.

I'm guesing if you use approach 1. you might be better off
turning down the amount of buffering that the OS does with
dirty pages - and if you use approach 2, you might be better
off turning down the amount of delays that postgresql adds.

Re: Load distributed checkpoint

From

"Kevin Grittner"

Date:

11 December 2006, 17:58:34

>>> On Mon, Dec 11, 2006 at  3:31 PM, in message
<457DCE4D.7090402@cheapcomplexdevices.com>, Ron Mayer
<rm_pg@cheapcomplexdevices.com> wrote: 
> 
> One thing I do worry about is if both postgresql and the OS
> are both delaying write()s in the hopes of collapsing them
> at the same time.  If so, this would cause both to be experience
> bigger delays than expected, and make checkpoints worse.
That is my concern.  Letting 30 seconds worth of dirty pages accumulate
between checkpoints and then trying to smooth the writes within
checkpoint code seems like a doomed effort.  Either we take over control
of everything, like many other products, and schedule it all ourselves,
or we push it out to the file system fairly quickly and let it do its
job.  Let's not stand with one foot on the pier and one on the boat.
It is entirely possible that the current setup, where a delay of a
couple seconds may collapse several OS API calls, could have a
performance benefit; but, let's be clear that we're talking about saving
that overhead, not disk I/O.
-Kevin

Re: Load distributed checkpoint

From

"Zeugswetter Andreas ADI SD"

Date:

12 December 2006, 05:22:17

> > One thing I do worry about is if both postgresql and the OS
> > are both delaying write()s in the hopes of collapsing them
> > at the same time.  If so, this would cause both to be experience
> > bigger delays than expected, and make checkpoints worse.
>
> That is my concern.  Letting 30 seconds worth of dirty pages
accumulate
> between checkpoints and then trying to smooth the writes within
> checkpoint code seems like a doomed effort.

30 seconds ??? You are supposed to avoid excessive checkpoints.
If you are reducing checkpoint_timeout to avoid the spike, you are imho
definitely tuning the wrong knob. You are supposed to increase
checkpoint_timeout as far as you can to still have an acceptable
recovery delay after a crash.

If you cannot afford a huge spike during checkpoint, say every 30
minutes,
you have to make bgwriter more aggressive. This would generally be true
for
both of the variants, db driven direct io and buffered filesystem io.

Andreas

Re: Load distributed checkpoint

From

"Kevin Grittner"

Date:

12 December 2006, 11:04:10

>>> On Tue, Dec 12, 2006 at  3:22 AM, in message
<E1539E0ED7043848906A8FF995BDA5790198ECC8@m0143.s-mxs.net>,
"Zeugswetter
Andreas ADI SD" <ZeugswetterA@spardat.at> wrote: 
>> > One thing I do worry about is if both postgresql and the OS
>> > are both delaying write()s in the hopes of collapsing them
>> > at the same time.  If so, this would cause both to be experience
>> > bigger delays than expected, and make checkpoints worse.
>>  
>> That is my concern.  Letting 30 seconds worth of dirty pages
> accumulate
>> between checkpoints and then trying to smooth the writes within
>> checkpoint code seems like a doomed effort.
> 
> 30 seconds ??? You are supposed to avoid excessive checkpoints.
> If you are reducing checkpoint_timeout to avoid the spike, you are
imho
> definitely tuning the wrong knob.
Sorry for the fuzzy language -- I was intending to describe a situation
where dirty pages accumulate by checkpoint time which would take 30
seconds to write to disk.  We were into this situation (and worse) with
the default bgwriter settings.
> you have to make bgwriter more aggressive.
This is what I've been saying.
I've also been saying that if the PostgreSQL way is to let the file
system handle the caching and I/O scheduling, we should trust it to know
what to do with dirty pages, and not try to second-guess it.  (Of course
there are knobs to tune the file system if needed.)  Our checkpoint
performance issues went away when we went to settings which basically
never leave a dirty page hidden from the file system for more than two
seconds.
-Kevin

Re: Load distributed checkpoint

From

Bruce Momjian

Date:

12 December 2006, 13:17:21

I have thought a while about this and I have some ideas.

Ideally, we would be able to trickle the sync of individuals blocks
during the checkpoint, but we can't because we rely on the kernel to
sync all dirty blocks that haven't made it to disk using fsync().  We
could trickle the fsync() calls, but that just extends the amount of
data we are writing that has been dirtied post-checkpoint.  In an ideal
world, we would be able to fsync() only part of a file at a time, and
only those blocks that were dirtied pre-checkpoint, but I don't see that
happening anytime soon (and one reason why many commercial databases
bypass the kernel cache).

So, in the real world, one conclusion seems to be that our existing
method of tuning the background writer just isn't good enough for the
average user:
#bgwriter_delay = 200ms                 # 10-10000ms between rounds#bgwriter_lru_percent = 1.0             # 0-100% of
LRUbuffers scanned/round#bgwriter_lru_maxpages = 5              # 0-1000 buffers max written/round#bgwriter_all_percent
=0.333           # 0-100% of all buffers scanned/round#bgwriter_all_maxpages = 5              # 0-1000 buffers max
written/round

These settings control what the bgwriter does, but they do not clearly
relate to the checkpoint timing, which is the purpose of the bgwriter,
and they don't change during the checkpoint interval, which is also less
than ideal.  If set to aggressively, it writes too much, and if too low,
the checkpoint does too much I/O.

We clearly need more bgwriter activity as the checkpoint approaches, and
one that is more auto-tuned, like many of our other parameters.  I think
we created these settings to see how they worked in the field, so it
probably time to reevaluate them based on field reports.

I think the bgwriter should keep track of how far it is to the next
checkpoint, and use that information to increase write activity. 
Basically now, during a checkpoint, the bgwriter does a full buffer scan
and fsync's all dirty files, so it changes from the configuration
parameter-defined behavior right to 100% activity.  I think it would be
ideal if we could ramp up the writes so that when it is 95% to the next
checkpoint, it can be operating at 95% of the activity it would do
during a checkpoint.

My guess is if we can do that, we will have much smoother performance
because we have more WAL writes just after checkpoint for newly-dirtied
pages, and the new setup will give us more write activity just before
checkpoint.

One other idea is for the bgwriter to use O_DIRECT or O_SYNC to avoid
the kernel cache, so we are sure data will be on disk by checkpoint
time.  This was avoided in the past because of the expense of
second-guessing the kernel disk I/O scheduling algorithms.

---------------------------------------------------------------------------

Tom Lane wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> > "Jim C. Nasby" <jim@nasby.net> wrote: 
> >> Generally, I try and configure the all* settings so that you'll get 1
> >> clock-sweep per checkpoint_timeout. It's worked pretty well, but I don't
> >> have any actual tests to back that methodology up.
> 
> > We got to these numbers somewhat scientifically.  I studied I/O
> > patterns under production load and figured we should be able to handle
> > about 800 writes in per 200 ms without causing problems.  I have to
> > admit that I based the percentages and the ratio between "all" and "lru"
> > on gut feel after musing over the documentation.
> 
> I like Kevin's settings better than what Jim suggests.  If the bgwriter
> only makes one sweep between checkpoints then it's hardly going to make
> any impact at all on the number of dirty buffers the checkpoint will
> have to write.  The point of the bgwriter is to reduce the checkpoint
> I/O spike by doing writes between checkpoints, and to have any
> meaningful impact on that, you'll need it to make the cycle several times.
> 
> Another point here is that you want checkpoints to be pretty far apart
> to minimize the WAL load from full-page images.  So again, a bgwriter
> that's only making one loop per checkpoint is not gonna be doing much.
> 
> I wonder whether it would be feasible to teach the bgwriter to get more
> aggressive as the time for the next checkpoint approaches?  Writes
> issued early in the interval have a much higher probability of being
> wasted (because the page gets re-dirtied later).  But maybe that just
> reduces to what Takahiro-san already suggested, namely that
> checkpoint-time writes should be done with the same kind of scheduling
> the bgwriter uses outside checkpoints.  We still have the problem that
> the real I/O storm is triggered by fsync() not write(), and we don't
> have a way to spread out the consequences of fsync().
> 
>             regards, tom lane
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>        choose an index scan if your joining column's datatypes do not
>        match

--  Bruce Momjian   bruce@momjian.us EnterpriseDB    http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +

Re: Load distributed checkpoint

From

Gregory Stark

Date:

12 December 2006, 13:38:51

> Tom Lane wrote:
>> 
>> I like Kevin's settings better than what Jim suggests.  If the bgwriter
>> only makes one sweep between checkpoints then it's hardly going to make
>> any impact at all on the number of dirty buffers the checkpoint will
>> have to write.  The point of the bgwriter is to reduce the checkpoint
>> I/O spike by doing writes between checkpoints, and to have any
>> meaningful impact on that, you'll need it to make the cycle several times.
>> 
>> Another point here is that you want checkpoints to be pretty far apart
>> to minimize the WAL load from full-page images.  So again, a bgwriter
>> that's only making one loop per checkpoint is not gonna be doing much.

I missed the previous message but it sounds like you're operating under a
different set of assumptions than the original poster. If you do a single
sweep through all of the buffers *and sync them* then you've just finished a
checkpoint -- the *previous* checkpoint. Not the subsequent one.

That is, rather than trying to spread the load of the checkpoint out by
getting the writes into the kernel sooner but make no attempt to sync them
until checkpoint time, start the checkpoint as soon as the previous checkpoint
finishes, and dribble the blocks of the checkpoint out slowly throughout an
entire checkpoint cycle syncing them immediately using O_SYNC/ODIRECT.

It's a fundamental shift in the idea of the purpose of bgwriter. Instead of
trying to suck i/o away from the subsequent checkpoint it would be responsible
for all the i/o of the previous checkpoint which would still be in progress
for the entire time of checkpoint_timeout. It would only complete when
bgwriter had finished doing its one full sweep.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com

Re: Load distributed checkpoint

From

Tom Lane

Date:

12 December 2006, 13:48:56

Gregory Stark <stark@enterprisedb.com> writes:
> It's a fundamental shift in the idea of the purpose of bgwriter. Instead of
> trying to suck i/o away from the subsequent checkpoint it would be responsible
> for all the i/o of the previous checkpoint which would still be in progress
> for the entire time of checkpoint_timeout. It would only complete when
> bgwriter had finished doing its one full sweep.

I think that's basically the same as the original suggestion, which is
to do checkpoints using less than the full I/O bandwidth of the machine
--- tying it exactly to the default bgwriter rate may or may not be
appropriate.

The difficulty with such schemes is that if you go over to using O_DIRECT
writes instead of fsync in the bgwriter, it's hard to avoid doing the
same when a random backend has to write a dirty buffer --- yet you'd
really rather that such a backend not have to wait for the ensuing I/O.
And this gets a lot worse if checkpoints are slowed down, because it gets
more likely that the bufmgr will run out of clean buffers and have to do
a write() from a backend.
        regards, tom lane

Re: Load distributed checkpoint

From

"Jim C. Nasby"

Date:

13 December 2006, 02:32:55

On Fri, Dec 08, 2006 at 11:43:27AM -0500, Tom Lane wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> > "Jim C. Nasby" <jim@nasby.net> wrote: 
> >> Generally, I try and configure the all* settings so that you'll get 1
> >> clock-sweep per checkpoint_timeout. It's worked pretty well, but I don't
> >> have any actual tests to back that methodology up.
> 
> > We got to these numbers somewhat scientifically.  I studied I/O
> > patterns under production load and figured we should be able to handle
> > about 800 writes in per 200 ms without causing problems.  I have to
> > admit that I based the percentages and the ratio between "all" and "lru"
> > on gut feel after musing over the documentation.
> 
> I like Kevin's settings better than what Jim suggests.  If the bgwriter
> only makes one sweep between checkpoints then it's hardly going to make
> any impact at all on the number of dirty buffers the checkpoint will
> have to write.  The point of the bgwriter is to reduce the checkpoint
> I/O spike by doing writes between checkpoints, and to have any
> meaningful impact on that, you'll need it to make the cycle several times.

It would be good if the docs included more detailed info on how exactly
the bgwriter goes about flushing stuff to disk. You can certainly read
them and think that the bgwriter just goes through and issues writes for
any dirty buffers it finds. Though, looking at BgBufferSync, I think it
actually does write out pages during the "all" scan, regardless of what
usage_count says.

> I wonder whether it would be feasible to teach the bgwriter to get more
> aggressive as the time for the next checkpoint approaches?  Writes
> issued early in the interval have a much higher probability of being
> wasted (because the page gets re-dirtied later).  But maybe that just
> reduces to what Takahiro-san already suggested, namely that
> checkpoint-time writes should be done with the same kind of scheduling
> the bgwriter uses outside checkpoints.  We still have the problem that
> the real I/O storm is triggered by fsync() not write(), and we don't
> have a way to spread out the consequences of fsync().

Would the ramp-up of write activity push the kernel to actually write
stuff? My understanding is that most OSes have a time limit on how long
they'll let a write-request sit in cache, so ISTM a better way to smooth
out disk IO is to write things in a steady stream.

If the bgwriter takes the buffer access counter into account when
deciding what to write out, it might make sense to write more recently
accessed pages as checkpoint nears. The idea being that odds are good
those buffers are about to get flushed by BufferSync() anyway.

Also, I have a dumb question... BgBufferSync uses buf_id1 to keep track
of what buffer the bgwriter_all scan is looking at, which means that
it should remember where it was at the end of the last scan; yet it's
initialized to 0 every time BgBufferSync is called. Is there someplace
else that is remembering where the complete scan is leaving off when
bgwriter_all_percent or bgwriter_all_maxpages is hit? Or does the scan
in fact just keep re-scanning the beginning of the buffers?
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: Load distributed checkpoint

From

"Takayuki Tsunakawa"

Date:

13 December 2006, 05:20:32

Hello,

From: "Jim C. Nasby" <jim@nasby.net>Also, I have a dumb question... BgBufferSync uses buf_id1 to keep
track
> of what buffer the bgwriter_all scan is looking at, which means that
> it should remember where it was at the end of the last scan; yet
it's
> initialized to 0 every time BgBufferSync is called. Is there
someplace
> else that is remembering where the complete scan is leaving off when
> bgwriter_all_percent or bgwriter_all_maxpages is hit? Or does the
scan
> in fact just keep re-scanning the beginning of the buffers?

No. BgBufferSync() correctly keeps track of the position to restart
scanning at. bufid1 is not initialized to 0 every time BgBufferSync()
is called, because bufid1 is a static local variable. Please see the
following code. It prints:

a=0
a=1
a=2


#include <stdio.h>

void func(void)
{static int a = 0;
printf("a=%d\n", a);a++;
}

int main(void)
{func();func();func();
 return 0;
}

Re: Load distributed checkpoint

From

"Jim C. Nasby"

Date:

13 December 2006, 15:34:25

On Wed, Dec 13, 2006 at 06:27:38PM +0900, Takayuki Tsunakawa wrote:
> No. BgBufferSync() correctly keeps track of the position to restart
> scanning at. bufid1 is not initialized to 0 every time BgBufferSync()
> is called, because bufid1 is a static local variable. Please see the
> following code. It prints:

Doh, I completely missed the static part of the declaration. Thanks for
the C tutorial. :)
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: Load distributed checkpoint

From

"Takayuki Tsunakawa"

Date:

18 December 2006, 01:43:10

Hello, Itagaki-san, all

Sorry for my long mail.  I've had trouble in sending this mail because
it's too long for pgsql-hackers to accept (I couldn't find how large
mail is accepted.)  So I'm trying to send several times.
Please see the attachment for the content.

Re: Load distributed checkpoint

From

ITAGAKI Takahiro

Date:

19 December 2006, 05:31:33

"Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> wrote:

> I performed some simple tests, and I'll show the results below.

> (1) The default case
> 235  80  226 77  240
> (2) No write case
> 242  250  244  253  280
> (3) No checkpoint case
> 229  252  256  292  276
> (4) No fsync() case
> 236  112  215  216  221
> (5) No write by PostgreSQL, but fsync() by another program case
> 9  223  260  283  292
> (6) case (5) + O_SYNC by write_fsync
> 97  114  126  112  125
> (7) O_SYNC case
> 182  103  41  50  74

I posted a patch to PATCHES. Please try out it.
It does write() smoothly, but fsync() at a burst.
I suppose the result will be between (3) and (5).

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Re: Load distributed checkpoint

From

"Takayuki Tsunakawa"

Date:

19 December 2006, 23:59:32

Hello, Itagaki-san

> I posted a patch to PATCHES. Please try out it.

Really!?  I've just joined pgsql-patches.  When did you post it,
yesterday?  I couldn't find the patch in the following page which
lists the mails to pgsql-patches of this month:

http://archives.postgresql.org/pgsql-patches/2006-12/index.php

Could you send me the patch if it has not already registered on the
above page.  I want to try the patch by all means, because smoothing
response times is very important so that we can recommend PostgreSQL
to system designers.  I really wish your patch will be the real
solution.

> It does write() smoothly, but fsync() at a burst.
> I suppose the result will be between (3) and (5).

Hmm...  I think some logical reasoning is needed to get the
understanding from community members (sorry if the community members
have already agreed.)  Excuse me for repeating myself, but I'm afraid
fsync() will be the evil sometime in some environments.  Success in
one test environment is not the real success.

Re: Load distributed checkpoint

From

"Takayuki Tsunakawa"

Date:

20 December 2006, 07:04:17

Hello, Itagaki-san, all

I have to report a sad result.  Your patch didn't work.  Let's
consider the solution together.  What you are addressing is very
important for the system designers in the real world -- smoothing
response time.

Recall that unpatched PostgreSQL showed the following tps's in case
(1) (i.e. with default bgwriter_* and checkpoint_* settings.)

235  80  226 77  240

The patched PostgreSQL showed the following tps's:

230  228  77  209  66

[disk usage]
The same tendency can be seen as with the unpatched PostgreSQL.  That
is:
When the tps is low, the %util of disk for data files is high, and
%util of disk for WAL is low.  Why is transaction logging is disturbed
by cleaning and/or syncing activity?  While the bgwriter is
fsync()ing, it does not lock any data structures that the transactions
want to access.  Even though they share the same SCSI controller and
bus, they are different disks.  The bandwidth does not appear to be
exhausted, since Ultra320 is said to have 256MB band width in
practice.
(Recall that WAL is on sdd and data files are on sde.)

Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s
avgrq-sz avgqu-sz   await  svctm  %util
sdd          0.00 810.78  0.00 102.20    0.00 7306.99     0.00
3653.49    71.50     1.12   10.95   7.32  74.77
sde          0.00  25.35  0.00  6.19    0.00  252.30     0.00   126.15
40.77     0.50   81.32   5.94   3.67
sdd          0.00 884.20  0.00 126.00    0.00 8080.00     0.00
4040.00    64.13     1.26   10.00   7.11  89.64
sde          0.00  21.40  0.00  5.00    0.00  211.20     0.00   105.60
42.24     0.31   62.56   6.52   3.26
sdd          0.00 924.80  0.00 116.20    0.00 8326.40     0.00
4163.20    71.66     1.23   10.59   7.37  85.64
sde          0.00  27.60  0.00 26.60    0.00  433.60     0.00   216.80
16.30     4.24  159.29   2.44   6.50
sdd          0.00 721.20  0.00 102.40    0.00 6588.80     0.00
3294.40    64.34     0.99    9.71   7.07  72.40
sde          0.00 1446.80  0.00 101.60    0.00 20289.60     0.00
10144.80   199.70  1192.40  572.45   2.29  23.30
sdd          0.00   0.00  0.00  0.20    0.00    1.60     0.00     0.80
8.00     0.11  539.00 539.00  10.80
sde          0.00   0.00  0.00 452.10    0.00    0.00     0.00
0.00     0.00  3829.57 3715.83   2.22 100.22
sdd          0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00
0.00     0.00    0.00   0.00   0.00
sde          0.00   0.00  0.00 349.80    0.00    0.00     0.00
0.00     0.00  1745.52 8515.74   2.86 100.02
sdd          0.00 442.40  0.00 51.00    0.00 3948.80     0.00  1974.40
77.43     0.60   11.73   7.54  38.46
sde          0.00   2.80  0.00 184.00    0.00   25.60     0.00
12.80     0.14   277.52 12629.41   3.19  58.74
sdd          0.00 898.00  0.00 124.80    0.00 8182.40     0.00
4091.20    65.56     1.30   10.40   7.24  90.30
sde          0.00  19.20  0.00  3.80    0.00  184.00     0.00    92.00
48.42     0.24   62.11  14.11   5.36
sdd          0.00 842.28  0.00 109.02    0.00 7612.02     0.00
3806.01    69.82     1.33   12.26   8.35  91.02
sde          0.00  45.49  0.00 46.89    0.00  739.08     0.00   369.54
15.76     9.04  192.73   3.38  15.85
sdd          0.00 1198.41  0.00 71.51    0.00 10505.18     0.00
5252.59   146.90   128.19   99.76  13.48  96.43
sde          0.00 1357.77  0.00 199.80    0.00 19263.75     0.00
9631.87    96.41  2251.09 1179.42   2.39  47.81
sdd          0.00   0.00  0.00  7.20    0.00    0.00     0.00     0.00
0.00   203.87 5671.83 138.92 100.02
sde          0.00   0.00  0.00 409.60    0.00    0.00     0.00
0.00     0.00  3171.04 4779.83   2.44 100.02
sdd          0.00   0.00  0.00 17.80    0.00    0.00     0.00     0.00
0.00   137.87 10240.90  56.19 100.02
sde          0.00   0.00  0.00 240.60    0.00    0.00     0.00
0.00     0.00  1573.85 9815.29   4.16 100.02
sdd          0.00 109.80  0.00 35.40    0.00 1012.80     0.00   506.40
28.61    42.14 7974.47  27.86  98.64
sde          0.00   2.80  0.00 198.80    0.00   30.40     0.00
15.20     0.15   428.49 14474.39   4.30  85.56
sdd          0.00 466.20  0.00 62.80    0.00 4230.40     0.00  2115.20
67.36     0.59    9.49   6.79  42.62
sde          0.00   5.20  0.00  0.80    0.00   48.00     0.00    24.00
60.00     0.01   16.25  11.25   0.90
sdd          0.00   0.00  0.00  0.20    0.00    1.60     0.00     0.80
8.00     0.01   35.00  35.00   0.70
sde          0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00
0.00     0.00    0.00   0.00   0.00


I suspect that fsync() is the criminal as I've been afraid.  I'll show
you an interesting data.  I collected the stack traces of backend
processes while a checkpoint is happening.


[bgwriter]
Oh, he is fsync()ing hard.

#0  0x0000003a629bfbb2 in __fsync_nocancel () from
/lib64/tls/libc.so.6
#1  0x00000000005742a1 in mdsync ()
#2  0x00000000005753d7 in smgrsync ()
#3  0x0000000000564d65 in FlushBufferPool ()
...

[some backends]
They are forced to wait for some lock pertaining to WAL when they try
to insert a log record.

#0  0x0000003a629c7b79 in semop () from /lib64/tls/libc.so.6
#1  0x000000000054c9c6 in PGSemaphoreLock ()
#2  0x0000000000572374 in LWLockAcquire ()
#3  0x000000000046ab10 in XLogInsert ()

[some backends]
They are waiting for some lock when they commit.

#0  0x0000003a629c7b79 in semop () from /lib64/tls/libc.so.6
#1  0x000000000054c9c6 in PGSemaphoreLock ()
#2  0x0000000000572374 in LWLockAcquire ()
#3  0x000000000046a746 in XLogFlush ()
#4  0x000000000046228e in RecordTransactionCommit ()

[the second criminal backend]
He is making other backends wait for WALWRITE lock, because he is
doing a long-time operation - fdatasync() - with the WALWRITE lock
being held.
But I think he is also a victim.  He is probably forced to wait by
bgwriter's fsync() activity.

#0  0x0000003a629bfc59 in fdatasync () from /lib64/tls/libc.so.6
#1  0x0000000000469ac0 in issue_xlog_fsync ()
#2  0x000000000046a210 in XLogWrite ()
#3  0x000000000046a7df in XLogFlush ()
#4  0x000000000046228e in RecordTransactionCommit ()

The following stack trace sometimes appears as the second criminal
instead of the above:

#0  0x0000003a629bfbb2 in __fsync_nocancel () from
/lib64/tls/libc.so.6
#1  0x000000000046530e in UpdateControlFile ()
#2  0x000000000046a05d in XLogWrite ()
#3  0x000000000046a7df in XLogFlush ()
#4  0x000000000046228e in RecordTransactionCommit ()


[Conclusion]
I believe that the problem cannot be solved in a real sense by
avoiding fsync/fdatasync().  We can't ignore what commercial databases
have done so far.  The kernel does as much as he likes when PostgreSQL
requests him to fsync().
One question is the disk utilization.  While bgwriter is fsync()ing,
%util of WAL disk drops to almost 0.  But the the bandwidth of
Ultra320 SCSI does not appear to be used fully.  Any idea?
So what should we do next?

Re: Load distributed checkpoint

From

Martijn van Oosterhout

Date:

20 December 2006, 07:17:43

On Wed, Dec 20, 2006 at 08:10:56PM +0900, Takayuki Tsunakawa wrote:
> One question is the disk utilization.  While bgwriter is fsync()ing,
> %util of WAL disk drops to almost 0.  But the the bandwidth of
> Ultra320 SCSI does not appear to be used fully.  Any idea?

That implies that fsyncing a datafile blocks fsyncing the WAL. That
seems terribly unlikely (although...). What OS/Kernel/Filesystem is
this. I note a sync bug in linux for ext3 that may have relevence.

Have a nice day,

--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: Load distributed checkpoint

From

"Takayuki Tsunakawa"

Date:

20 December 2006, 08:08:17

> That implies that fsyncing a datafile blocks fsyncing the WAL. That
> seems terribly unlikely (although...). What OS/Kernel/Filesystem is
> this. I note a sync bug in linux for ext3 that may have relevence.

Oh, really?  What bug?  I've heard that ext3 reports wrong data to
iostat when it performs writes (the data is correct when performing
reads.)
My env is:

OS: RHEL 4.0 for AMD64/EM64T
kernel: 2.6.9-42.ELsmp
The file system is ext3.

Terribly unlikely?  But I've seen the disk utilization quite often.

Re: Load distributed checkpoint

From

Martijn van Oosterhout

Date:

20 December 2006, 09:49:18

On Wed, Dec 20, 2006 at 09:14:50PM +0900, Takayuki Tsunakawa wrote:
> > That implies that fsyncing a datafile blocks fsyncing the WAL. That
> > seems terribly unlikely (although...). What OS/Kernel/Filesystem is
> > this. I note a sync bug in linux for ext3 that may have relevence.
>
> Oh, really?  What bug?  I've heard that ext3 reports wrong data to
> iostat when it performs writes (the data is correct when performing
> reads.)

I was referring to this in the 2.6.6 changelog:

http://www.linuxhq.com/kernel/changelog/v2.6/6/index.html
  ext3's fsync/fdatasync implementation is currently syncing the inode via a  full journal commit even if it was
unaltered.  
However you're running a later version so that's not it.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: Load distributed checkpoint

From

"Inaam Rana"

Date:

20 December 2006, 16:29:42

On 12/20/06, Takayuki Tsunakawa <tsunakawa.takay@jp.fujitsu.com> wrote:

[Conclusion]
I believe that the problem cannot be solved in a real sense by
avoiding fsync/fdatasync(). We can't ignore what commercial databases
have done so far. The kernel does as much as he likes when PostgreSQL
requests him to fsync().

I am new to the community and am very interested in the tests that you have done. I am also working on resolving the sudden IO spikes at checkpoint time. I agree with you that fsync() is the core issue here.

Being a new member I was wondering if someone on this list has done testing with O_DIRECT and/or O_SYNC for datafiles as that seems to be the most logical way of dealing with fsync() flood at checkpoint time. If so, I'll be very interested in the results. As mentioned in this thread that a single bgwriter with O_DIRECT will not be able to keep pace with cleaning effort causing backend writes. I think (i.e. IMHO) multiple bgwriters and/or AsyncIO with O_DIRECT can resolve this issue.

Talking of bgwriter_* parameters I think we are missing a crucial internal counter i.e. number of dirty pages. How much work bgwriter has to do at each wakeup call should be a function of total buffers and currently dirty buffers. Relying on both these values instead of just one static NBuffers should allow bgwriter to adapt more quickly to workload changes and ensure that not much work is accumulated for checkpoint.

--
Inaam Rana
EnterpriseDB http://www.enterprisedb.com

Re: Load distributed checkpoint

From

"Takayuki Tsunakawa"

Date:

20 December 2006, 20:45:27

On 12/20/06, Takayuki Tsunakawa <tsunakawa.takay@jp.fujitsu.com> wrote:
> > [Conclusion]
> > I believe that the problem cannot be solved in a real sense by
> > avoiding fsync/fdatasync(). We can't ignore what commercial databases
> > have done so far. The kernel does as much as he likes when PostgreSQL
> > requests him to fsync().

From: Inaam Rana
> I am new to the community and am very interested in the tests that you have done. I am also working on resolving the sudden IO spikes at checkpoint time. I agree with you that fsync() is the core issue here.

Thank you for understanding my bad English correctly. Yes, what I've been insisting is that it is necessary to avoid fsync()/fdatasync() and to use O_SYNC (plus O_DIRECT if supported on the target platform) to really eliminate the big spikes.
In my mail, the following sentence made a small mistake.

"I believe that the problem cannot be solved in a real sense by avoiding fsync/fdatasync()."

The correct sentence is:

"I believe that the problem cannot be solved in a real sense without avoiding fsync/fdatasync()."

> Being a new member I was wondering if someone on this list has done testing with O_DIRECT and/or O_SYNC for datafiles as that seems to be the most logical way of dealing with fsync() flood at checkpoint time. If so, I'll be very interested in the results.

Could you see the mail I sent on Dec 18? Its content was so long that I zipped the whole content and attached to the mail. I just performed the same test simply adding O_SYNC to open() in mdopen() and another function in md.c. I couldn't succeed in running with O_DIRECT because O_DIRECT requires the shared buffers to be aligned on the sector-size boundary. To perform O_DIRECT test, a little more modification is necessary to the code where the shared buffers are allocated.
The result was bad. But that's just a starting point. We need some improvements that commercial databases have done. I think some approaches we should take are:

(1) two-checkpoint (described in Jim Gray's textbook "Transaction Processing: Concepts and Techniques"
(2) what Oracle suggests in its manual (see my previous mails)
(3) write multiple contiguous buffers with one write() to decrease the count of write() calls

> As mentioned in this thread that a single bgwriter with O_DIRECT will not be able to keep pace with cleaning effort causing backend writes. I think (i.e. IMHO) multiple bgwriters and/or AsyncIO with O_DIRECT can resolve this issue.

I agree with you. Oracle provides a parameter called DB_WRITER_PROCESSES to set the number of database writer processes. Oracle also provides asynchronous I/O to solve the problem you are saying about. Please see section 10.3.9 the following page:

http://download-west.oracle.com/docs/cd/B19306_01/server.102/b14211/instance_tune.htm#sthref1049

> Talking of bgwriter_* parameters I think we are missing a crucial internal counter i.e. number of dirty pages. How much work bgwriter has to do at each wakeup call should be a function of total buffers and currently dirty buffers. Relying on both these values instead of just one static NBuffers should allow bgwriter to adapt more quickly to workload changes and ensure that not much work is accumulated for checkpoint.

I agree with you in the sense that the current bgwriter is a bit careless about the system load. I believe that PostgreSQL should be more gentle to OLTP transactions -- many users of the system as a result. I think the speed of WAL accumulation should also be taken into account. Let's list up the problems and ideas.

--

Re: Load distributed checkpoint

From

ITAGAKI Takahiro

Date:

20 December 2006, 22:55:52

"Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> wrote:

> I have to report a sad result.  Your patch didn't work.  Let's
> consider the solution together.  What you are addressing is very
> important for the system designers in the real world -- smoothing
> response time.

You were running the test on the very memory-depend machine.
> shared_buffers = 4GB / The scaling factor is 50, 800MB of data.
Thet would be why the patch did not work. I tested it with DBT-2, 10GB of
data and 2GB of memory. Storage is always the main part of performace here,
even not in checkpoints.

If you use Linux, it has very unpleased behavior in fsync(); It locks all
metadata of the file being fsync-ed. We have to wait for the completion of
fsync when we do read(), write(), and even lseek().

Almost of your data is in the accounts table and it was stored in a single
file. All of transactions must wait for fsync to the single largest file,
so you saw the bottleneck was in the fsync.

> [Conclusion]
> I believe that the problem cannot be solved in a real sense by
> avoiding fsync/fdatasync().

I think so, too. However, I assume we can resolve a part of the
checkpoint spikes with smoothing of write() alone.

BTW, can we use the same way to fsync? We call fsync()s to all modified
files without rest in mdsync(), but it's not difficult at all to insert
sleeps between fsync()s. Do you think it helps us? One of issues is that
we have to sleep in file unit, which is maybe rough granularity.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Re: Load distributed checkpoint

From

"Takayuki Tsunakawa"

Date:

21 December 2006, 04:07:22

From: "ITAGAKI Takahiro" <itagaki.takahiro@oss.ntt.co.jp>
> You were running the test on the very memory-depend machine.
>> shared_buffers = 4GB / The scaling factor is 50, 800MB of data.
> Thet would be why the patch did not work. I tested it with DBT-2,
10GB of
> data and 2GB of memory. Storage is always the main part of
performace here,
> even not in checkpoints.

Yes, I used half the size of RAM as the shared buffers, which is
reasonable.  And I cached all the data.  The effect of fsync() is a
heavier offence, isn't it?  System administrators would say "I have
enough memory.  The data hasn't exhausted the DB cache yet.  But the
users complain to me about the response.  Why?  What should I do?
What?  Checkpoint??  Why doesn't PostgreSQL take care of frontend
users?"
BTW, is DBT-2 an OLTP benchmark which randomly access some parts of
data, or a batch application which accesses all data?  I'm not
familiar with it.  I know that IPA opens it to the public.

> If you use Linux, it has very unpleased behavior in fsync(); It
locks all
> metadata of the file being fsync-ed. We have to wait for the
completion of
> fsync when we do read(), write(), and even lseek().
> Almost of your data is in the accounts table and it was stored in a
single
> file. All of transactions must wait for fsync to the single largest
file,
> so you saw the bottleneck was in the fsync.

Oh, really, what an evil fsync is!  Yes, I sometimes saw a backend
waiting for lseek() to complete when it committed.  But why does the
backend which is syncing WAL/pg_control have to wait for syncing the
data file?  They are, not to mention, different files, and WAL and
data files are stored on separate disks.


>> [Conclusion]
>> I believe that the problem cannot be solved in a real sense by
>> avoiding fsync/fdatasync().
>
> I think so, too. However, I assume we can resolve a part of the
> checkpoint spikes with smoothing of write() alone.

First, what's the goal (if possible numerically?  Have you explained
to community members why the patch would help many people?  At least,
I haven't heard that fsync() can be seriously bad and we would close
our eyes to what fsync() does.
By the way, what good results did you get with DBT-2?  If you don't
mind, can you show us?


> BTW, can we use the same way to fsync? We call fsync()s to all
modified
> files without rest in mdsync(), but it's not difficult at all to
insert
> sleeps between fsync()s. Do you think it helps us? One of issues is
that
> we have to sleep in file unit, which is maybe rough granularity.

No, it definitely won't help us.  There is no reason why it will help.
It might help in some limited environments, though, how can we
characterize such environments?  Can we say "our approach helps our
environments, but it won't help you.  The kernel VM settings may help
you.  Good luck!"?
We have to consider seriously.  I think it's time to face the problem
and we should follow the approaches of experts like Jim Gray and DBMS
vendors, unless we have a new clever idea like them.

Re: Load distributed checkpoint

From

ITAGAKI Takahiro

Date:

21 December 2006, 05:46:46

"Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> wrote:

> > If you use Linux, it has very unpleased behavior in fsync(); It locks all
> > metadata of the file being fsync-ed. We have to wait for the completion of
> > fsync when we do read(), write(), and even lseek().
> 
> Oh, really, what an evil fsync is!  Yes, I sometimes saw a backend
> waiting for lseek() to complete when it committed.  But why does the
> backend which is syncing WAL/pg_control have to wait for syncing the
> data file?  They are, not to mention, different files, and WAL and
> data files are stored on separate disks.

Backends call lseek() in planning, so they have to wait fsync() to
the table that they will access. Even if all of data in the file is in
the cache, lseek() conflict with fsync(). You can see a lot of backends
are waiting in planning phase in checkpoints, not executing phase.

> > it's not difficult at all to insert sleeps between fsync()s.

> Can we say "our approach helps our
> environments, but it won't help you.  The kernel VM settings may help
> you.  Good luck!"?

I didn't say such a thing at all.
There are several opinions in the discussion: 1. High bgwriter setting is enough! 2. Change your OS :-) 3. Use O_SYNC
orO_DIRECT, but very poor performance. 4. We may settle for single fsync(), but not many fsync()s in a short time.

I just suggested 4.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Re: Load distributed checkpoint

From

"Takayuki Tsunakawa"

Date:

21 December 2006, 06:22:00

To: "Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com>
Cc: <pgsql-hackers@postgresql.org>
Sent: Thursday, December 21, 2006 6:46 PM
Subject: Re: [HACKERS] Load distributed checkpoint


>
From: "ITAGAKI Takahiro" <itagaki.takahiro@oss.ntt.co.jp>
> "Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> wrote:
>> Oh, really, what an evil fsync is!  Yes, I sometimes saw a backend
>> waiting for lseek() to complete when it committed.  But why does
the
>> backend which is syncing WAL/pg_control have to wait for syncing
the
>> data file?  They are, not to mention, different files, and WAL and
>> data files are stored on separate disks.
>
> Backends call lseek() in planning, so they have to wait fsync() to
> the table that they will access. Even if all of data in the file is
in
> the cache, lseek() conflict with fsync(). You can see a lot of
backends
> are waiting in planning phase in checkpoints, not executing phase.

I see.  I found one backend like the following.  But one in my case
one out of 16 backends.  Most of others are waiting to acquire
WALWRITE lock.

#0  0x0000003a629c6902 in __lseek_nocancel () from
/lib64/tls/libc.so.6
#1  0x000000000056789f in FileSeek ()
#2  0x0000000000574053 in mdnblocks ()
#3  0x0000000000574f4a in smgrnblocks ()
#4  0x00000000005489e8 in estimate_rel_size ()
#5  0x0000000000548bee in get_relation_info ()
#6  0x000000000054aa3d in build_simple_rel ()
#7  0x0000000000539c6b in add_base_rels_to_query ()
#8  0x000000000053b955 in query_planner ()
#9  0x000000000053c1c9 in grouping_planner ()
#10 0x000000000053d3b4 in subquery_planner ()
#11 0x000000000053d5b3 in planner ()
#12 0x00000000005778fc in pg_plan_query ()
#13 0x000000000057798c in pg_plan_queries ()
#14 0x0000000000577c53 in exec_simple_query ()

Re: Load distributed checkpoint

From

Martijn van Oosterhout

Date:

21 December 2006, 06:27:25

On Thu, Dec 21, 2006 at 06:46:36PM +0900, ITAGAKI Takahiro wrote:
> > Oh, really, what an evil fsync is!  Yes, I sometimes saw a backend
> > waiting for lseek() to complete when it committed.  But why does the
> > backend which is syncing WAL/pg_control have to wait for syncing the
> > data file?  They are, not to mention, different files, and WAL and
> > data files are stored on separate disks.
>
> Backends call lseek() in planning, so they have to wait fsync() to
> the table that they will access. Even if all of data in the file is in
> the cache, lseek() conflict with fsync(). You can see a lot of backends
> are waiting in planning phase in checkpoints, not executing phase.

Hmm, there are other ways to sync parts of a file. For example doing an
mmap()/msync()/munmap() cycle to start an asyncronous flush. But given
what you're saying that might suffer from the same problem.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: Load distributed checkpoint

From

"Zeugswetter Andreas ADI SD"

Date:

21 December 2006, 10:04:30

> > You were running the test on the very memory-depend machine.
> >> shared_buffers = 4GB / The scaling factor is 50, 800MB of data.
> > Thet would be why the patch did not work. I tested it with DBT-2,
10GB of
> > data and 2GB of memory. Storage is always the main part of
performace here,
> > even not in checkpoints.
>
> Yes, I used half the size of RAM as the shared buffers, which is
> reasonable.  And I cached all the data.

For pg, half RAM for shared_buffers is too much. The ratio is good for
other db software, that does not use the OS cache.

Andreas

Re: Load distributed checkpoint

From

"Takayuki Tsunakawa"

Date:

21 December 2006, 20:17:07

----- Original Message ----- 
From: "Zeugswetter Andreas ADI SD" <ZeugswetterA@spardat.at>
To: "Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com>; "ITAGAKI
Takahiro" <itagaki.takahiro@oss.ntt.co.jp>
> > Yes, I used half the size of RAM as the shared buffers, which is
> > reasonable.  And I cached all the data.

> For pg, half RAM for shared_buffers is too much. The ratio is good
for
> other db software, that does not use the OS cache.

What percentage of RAM is recommended for shared buffers in general?
40%?  30%?  Or, is the general recommendation like "According to the
amount of your data, this much RAM should be left for the kernel
cache.  But tha's the story on Linux.  It may be different for other
OSes."?
Hmm,  if it is so, it sounds hard for system designers/administrators
to judge.




----- Original Message ----- 
From: "Zeugswetter Andreas ADI SD" <ZeugswetterA@spardat.at>
To: "Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com>; "ITAGAKI
Takahiro" <itagaki.takahiro@oss.ntt.co.jp>
Cc: <pgsql-hackers@postgresql.org>
Sent: Thursday, December 21, 2006 11:04 PM
Subject: RE: [HACKERS] Load distributed checkpoint



> > You were running the test on the very memory-depend machine.
> >> shared_buffers = 4GB / The scaling factor is 50, 800MB of data.
> > Thet would be why the patch did not work. I tested it with DBT-2,
10GB of
> > data and 2GB of memory. Storage is always the main part of
performace here,
> > even not in checkpoints.
>
> Yes, I used half the size of RAM as the shared buffers, which is
> reasonable.  And I cached all the data.

For pg, half RAM for shared_buffers is too much. The ratio is good for
other db software, that does not use the OS cache.

Andreas

Re: Load distributed checkpoint

From

ITAGAKI Takahiro

Date:

21 December 2006, 23:19:58

"Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> wrote:

> > For pg, half RAM for shared_buffers is too much. The ratio is good for
> > other db software, that does not use the OS cache.
> 
> What percentage of RAM is recommended for shared buffers in general?
> 40%?  30%?  Or, is the general recommendation like "According to the
> amount of your data, this much RAM should be left for the kernel
> cache.  But tha's the story on Linux.  It may be different for other
> OSes."?
> Hmm,  if it is so, it sounds hard for system designers/administrators
> to judge.

If you use linux, try the following settings: 1. Decrease /proc/sys/vm/dirty_ratio and dirty_background_ratio. 2.
Increasewal_buffers to redule WAL flushing. 3. Set wal_sync_method to open_sync; O_SYNC is faster then fsync(). 4.
Separatedata and WAL files into different partitions or disks.

I suppose 1 is important for you, because kernel will not write dirty
buffers until 10% of buffers become dirty in default settings.
You have large memory (8GB), but small data set (800MB). So kernel
almost never writes buffers not in checkpoints. Accumulate dirty buffers
are written at a burst in fsync().

We would be happy if we would be free from a difficult combination
of tuning. If you have *idea for improvements*, please suggest it.
I think we've already understood *problem itself*.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Re: Load distributed checkpoint

From

"Takayuki Tsunakawa"

Date:

22 December 2006, 02:13:21

Hello, Itagaki-san,

Thank you for an interesting piece of information.

From: "ITAGAKI Takahiro" <itagaki.takahiro@oss.ntt.co.jp>
> If you use linux, try the following settings:
>  1. Decrease /proc/sys/vm/dirty_ratio and dirty_background_ratio.
>  2. Increase wal_buffers to redule WAL flushing.
>  3. Set wal_sync_method to open_sync; O_SYNC is faster then fsync().
>  4. Separate data and WAL files into different partitions or disks.
>
> I suppose 1 is important for you, because kernel will not write
dirty
> buffers until 10% of buffers become dirty in default settings.
> You have large memory (8GB), but small data set (800MB). So kernel
> almost never writes buffers not in checkpoints. Accumulate dirty
buffers
> are written at a burst in fsync().

I'll show the results of this tuning to share information with people
who don't have experience of this kind.
The numbers shown below are the tps when running "pgbench -c16 -t100
postgres" five times in succession.

(1) Default case(this is show again for comparison and reminder)
The bgwriter_* and checkpoint_* are set to those defaults.
wal_buffers and wal_sync_method are also set to those defaults (64kB
and fdatasync respectively.)

235  80  226  77  240


(2) Default + WAL 1MB case
The configuration is the same as case (1) except that wal_buffers is
set to 1024kB.

302  328  82  330  85

This is better improvement than I expected.


(3) Default + wal_sync_method=open_sync case
The configuration is the same as case (1) except that wal_sync_method
is set to open_sync.

162  67  176  67  164

Too bad compared to case (2).  Do you know the reason?


(4) (2)+(3) case

322  350  85  321  84

This is good, too.


(5) (4) + /proc/sys/vm/dirty* tuning
dirty_background_ratio is changed from 10 to 1, and dirty_ratio is
changed from 40 to 4.

308  349  84  349  84

The tuning of kernel cache doesn't appear to bring performance
improvement in my env.  The kernel still waits too long before it
starts flushing dirty buffers because the cache is large?  If so,
increasingly available RAM may cause trouble more frequently in the
near future.  Do the dirty_*_ratio accept values less than 1?

BTW, in case (1), the best response time of a transaction was 6
milliseconds.  On the other hand, the worst response time was 13
seconds.


> We would be happy if we would be free from a difficult combination
> of tuning. If you have *idea for improvements*, please suggest it.
> I think we've already understood *problem itself*.

I agree with you.  Let's make the ideas more concrete, doing some
experimentations.

Re: Load distributed checkpoint

From

Greg Smith

Date:

22 December 2006, 02:16:09

On Wed, 20 Dec 2006, Inaam Rana wrote:

> Talking of bgwriter_* parameters I think we are missing a crucial 
> internal counter i.e. number of dirty pages. How much work bgwriter has 
> to do at each wakeup call should be a function of total buffers and 
> currently dirty buffers.

This is actually a question I'd been meaning to throw out myself to this 
list.  How hard would it be to add an internal counter to the buffer 
management scheme that kept track of the current number of dirty pages? 
I've been looking at the bufmgr code lately trying to figure out how to 
insert one as part of building an auto-tuning bgwriter, but it's unclear 
to me how I'd lock such a resource properly and scalably.  I have a 
feeling I'd be inserting a single-process locking bottleneck into that 
code with any of the naive implementations I considered.

The main problem I've been seeing is also long waits stuck behind a slow 
fsync on Linux.  What I've been moving toward testing is an approach 
slightly different from the proposals here.  What if all the database page 
writes (background writer, buffer eviction, or checkpoint scan) were 
counted and periodic fsync requests send to the bgwriter based on that? 
For example, when I know I have a battery-backed caching controller that 
will buffer 64MB worth of data for me, if I forced a fsync after every 
6000 8K writes, no single fsync would get stuck waiting for the disk to 
write for longer than I'd like.

Give the admin a max_writes_before_sync parameter, make the default of 0 
work just like the current behavior, and off you go; a simple tunable that 
doesn't require a complicated scheme to implement or break anybody's 
existing setup.  Combined with a properly tuned background writer, that 
would solve the issues I've been running into.  It would even make the 
problem of Linux caching too many writes until checkpoint time go away (I 
know how to eliminate that by adjusting caching policy, but I have to be 
root to do it; a DBA should be able to work around that issue even if they 
don't have access to the kernel tunables.)

While I'm all for testing to prove me wrong, my gut feel is that going all 
the way to sync writes a la Oracle is a doomed approach, particularly on 
low-end hardware where they're super expensive.  Following The Oracle Way 
is a good roadmap for a lot of things, but I wouldn't put building a lean 
enough database to run on modest hardware on that list.  You can do sync 
writes with perfectly good performance on systems with a good 
battery-backed cache, but I think you'll get creamed in comparisons 
against MySQL on IDE disks if you start walking down that path; since 
right now a fair comparison with similar logging behavior is an even match 
there, that's a step backwards.

Also on the topic of sync writes to the database proper:  wouldn't using 
O_DIRECT for those potentially counter-productive?  I was under the 
impressions that one of the behaviors counted on by Postgres was that data 
evicted from its buffer cache, eventually intended for writing to disk, 
was still kept around for a bit in the OS buffer cache.  A subsequent read 
because the data was needed again might find the data already in the OS 
buffer, therefore avoiding an actual disk read; that substantially reduces 
the typical penalty for the database engine making a bad choice on what to 
evict.  I fear a move to direct writes would put more pressure on the LRU 
implementation to be very smart, and that's code that you really don't 
want to be more complicated.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Load distributed checkpoint

From

"Takayuki Tsunakawa"

Date:

22 December 2006, 02:58:10

From: "Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com>
> (5) (4) + /proc/sys/vm/dirty* tuning
> dirty_background_ratio is changed from 10 to 1, and dirty_ratio is
> changed from 40 to 4.
>
> 308  349  84  349  84

Sorry, I forgot to include the result when using Itagaki-san's patch.
The patch showd the following tps for case (5).

323  350  340  59  225

The best response time was 4 msec, and the worst one was 16 seconds.


----- Original Message ----- 
From: "Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com>
To: "ITAGAKI Takahiro" <itagaki.takahiro@oss.ntt.co.jp>
Cc: <pgsql-hackers@postgresql.org>
Sent: Friday, December 22, 2006 3:20 PM
Subject: Re: [HACKERS] Load distributed checkpoint


> Hello, Itagaki-san,
>
> Thank you for an interesting piece of information.
>
> From: "ITAGAKI Takahiro" <itagaki.takahiro@oss.ntt.co.jp>
>> If you use linux, try the following settings:
>>  1. Decrease /proc/sys/vm/dirty_ratio and dirty_background_ratio.
>>  2. Increase wal_buffers to redule WAL flushing.
>>  3. Set wal_sync_method to open_sync; O_SYNC is faster then
fsync().
>>  4. Separate data and WAL files into different partitions or disks.
>>
>> I suppose 1 is important for you, because kernel will not write
> dirty
>> buffers until 10% of buffers become dirty in default settings.
>> You have large memory (8GB), but small data set (800MB). So kernel
>> almost never writes buffers not in checkpoints. Accumulate dirty
> buffers
>> are written at a burst in fsync().
>
> I'll show the results of this tuning to share information with
people
> who don't have experience of this kind.
> The numbers shown below are the tps when running "pgbench -c16 -t100
> postgres" five times in succession.
>
> (1) Default case(this is show again for comparison and reminder)
> The bgwriter_* and checkpoint_* are set to those defaults.
> wal_buffers and wal_sync_method are also set to those defaults (64kB
> and fdatasync respectively.)
>
> 235  80  226  77  240
>
>
> (2) Default + WAL 1MB case
> The configuration is the same as case (1) except that wal_buffers is
> set to 1024kB.
>
> 302  328  82  330  85
>
> This is better improvement than I expected.
>
>
> (3) Default + wal_sync_method=open_sync case
> The configuration is the same as case (1) except that
wal_sync_method
> is set to open_sync.
>
> 162  67  176  67  164
>
> Too bad compared to case (2).  Do you know the reason?
>
>
> (4) (2)+(3) case
>
> 322  350  85  321  84
>
> This is good, too.
>
>
> (5) (4) + /proc/sys/vm/dirty* tuning
> dirty_background_ratio is changed from 10 to 1, and dirty_ratio is
> changed from 40 to 4.
>
> 308  349  84  349  84
>
> The tuning of kernel cache doesn't appear to bring performance
> improvement in my env.  The kernel still waits too long before it
> starts flushing dirty buffers because the cache is large?  If so,
> increasingly available RAM may cause trouble more frequently in the
> near future.  Do the dirty_*_ratio accept values less than 1?
>
> BTW, in case (1), the best response time of a transaction was 6
> milliseconds.  On the other hand, the worst response time was 13
> seconds.
>
>
>> We would be happy if we would be free from a difficult combination
>> of tuning. If you have *idea for improvements*, please suggest it.
>> I think we've already understood *problem itself*.
>
> I agree with you.  Let's make the ideas more concrete, doing some
> experimentations.
>
>
>
>
> ---------------------------(end of
broadcast)---------------------------
> TIP 7: You can help support the PostgreSQL project by donating at
>
>                http://www.postgresql.org/about/donate
>

Re: Load distributed checkpoint

From

"Inaam Rana"

Date:

22 December 2006, 03:48:36

On 12/22/06, Takayuki Tsunakawa <tsunakawa.takay@jp.fujitsu.com> wrote:

From: "Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com>
> (5) (4) + /proc/sys/vm/dirty* tuning
> dirty_background_ratio is changed from 10 to 1, and dirty_ratio is
> changed from 40 to 4.
>
> 308 349 84 349 84

Sorry, I forgot to include the result when using Itagaki-san's patch.
The patch showd the following tps for case (5).

323 350 340 59 225

The best response time was 4 msec, and the worst one was 16 seconds.

Which IO Shceduler (elevator) you are using?

--
Inaam Rana
EnterpriseDB http://www.enterprisedb.com

Re: Load distributed checkpoint

From

"Takayuki Tsunakawa"

Date:

22 December 2006, 04:21:31

From: Inaam Rana
> Which IO Shceduler (elevator) you are using?

Elevator? Sorry, I'm not familiar with the kernel implementation, so I don't what it is. My Linux distribution is Red Hat Enterprise Linux 4.0 for AMD64/EM64T, and the kernel is 2.6.9-42.ELsmp. I probably havn't changed any kernel settings, except for IPC settings to run PostgreSQL.

Re: Load distributed checkpoint

From

ITAGAKI Takahiro

Date:

22 December 2006, 05:09:41

"Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> wrote:

> (1) Default case(this is show again for comparison and reminder)
> 235  80  226  77  240
> (2) Default + WAL 1MB case
> 302  328  82  330  85
> (3) Default + wal_sync_method=open_sync case
> 162  67  176  67  164
> (4) (2)+(3) case
> 322  350  85  321  84
> (5) (4) + /proc/sys/vm/dirty* tuning
> 308  349  84  349  84

(3) is very strange. Your machine seems to be too restricted
by WAL so that other factors cannot be measured properly.


I'll send results on my machine.

- Pentium4 3.6GHz with HT / 3GB RAM / Windows XP :-)
- shared_buffers=1GB
- wal_sync_method = open_datasync
- wal_buffers = 1MB
- checkpoint_segments = 16
- checkpoint_timeout = 5min

I repeated "pgbench -c16 -t500 -s50"
and picked up results around checkpoints.

[HEAD]
...
560.8
373.5 <- checkpoint is here
570.8
...

[with patch]
...
562.0
528.4 <- checkpoint (fsync) is here
547.0
...

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Re: Load distributed checkpoint

From

"Takayuki Tsunakawa"

Date:

22 December 2006, 05:31:26

From: "Greg Smith" <gsmith@gregsmith.com>
> This is actually a question I'd been meaning to throw out myself to
this
> list.  How hard would it be to add an internal counter to the buffer
> management scheme that kept track of the current number of dirty
pages?
> I've been looking at the bufmgr code lately trying to figure out how
to
> insert one as part of building an auto-tuning bgwriter, but it's
unclear
> to me how I'd lock such a resource properly and scalably.  I have a
> feeling I'd be inserting a single-process locking bottleneck into
that
> code with any of the naive implementations I considered.

To put it in an extreme way, how about making bgwriter count the dirty
buffers periodically scanning all the buffers?  Do you know the book
"Principles of Transaction Processing"?  Jim Gray was one of the
reviewers of this book.


http://www.amazon.com/gp/aa.html?HMAC=&CartId=&Operation=ItemLookup&&ItemId=1558604154&ResponseGroup=Request,Large,Variations&bStyle=aaz.jpg&MerchantId=All&isdetail=true&bsi=Books&logo=foo&Marketplace=us&AssociateTag=pocketpc

In chapter 8, the author describes fuzzy checkpoint combined with
two-checkpoint approach.  In his explanation, recovery manager (which
would be bgwriter in PostgreSQL) scans the buffers and records the
list of dirty buffers at each checkpoint.  This won't need any locking
in PostgreSQL if I understand correctly.  Then, the recovery manager
performs the next checkpoint after writing those dirty buffers.  In
two-checkpoint approach, crash recovery starts redoing from the second
to last checkpoint.  Two-checkpoint is described in Jim Gray's book,
too.  But they don't refer to how the recovery manager tunes the speed
of writing.


> slightly different from the proposals here.  What if all the
database page
> writes (background writer, buffer eviction, or checkpoint scan) were
> counted and periodic fsync requests send to the bgwriter based on
that?
> For example, when I know I have a battery-backed caching controller
that
> will buffer 64MB worth of data for me, if I forced a fsync after
every
> 6000 8K writes, no single fsync would get stuck waiting for the disk
to
> write for longer than I'd like.

That seems interesting.

> You can do sync
> writes with perfectly good performance on systems with a good
> battery-backed cache, but I think you'll get creamed in comparisons
> against MySQL on IDE disks if you start walking down that path;
since
> right now a fair comparison with similar logging behavior is an even
match
> there, that's a step backwards.

I wonder what characteristics SATA disks have compared to IDE.  Recent
PCs are equiped with SATA disks, aren't they?
What do you feel your approach compares to MySQL on IDE disks?

> Also on the topic of sync writes to the database proper:  wouldn't
using
> O_DIRECT for those potentially counter-productive?  I was under the
> impressions that one of the behaviors counted on by Postgres was
that data
> evicted from its buffer cache, eventually intended for writing to
disk,
> was still kept around for a bit in the OS buffer cache.  A
subsequent read
> because the data was needed again might find the data already in the
OS
> buffer, therefore avoiding an actual disk read; that substantially
reduces
> the typical penalty for the database engine making a bad choice on
what to
> evict.  I fear a move to direct writes would put more pressure on
the LRU
> implementation to be very smart, and that's code that you really
don't
> want to be more complicated.

I'm worried about this, too.

Re: Load distributed checkpoint

From

"Takayuki Tsunakawa"

Date:

22 December 2006, 07:02:01

> (3) is very strange. Your machine seems to be too restricted
> by WAL so that other factors cannot be measured properly.

Right... It takes as long as 15 seconds to fsync 1GB file.  It's
strange.  This is a borrowed PC server, so the disk may be RAID 5?
However, the WAL disk and DB  disks show the same throughput.  I'll
investigate.  I may have to find another machine.

- Pentium4 3.6GHz with HT / 3GB RAM / Windows XP :-)

Oh, Windows.  Maybe the fsync() problem Itagaki-san pointed out does
not exist.
BTW, your env is showing attractive result, isn't it?

----- Original Message ----- 
From: "ITAGAKI Takahiro" <itagaki.takahiro@oss.ntt.co.jp>
To: "Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com>
Cc: <pgsql-hackers@postgresql.org>
Sent: Friday, December 22, 2006 6:09 PM
Subject: Re: [HACKERS] Load distributed checkpoint

"Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> wrote:

> (1) Default case(this is show again for comparison and reminder)
> 235  80  226  77  240
> (2) Default + WAL 1MB case
> 302  328  82  330  85
> (3) Default + wal_sync_method=open_sync case
> 162  67  176  67  164
> (4) (2)+(3) case
> 322  350  85  321  84
> (5) (4) + /proc/sys/vm/dirty* tuning
> 308  349  84  349  84

(3) is very strange. Your machine seems to be too restricted
by WAL so that other factors cannot be measured properly.

I'll send results on my machine.

- Pentium4 3.6GHz with HT / 3GB RAM / Windows XP :-)
- shared_buffers=1GB
- wal_sync_method = open_datasync
- wal_buffers = 1MB
- checkpoint_segments = 16
- checkpoint_timeout = 5min

I repeated "pgbench -c16 -t500 -s50"
and picked up results around checkpoints.

[HEAD]
...
560.8
373.5 <- checkpoint is here
570.8
...

[with patch]
...
562.0
528.4 <- checkpoint (fsync) is here
547.0
...

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Re: Load distributed checkpoint

From

"Zeugswetter Andreas ADI SD"

Date:

22 December 2006, 08:51:51

> > If you use linux, try the following settings:
> >  1. Decrease /proc/sys/vm/dirty_ratio and dirty_background_ratio.

You will need to pair this with bgwriter_* settings, else too few
pages are written to the os inbetween checkpoints.

> >  2. Increase wal_buffers to redule WAL flushing.

You will want the possibility of single group writes to be able to reach

256kb. The default is thus not enough when you have enough RAM.
You also want enough, so that new txns don't need to wait for an empty
buffer (that is only freed by a write).

> >  3. Set wal_sync_method to open_sync; O_SYNC is faster then fsync().

O_SYNC's only advantage over fdatasync is that it saves a system call,
since it still passes through OS cache, but the disadvantage is that it
does not let the OS group writes. Thus it is more susceptible to too
few WAL_buffers. What you want is O_DIRECT + enough wal_buffers to allow

256k writes.

> >  4. Separate data and WAL files into different partitions or disks.

While this is generally suggested, I somehow doubt the validity when you
only have few disk spindles. If e.g. you only have 2-3 (mirrored) disks
I
wouldn't do it (at least on the average 70/30 read write systems).

Andreas

Re: Load distributed checkpoint

From

"Simon Riggs"

Date:

22 December 2006, 10:35:29

On Thu, 2006-12-21 at 18:46 +0900, ITAGAKI Takahiro wrote:
> "Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> wrote:
> 
> > > If you use Linux, it has very unpleased behavior in fsync(); It locks all
> > > metadata of the file being fsync-ed. We have to wait for the completion of
> > > fsync when we do read(), write(), and even lseek().
> > 
> > Oh, really, what an evil fsync is!  Yes, I sometimes saw a backend
> > waiting for lseek() to complete when it committed.  But why does the
> > backend which is syncing WAL/pg_control have to wait for syncing the
> > data file?  They are, not to mention, different files, and WAL and
> > data files are stored on separate disks.
> 
> Backends call lseek() in planning, so they have to wait fsync() to
> the table that they will access. Even if all of data in the file is in
> the cache, lseek() conflict with fsync(). You can see a lot of backends
> are waiting in planning phase in checkpoints, not executing phase.

It isn't clear to me why you are doing planning during a test at all.

If you are doing replanning during test execution then the real
performance problem will be the planning, not the fact that the fsync
stops planning from happening.

Prepared queries are only replanned manually, so the chances of
replanning during a checkpoint are fairly low. So although it sounds
worrying, I'm not sure that we'll want to alter the use of lseek during
planning - though there may be other arguments also.

I have also seen cases where the WAL drive, even when separated, appears
to spike upwards during a checkpoint. My best current theory, so far
untested, is that the WAL and data drives are using the same CFQ
scheduler and that the scheduler actively slows down WAL requests when
it need not. Mounting the drives as separate block drives with separate
schedulers, CFQ for data and Deadline for WAL should help.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com

Re: Load distributed checkpoint

From

Bruce Momjian

Date:

22 December 2006, 14:53:29

I have a new idea.  Rather than increasing write activity as we approach
checkpoint, I think there is an easier solution.  I am very familiar
with the BSD kernel, and it seems they have a similar issue in trying to
smooth writes:http://www.brno.cas.cz/cgi-bin/bsdi-man?proto=1.1&query=update&msection=4&apropos=0UPDATE(4)
    BSD Programmer's Manual                  UPDATE(4)NAME     update - trickle sync filesystem caches to
diskDESCRIPTION    At system boot time, the kernel starts filesys_syncer, process     3.  This process helps protect
theintegrity of disk volumes     by ensuring that volatile cached filesystem data are written     to disk within the
vfs.generic.syncdelayinterval which defaults     to thirty seconds (see sysctl(8)).  When a vnode is first     written
itis placed vfs.generic.syncdelay seconds down on     the trickle sync queue.  If it still exists and has dirty data
when it reaches the top of the queue, filesys_syncer writes     it to disk.  This approach evens out the load on the
underlying    I/O system and avoids writing short-lived files.  The pa- pers     on trickle-sync tend to favor aging
basedon buffers rather     than files.  However, BSD/OS synchronizes on file age rather     than buffer age because the
datastructures are much smaller     as there are typically far fewer files than buffers.  Although     this can make
theI/O bursty when a big file is written to     disk, it is still much better than the wholesale writes that     were
beingdone by the historic update process which wrote     all dirty data buffers every 30 seconds.  It also adapts much
  better to the soft update code which wants to control aging     to improve performance (inodes age in one third of
vfs.generic.syncdelay seconds, directories in one half of     vfs.generic.syncdelay seconds).  This ordering ensures
that    most dependencies are gone (e.g., inodes are written when     directory en- tries want to go to disk) reducing
theamount     of work that the soft up- date code needs to do.
 

I assume other kernels have similar I/O smoothing, so that data sent to
the kernel via write() gets to disk within 30 seconds.  

I assume write() is not our checkpoint performance problem, but the
transfer to disk via fsync().  Perhaps a simple solution is to do the
write()'s of all dirty buffers as we do now at checkpoint time, but
delay 30 seconds and then do fsync() on all the files.  The goal here is
that during the 30-second delay, the kernel will be forcing data to the
disk, so the fsync() we eventually do will only be for the write() of
buffers during the 30-second delay, and because we wrote all dirty
buffers 30 seconds ago, there shouldn't be too many of them.

I think the basic difference between this and the proposed patch is that
we do not put delays in the buffer write() or fsync() phases --- we just
put a delay _between_ the phases, and wait for the kernel to smooth it
out for us.  The kernel certainly knows more about what needs to get to
disk, so it seems logical to let it do the I/O smoothing.

---------------------------------------------------------------------------

Bruce Momjian wrote:
> 
> I have thought a while about this and I have some ideas.
> 
> Ideally, we would be able to trickle the sync of individuals blocks
> during the checkpoint, but we can't because we rely on the kernel to
> sync all dirty blocks that haven't made it to disk using fsync().  We
> could trickle the fsync() calls, but that just extends the amount of
> data we are writing that has been dirtied post-checkpoint.  In an ideal
> world, we would be able to fsync() only part of a file at a time, and
> only those blocks that were dirtied pre-checkpoint, but I don't see that
> happening anytime soon (and one reason why many commercial databases
> bypass the kernel cache).
> 
> So, in the real world, one conclusion seems to be that our existing
> method of tuning the background writer just isn't good enough for the
> average user:
> 
>     #bgwriter_delay = 200ms                 # 10-10000ms between rounds
>     #bgwriter_lru_percent = 1.0             # 0-100% of LRU buffers scanned/round
>     #bgwriter_lru_maxpages = 5              # 0-1000 buffers max written/round
>     #bgwriter_all_percent = 0.333           # 0-100% of all buffers scanned/round
>     #bgwriter_all_maxpages = 5              # 0-1000 buffers max written/round
> 
> These settings control what the bgwriter does, but they do not clearly
> relate to the checkpoint timing, which is the purpose of the bgwriter,
> and they don't change during the checkpoint interval, which is also less
> than ideal.  If set to aggressively, it writes too much, and if too low,
> the checkpoint does too much I/O.
> 
> We clearly need more bgwriter activity as the checkpoint approaches, and
> one that is more auto-tuned, like many of our other parameters.  I think
> we created these settings to see how they worked in the field, so it
> probably time to reevaluate them based on field reports.
> 
> I think the bgwriter should keep track of how far it is to the next
> checkpoint, and use that information to increase write activity. 
> Basically now, during a checkpoint, the bgwriter does a full buffer scan
> and fsync's all dirty files, so it changes from the configuration
> parameter-defined behavior right to 100% activity.  I think it would be
> ideal if we could ramp up the writes so that when it is 95% to the next
> checkpoint, it can be operating at 95% of the activity it would do
> during a checkpoint.
> 
> My guess is if we can do that, we will have much smoother performance
> because we have more WAL writes just after checkpoint for newly-dirtied
> pages, and the new setup will give us more write activity just before
> checkpoint.
> 
> One other idea is for the bgwriter to use O_DIRECT or O_SYNC to avoid
> the kernel cache, so we are sure data will be on disk by checkpoint
> time.  This was avoided in the past because of the expense of
> second-guessing the kernel disk I/O scheduling algorithms.
> 
> ---------------------------------------------------------------------------
> 
> Tom Lane wrote:
> > "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> > > "Jim C. Nasby" <jim@nasby.net> wrote: 
> > >> Generally, I try and configure the all* settings so that you'll get 1
> > >> clock-sweep per checkpoint_timeout. It's worked pretty well, but I don't
> > >> have any actual tests to back that methodology up.
> > 
> > > We got to these numbers somewhat scientifically.  I studied I/O
> > > patterns under production load and figured we should be able to handle
> > > about 800 writes in per 200 ms without causing problems.  I have to
> > > admit that I based the percentages and the ratio between "all" and "lru"
> > > on gut feel after musing over the documentation.
> > 
> > I like Kevin's settings better than what Jim suggests.  If the bgwriter
> > only makes one sweep between checkpoints then it's hardly going to make
> > any impact at all on the number of dirty buffers the checkpoint will
> > have to write.  The point of the bgwriter is to reduce the checkpoint
> > I/O spike by doing writes between checkpoints, and to have any
> > meaningful impact on that, you'll need it to make the cycle several times.
> > 
> > Another point here is that you want checkpoints to be pretty far apart
> > to minimize the WAL load from full-page images.  So again, a bgwriter
> > that's only making one loop per checkpoint is not gonna be doing much.
> > 
> > I wonder whether it would be feasible to teach the bgwriter to get more
> > aggressive as the time for the next checkpoint approaches?  Writes
> > issued early in the interval have a much higher probability of being
> > wasted (because the page gets re-dirtied later).  But maybe that just
> > reduces to what Takahiro-san already suggested, namely that
> > checkpoint-time writes should be done with the same kind of scheduling
> > the bgwriter uses outside checkpoints.  We still have the problem that
> > the real I/O storm is triggered by fsync() not write(), and we don't
> > have a way to spread out the consequences of fsync().
> > 
> >             regards, tom lane
> > 
> > ---------------------------(end of broadcast)---------------------------
> > TIP 9: In versions below 8.0, the planner will ignore your desire to
> >        choose an index scan if your joining column's datatypes do not
> >        match
> 
> -- 
>   Bruce Momjian   bruce@momjian.us
>   EnterpriseDB    http://www.enterprisedb.com
> 
>   + If your life is a hard drive, Christ can be your backup. +
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Have you searched our list archives?
> 
>                http://archives.postgresql.org

--  Bruce Momjian   bruce@momjian.us EnterpriseDB    http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +

Re: Load distributed checkpoint

From

"Inaam Rana"

Date:

22 December 2006, 14:56:16

On 12/22/06, Takayuki Tsunakawa <tsunakawa.takay@jp.fujitsu.com> wrote:

From: Inaam Rana
> Which IO Shceduler (elevator) you are using?

Elevator? Sorry, I'm not familiar with the kernel implementation, so I don't what it is. My Linux distribution is Red Hat Enterprise Linux 4.0 for AMD64/EM64T, and the kernel is 2.6.9-42.ELsmp. I probably havn't changed any kernel settings, except for IPC settings to run PostgreSQL.

There are four IO schedulers in Linux. Anticipatory, CFQ (default), deadline, and noop. For typical OLTP type loads generally deadline is recommended. If you are constrained on CPU and you have a good controller then its better to use noop.
Deadline attempts to merge requests by maintaining two red black trees in sector sort order and it also ensures that a request is serviced in given time by using FIFO. I don't expect it to do the magic but was wondering that it may dilute the issue of fsync() elbowing out WAL writes.

You can look into /sys/block/<device>/queue/scheduler to see which scheduler you are using.

regards,
inaam

--
Inaam Rana
EnterpriseDB http://www.enterprisedb.com

Re: Load distributed checkpoint

From

Gregory Stark

Date:

22 December 2006, 17:14:20

"Bruce Momjian" <bruce@momjian.us> writes:

> I have a new idea.  Rather than increasing write activity as we approach
> checkpoint, I think there is an easier solution.  I am very familiar
> with the BSD kernel, and it seems they have a similar issue in trying to
> smooth writes:

Just to give a bit of context for this. The traditional mechanism for syncing
buffers to disk on BSD which this daemon was a replacement for was to simply
call "sync" every 30s. Compared to that this daemon certainly smooths the I/O
out over the 30s window...

Linux has a more complex solution to this (of course) which has undergone a
few generations over time. Older kernels had a user space daemon called
bdflush which called an undocumented syscall every 5s. More recent ones have a
kernel thread called pdflush. I think both have various mostly undocumented
tuning knobs but neither makes any sort of guarantee about the amount of time
a dirty buffer might live before being synced.

Your thinking is correct but that's already the whole point of bgwriter isn't
it? To get the buffers out to the kernel early in the checkpoint interval so
that come checkpoint time they're hopefully already flushed to disk. As long
as your checkpoint interval is well over 30s only the last 30s (or so, it's a
bit fuzzier on Linux) should still be at risk of being pending.

I think the main problem with an additional pause in the hopes of getting more
buffers synced is that during the 30s pause on a busy system there would be a
continual stream of new dirty buffers being created as bgwriter works and
other backends need to reuse pages. So when the fsync is eventually called
there will still be a large amount of i/o to do. Fundamentally the problem is
that fsync is too blunt an instrument. We only need to fsync the buffers we
care about, not the entire file.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com

Re: Load distributed checkpoint

From

Greg Smith

Date:

22 December 2006, 17:33:50

On Fri, 22 Dec 2006, Simon Riggs wrote:

> I have also seen cases where the WAL drive, even when separated, appears
> to spike upwards during a checkpoint. My best current theory, so far
> untested, is that the WAL and data drives are using the same CFQ
> scheduler and that the scheduler actively slows down WAL requests when
> it need not. Mounting the drives as separate block drives with separate
> schedulers, CFQ for data and Deadline for WAL should help.

The situation I've been seeing is that the database needs a new block to 
complete a query and issues a read request to get it, but that read is 
behind the big checkpoint fsync.  Client sits there for quite some time 
waiting for the fsync to finish before it gets the data it needs, and now 
your trivial select took seconds to complete.  It's fairly easy to 
replicate this problem using pgbench on Linux--I've seen a query sit there 
for 15 seconds when going out of my way to aggrevate the behavior.  One of 
Takayuki's posts here mentioned a worst-case delay of 13 seconds, that's 
the problem rearing its ugly head.

You may be right that what you're seeing would be solved with a more 
complicated tuning on a per-device basis (which, by the way, isn't 
available unless you're running a more recent Linux kernel than most many 
distributions have available).  You can tune the schedulers all day and 
not make a lick of difference to what I've been running into; I know, I 
tried.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Load distributed checkpoint

From

Bruce Momjian

Date:

22 December 2006, 18:34:21

Gregory Stark wrote:
> 
> "Bruce Momjian" <bruce@momjian.us> writes:
> 
> > I have a new idea.  Rather than increasing write activity as we approach
> > checkpoint, I think there is an easier solution.  I am very familiar
> > with the BSD kernel, and it seems they have a similar issue in trying to
> > smooth writes:
> 
> Just to give a bit of context for this. The traditional mechanism for syncing
> buffers to disk on BSD which this daemon was a replacement for was to simply
> call "sync" every 30s. Compared to that this daemon certainly smooths the I/O
> out over the 30s window...
> 
> Linux has a more complex solution to this (of course) which has undergone a
> few generations over time. Older kernels had a user space daemon called
> bdflush which called an undocumented syscall every 5s. More recent ones have a
> kernel thread called pdflush. I think both have various mostly undocumented
> tuning knobs but neither makes any sort of guarantee about the amount of time
> a dirty buffer might live before being synced.
> 
> Your thinking is correct but that's already the whole point of bgwriter isn't
> it? To get the buffers out to the kernel early in the checkpoint interval so
> that come checkpoint time they're hopefully already flushed to disk. As long
> as your checkpoint interval is well over 30s only the last 30s (or so, it's a
> bit fuzzier on Linux) should still be at risk of being pending.
> 
> I think the main problem with an additional pause in the hopes of getting more
> buffers synced is that during the 30s pause on a busy system there would be a
> continual stream of new dirty buffers being created as bgwriter works and
> other backends need to reuse pages. So when the fsync is eventually called
> there will still be a large amount of i/o to do. Fundamentally the problem is
> that fsync is too blunt an instrument. We only need to fsync the buffers we
> care about, not the entire file.

Well, one idea would be for the bgwriter not to do many write()'s
between the massive checkpoint write()'s and the fsync()'s.  That would
cut down on the extra I/O that fsync() would have to do.

The problem I see with making the bgwriter do more writes between
checkpoints is that overhead of those scans, and the overhead of doing
write's that will later be dirtied before the checkpoint.  With the
delay between stages idea, we don't need to guess how agressive the
bgwriter needs to be --- we can just do the writes, and wait for a
while.

On an idle system, would someone dirty a large file, and watch the disk
I/O to see how long it takes for the I/O to complete to disk?

In what we have now, we are either having the bgwriter do too much I/O
between checkpoints, or guaranteeing an I/O storm during a checkpoint by
doing lots of write()'s and then calling fsync() right away.  I don't
see how we are ever going to get that properly tuned.

Would someone code up a patch and test it?

--  Bruce Momjian   bruce@momjian.us EnterpriseDB    http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +

Re: Load distributed checkpoint

From

"Takayuki Tsunakawa"

Date:

25 December 2006, 23:06:56

Hello, Inaam-san,

> There are four IO schedulers in Linux. Anticipatory, CFQ (default), deadline, and noop. For typical OLTP type loads generally deadline is recommended. If you are constrained on CPU and you have a good controller then its better to use noop.
> Deadline attempts to merge requests by maintaining two red black trees in sector sort order and it also ensures that a request is serviced in given time by using FIFO. I don't expect it to do the magic but was wondering that it may dilute the issue of fsync() elbowing out WAL writes.
> You can look into /sys/block/<device>/queue/scheduler to see which scheduler you are using.

Thank you for your information.
I could only find the following files in /sys/block/<device>/queue/:

iosched
max_hw_sectors_kb
max_sectors_kb
nr_requests
read_ahead_kb

In iosched, the following files exist:

quantum (the content is "4")
queued (the content is "8")

Re: Load distributed checkpoint

From

"Takayuki Tsunakawa"

Date:

25 December 2006, 23:13:45

From: "Bruce Momjian" <bruce@momjian.us>
> On an idle system, would someone dirty a large file, and watch the
disk
> I/O to see how long it takes for the I/O to complete to disk?

I ran "dd if=/dev/zero of=<file on DB disk> bs=8k count=`expr 1048576
/ 8`, that is, writing 1GB file with 8KB write()'s.  It took about 175
seconds for the kernel to flush buffers.  I'll put the result of
"iostat -x 5" which was started at the same time as dd.  175 seconds
means 35 rows with nonzero %util * 5 seconds of interval.

Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s
avgrq-sz avgqu-sz   await  svctm  %util
sde          0.00 981.56  0.00 50.90    0.00 8439.28     0.00  4219.64
165.80    34.38  525.01   6.25  31.82
sde          0.00 25737.00  0.00 118.60    0.00 219820.80     0.00
109910.40  1853.46  6529.74 1587.08   8.43 100.02
sde          0.00 1912.97  0.00 127.74    0.00 16325.75     0.00
8162.87   127.80  8192.67 6502.98   7.81  99.82
sde          0.00 1728.00  0.00 117.80    0.00 14745.60     0.00
7372.80   125.17  8209.36 11227.29   8.49 100.02
sde          0.00 1536.00  0.00 103.80    0.00 13107.20     0.00
6553.60   126.27  8209.10 16729.09   9.64 100.02
sde          0.00 1344.00  0.00 91.60    0.00 11468.80     0.00
5734.40   125.21  8208.69 21481.22  10.92 100.02
sde          0.00 1532.80  0.00 98.00    0.00 13081.60     0.00
6540.80   133.49  8209.34 26583.57  10.21 100.02
sde          0.00 1632.00  0.00 107.00    0.00 13926.40     0.00
6963.20   130.15  8208.89 31662.93   9.35 100.02
sde          0.00 1536.00  0.00 106.40    0.00 13107.20     0.00
6553.60   123.19  8209.66 36443.07   9.40 100.02
sde          0.00 1442.89  0.00 95.59    0.00 12312.63     0.00
6156.31   128.81  8227.23 41446.78  10.49 100.24
sde          0.00 1532.80  0.00 103.00    0.00 13081.60     0.00
6540.80   127.01  8210.77 46606.03   9.71 100.04
sde          0.00 1440.00  0.00 92.00    0.00 12288.00     0.00
6144.00   133.57  8208.82 51421.78  10.87 100.02
sde          0.00 1344.00  0.00 91.80    0.00 11468.80     0.00
5734.40   124.93  8209.86 56524.37  10.90 100.02
sde          0.00 1539.08  0.00 101.00    0.00 13133.47     0.00
6566.73   130.03  8225.59 61477.93   9.92 100.22
sde          0.00 1436.80  0.00 95.40    0.00 12262.40     0.00
6131.20   128.54  8208.88 66566.42  10.48 100.02
sde          0.00 1344.00  0.00 92.40    0.00 11468.80     0.00
5734.40   124.12  8209.47 71466.12  10.82 100.02
sde          0.00 1102.40  0.00 101.00    0.00 9408.38     0.00
4704.19    93.15  8174.36 76538.41   9.88  99.82
sde          0.00   0.00  0.00 89.00    0.00    0.00     0.00     0.00
0.00  7855.72 80795.64  11.24 100.02
sde          0.00   0.00  0.00 91.38    0.00    0.00     0.00     0.00
0.00  7422.53 81823.89  10.97 100.22
sde          0.00   0.00  0.00 96.80    0.00    0.00     0.00     0.00
0.00  6919.07 83194.91  10.33 100.02
sde          0.00   0.00  0.00 91.78    0.00    0.00     0.00     0.00
0.00  6480.85 84657.04  10.92 100.22
sde          0.00   0.00  0.00 92.60    0.00    0.00     0.00     0.00
0.00  5997.22 84749.79  10.80 100.02
sde          0.00   0.00  0.00 91.02    0.00    0.00     0.00     0.00
0.00  5528.97 85345.07  10.97  99.82
sde          0.00   0.00  0.00 102.61    0.00    0.00     0.00
0.00     0.00  5059.61 85057.91   9.77 100.22
sde          0.00   0.00  0.00 93.20    0.00    0.00     0.00     0.00
0.00  4572.57 85284.49  10.73 100.02
sde          0.00   0.00  0.00 98.20    0.00    0.00     0.00     0.00
0.00  4109.15 86086.50  10.21 100.22
sde          0.00   0.00  0.00 91.42    0.00    0.00     0.00     0.00
0.00  3611.72 86405.24  10.92  99.82
sde          0.00   0.00  0.00 100.00    0.00    0.00     0.00
0.00     0.00  3135.62 86292.49  10.00 100.02
sde          0.00   0.00  0.00 100.40    0.00    0.00     0.00
0.00     0.00  2652.63 86609.79   9.96 100.02
sde          0.00   0.00  0.00 92.80    0.00    0.00     0.00     0.00
0.00  2153.69 86168.58  10.78 100.02
sde          0.00   0.00  0.00 88.80    0.00    0.00     0.00     0.00
0.00  1694.74 86275.58  11.26 100.02
sde          0.00   0.00  0.00 98.20    0.00    0.00     0.00     0.00
0.00  1241.41 86708.40  10.19 100.02
sde          0.00   0.00  0.00 95.20    0.00    0.00     0.00     0.00
0.00   747.52 86505.59  10.51 100.02
sde          0.00   0.00  0.00 89.20    0.00    0.00     0.00     0.00
0.00   283.44 86551.11  11.21 100.02
sde          0.00  51.20  0.00 17.20    0.00  449.60     0.00   224.80
26.14     4.30 61572.65   9.05  15.56
sde          0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00
0.00     0.00    0.00   0.00   0.00
sde          0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00
0.00     0.00    0.00   0.00   0.00
sde          0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00
0.00     0.00    0.00   0.00   0.00
sde          0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00
0.00     0.00    0.00   0.00   0.00
sde          0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00
0.00     0.00    0.00   0.00   0.00

Re: Load distributed checkpoint

From

ITAGAKI Takahiro

Date:

26 December 2006, 05:58:36

Bruce Momjian <bruce@momjian.us> wrote:

> I assume write() is not our checkpoint performance problem, but the
> transfer to disk via fsync().  Perhaps a simple solution is to do the
> write()'s of all dirty buffers as we do now at checkpoint time, but
> delay 30 seconds and then do fsync() on all the files.

I think there are two platforms that have different problems in checkpoints.
It's in fsync() on one platform, and in write() on another. It is complex
depending on OS, the amount of memory, disks, writeback-cache and so on.

> I think the basic difference between this and the proposed patch is that
> we do not put delays in the buffer write() or fsync() phases --- we just
> put a delay _between_ the phases, and wait for the kernel to smooth it
> out for us.  The kernel certainly knows more about what needs to get to
> disk, so it seems logical to let it do the I/O smoothing.

Both proposals do not conflict each other. Also, solutions for either
platform do not have bad effect on the other platform. Can we employ
both of them?

I tested your proposal but it did not work on write-critical machine.
However, if the idea works well on BSD or some platforms, we would be
better off buying it.

[pgbench results]
...
566.973777
327.158222 <- (1) write()
560.773868 <- (2) sleep
544.106645 <- (3) fsync()
...

[changes in codes] (This is a bad implementation because shutdown takes long time!)
void
FlushBufferPool(void)
{   BufferSync();                       // (1) write -- about 20s
   time_t start = time(NULL);   while (time(NULL) - start < 30)     // (2) sleep -- 30s   {
pg_usleep(BgWriterDelay* 1000L);       BgBufferSync();       AbsorbFsyncRequests();   } 
   smgrsync();                         // (3) fsync -- less than 200ms
}

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Re: Load distributed checkpoint

From

Bruce Momjian

Date:

27 December 2006, 00:10:24

ITAGAKI Takahiro wrote:
> 
> Bruce Momjian <bruce@momjian.us> wrote:
> 
> > I assume write() is not our checkpoint performance problem, but the
> > transfer to disk via fsync().  Perhaps a simple solution is to do the
> > write()'s of all dirty buffers as we do now at checkpoint time, but
> > delay 30 seconds and then do fsync() on all the files.
> 
> I think there are two platforms that have different problems in checkpoints.
> It's in fsync() on one platform, and in write() on another. It is complex
> depending on OS, the amount of memory, disks, writeback-cache and so on.
> 
> > I think the basic difference between this and the proposed patch is that
> > we do not put delays in the buffer write() or fsync() phases --- we just
> > put a delay _between_ the phases, and wait for the kernel to smooth it
> > out for us.  The kernel certainly knows more about what needs to get to
> > disk, so it seems logical to let it do the I/O smoothing.
> 
> Both proposals do not conflict each other. Also, solutions for either
> platform do not have bad effect on the other platform. Can we employ
> both of them?
> 
> I tested your proposal but it did not work on write-critical machine.
> However, if the idea works well on BSD or some platforms, we would be
> better off buying it.
> 
> [pgbench results]
> ...
> 566.973777
> 327.158222 <- (1) write()
> 560.773868 <- (2) sleep
> 544.106645 <- (3) fsync()

OK, so you are saying that performance dropped only during the write(),
and not during the fsync()?  Interesting.  I would like to know the
results of a few tests just like you reported them above:1a) write spread out over 30 seconds1b) write with no delay2a)
sleep(0)2b)sleep(30) 3) fsync
 

I would like to know the performance at each stage for each combination,
e.g. when using 1b, 2a, 3, performance during the write() phase was X,
during the sleep it was Y, and during the fsync it was Z. (Of course,
sleep(0) has no stage timing.)

--  Bruce Momjian   bruce@momjian.us EnterpriseDB    http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +

Re: Load distributed checkpoint

From

"Simon Riggs"

Date:

27 December 2006, 17:59:03

On Fri, 2006-12-22 at 13:53 -0500, Bruce Momjian wrote:

> I assume other kernels have similar I/O smoothing, so that data sent to
> the kernel via write() gets to disk within 30 seconds.  
> 
> I assume write() is not our checkpoint performance problem, but the
> transfer to disk via fsync().  

Well, its correct to say that the transfer to disk is the source of the
problem, but that doesn't only occur when we fsync(). There are actually
two disk storms that occur, because of the way the fs cache works. [Ron
referred to this effect uplist]

Linux 2.6+ will attempt to write to disk any dirty blocks in excess of a
certain threshold, the dirty _background_ratio, which defaults to 10% of
RAM. So when the checkpoint issues lots of write() calls, we generally
exceed the threshold and then begin io storm number 1 to get us back
down to the dirty_background_ratio. When we issue the fsync() calls, we
then begin io storm number 2, which takes us down close to zero dirty
blocks (on a dedicated server). [Thanks to my colleague Richard Kennedy
for confirming this via investigation; he's been on leave throughout
this discussion, regrettably].

Putting delays in is very simple and does help, however, how much it
helps depends upon:
- the number of dirty blocks in shared_buffers 
- the dirty_background_ratio
- the number of dirty blocks in each file when we fsync()

For example, on a system with a very large RAM, and yet a medium write
workload, the dirty_background_ratio may never be exceeded. In that
case, all of the I/O happens during storm 2, so . If you set
dirty_background_ratio lower, then most of the writes happen during
storm 1.

During storm 2, the fsync calls write all dirty blocks in a file to
disk. In many cases, a few tables/files have all of the writes, so
adding a delay between fsyncs doesn't spread out the writes like you
would hope it would.

Most of the time, storm 1 and storm 2 run together in a continuous
stream, but sometimes you see a double peak. There is an overlap of a
few seconds between 1 and 2 in many cases.

Linux will also write blocks to disk after a period of inactivity,
dirty_expire_centisecs which by default is 30 seconds. So putting a
delay between storm1 and storm2 should help matters somewhat, but 30
secs is probably almost exactly the wrong number (by chance), though I
do like the idea.

> Perhaps a simple solution is to do the
> write()'s of all dirty buffers as we do now at checkpoint time, but
> delay 30 seconds and then do fsync() on all the files. 

So yes, putting a short delay, say 10 seconds, in at that point should
help matters somewhat, sometimes. (But the exact number depends upon how
the OS is tuned.)

>  The goal here is
> that during the 30-second delay, the kernel will be forcing data to the
> disk, so the fsync() we eventually do will only be for the write() of
> buffers during the 30-second delay, and because we wrote all dirty
> buffers 30 seconds ago, there shouldn't be too many of them.

...but not for that reason.

IMHO the best thing to do is to
1. put a short delay between the write() steps in the checkpoint
2. put a longer delay in between the write() phase and the fsync() phase
3. tune the OS writeback mechanism to help smoothing
Either set (1) and (2) as GUCs, or have code that reads the OS settings
and acts accordingly.

Or alternatively put, both Bruce and Itagaki-san have good ideas.

> > So, in the real world, one conclusion seems to be that our existing
> > method of tuning the background writer just isn't good enough for the
> > average user:
> > 
> >     #bgwriter_delay = 200ms                 # 10-10000ms between rounds
> >     #bgwriter_lru_percent = 1.0             # 0-100% of LRU buffers scanned/round
> >     #bgwriter_lru_maxpages = 5              # 0-1000 buffers max written/round
> >     #bgwriter_all_percent = 0.333           # 0-100% of all buffers scanned/round
> >     #bgwriter_all_maxpages = 5              # 0-1000 buffers max written/round
> > 
> > These settings control what the bgwriter does, but they do not clearly
> > relate to the checkpoint timing, which is the purpose of the bgwriter,
> > and they don't change during the checkpoint interval, which is also less
> > than ideal.  If set to aggressively, it writes too much, and if too low,
> > the checkpoint does too much I/O.

Yes, that's very clear.
> > We clearly need more bgwriter activity as the checkpoint approaches

I'd put it that we should write a block to disk prior to checkpoint if
it appears that it won't be dirtied again if we do so. That doesn't
necessarily translate directly into *more* activity.

> , and
> > one that is more auto-tuned, like many of our other parameters.  I think
> > we created these settings to see how they worked in the field, so it
> > probably time to reevaluate them based on field reports.
> > 
> > I think the bgwriter should keep track of how far it is to the next
> > checkpoint, and use that information to increase write activity. 
> > Basically now, during a checkpoint, the bgwriter does a full buffer scan
> > and fsync's all dirty files, so it changes from the configuration
> > parameter-defined behavior right to 100% activity.  I think it would be
> > ideal if we could ramp up the writes so that when it is 95% to the next
> > checkpoint, it can be operating at 95% of the activity it would do
> > during a checkpoint.
> > 
> > My guess is if we can do that, we will have much smoother performance
> > because we have more WAL writes just after checkpoint for newly-dirtied
> > pages, and the new setup will give us more write activity just before
> > checkpoint.

Well, as long as the kernel ignores Postgres and Postgres ignores the
kernel, things will never be smooth (literally). If we write more, but
are still below the dirty_background_ratio, it won't make the slightest
bit of difference. 

Trying to get trustworthy/explicable performance test reports is already
difficult for this reason.

IMHO, the best approach would be one that takes into how the OS behaves,
so we can work with it. Regrettably, I can't see any way of doing this
other than OS-specific code of some shape or form.
> > One other idea is for the bgwriter to use O_DIRECT or O_SYNC to avoid
> > the kernel cache, so we are sure data will be on disk by checkpoint
> > time.  This was avoided in the past because of the expense of
> > second-guessing the kernel disk I/O scheduling algorithms.

Seems like a longer term best approach to me.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com

Re: Load distributed checkpoint

From

"Simon Riggs"

Date:

27 December 2006, 18:08:54

On Mon, 2006-12-18 at 14:47 +0900, Takayuki Tsunakawa wrote: 
> Hello, Itagaki-san, all
> 
> Sorry for my long mail.  I've had trouble in sending this mail because
> it's too long for pgsql-hackers to accept (I couldn't find how large
> mail is accepted.)  So I'm trying to send several times.
> Please see the attachment for the content.

Your results for fsync are interesting.

I've noticed that a checkpoint seems to increase the activity on the WAL
drive as well as increasing I/O wait times. That doesn't correspond to
any real increase in WAL traffic I'm aware of.

Have you tried setting deadline scheduler on the WAL device and CFQ on
the data device? That should allow the I/Os to move through different
queues and prevent interference.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com

Re: Load distributed checkpoint

From

Martijn van Oosterhout

Date:

27 December 2006, 18:27:42

On Wed, Dec 27, 2006 at 09:24:06PM +0000, Simon Riggs wrote:
> On Fri, 2006-12-22 at 13:53 -0500, Bruce Momjian wrote:
>
> > I assume other kernels have similar I/O smoothing, so that data sent to
> > the kernel via write() gets to disk within 30 seconds.
> >
> > I assume write() is not our checkpoint performance problem, but the
> > transfer to disk via fsync().
>
> Well, its correct to say that the transfer to disk is the source of the
> problem, but that doesn't only occur when we fsync(). There are actually
> two disk storms that occur, because of the way the fs cache works. [Ron
> referred to this effect uplist]

As someone looking from the outside:

fsync only works on one file, so presumably the checkpoint process is
opening each file one by one and fsyncing them. Does that make any
difference here? Could you adjust the timing here?

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: Load distributed checkpoint

From

"Simon Riggs"

Date:

27 December 2006, 18:55:55

On Wed, 2006-12-27 at 23:26 +0100, Martijn van Oosterhout wrote:
> On Wed, Dec 27, 2006 at 09:24:06PM +0000, Simon Riggs wrote:
> > On Fri, 2006-12-22 at 13:53 -0500, Bruce Momjian wrote:
> > 
> > > I assume other kernels have similar I/O smoothing, so that data sent to
> > > the kernel via write() gets to disk within 30 seconds.  
> > > 
> > > I assume write() is not our checkpoint performance problem, but the
> > > transfer to disk via fsync().  
> > 
> > Well, its correct to say that the transfer to disk is the source of the
> > problem, but that doesn't only occur when we fsync(). There are actually
> > two disk storms that occur, because of the way the fs cache works. [Ron
> > referred to this effect uplist]
> 
> As someone looking from the outside:
> 
> fsync only works on one file, so presumably the checkpoint process is
> opening each file one by one and fsyncing them. 

Yes

> Does that make any
> difference here? Could you adjust the timing here?

Thats the hard bit with io storm 2. When you fsync a file you don't
actually know how many blocks you're writing, plus there's no way to
slow down those writes by putting delays between them (although its
possible your controller might know how to do this, I've never heard of
one that does).

If we put a delay after each fsync, that will space out all the ones
that don't need spacing out and do nothing to the ones that most need
it. Unfortunately.

IMHO there isn't any simple scheme that works all the time, for all OS
settings, default configurations and mechanisms.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com

Re: Load distributed checkpoint

From

ITAGAKI Takahiro

Date:

27 December 2006, 21:12:00

Bruce Momjian <bruce@momjian.us> wrote:

> > 566.973777
> > 327.158222 <- (1) write()
> > 560.773868 <- (2) sleep
> > 544.106645 <- (3) fsync()
> 
> OK, so you are saying that performance dropped only during the write(),
> and not during the fsync()?  Interesting.

Almost yes, but there is a small drop in fsync. (560->540)

> I would like to know the
> results of a few tests just like you reported them above:
>     
>     1a) write spread out over 30 seconds
>     1b) write with no delay
>     
>     2a) sleep(0)
>     2b) sleep(30)
>     
>      3) fsync
> 
> I would like to know the performance at each stage for each combination,
> e.g. when using 1b, 2a, 3, performance during the write() phase was X,
> during the sleep it was Y, and during the fsync it was Z. (Of course,
> sleep(0) has no stage timing.)

I'm thinking about generalizing your idea; Adding three parameters
(checkpoint_write, checkpoint_naptime and checkpoint_fsync)
to control sleeps in each stage.

1) write() spread out over 'checkpoint_write' seconds.
2) sleep 'checkpoint_naptime' seconds between write() and fsync().
3) fsync() spread out over 'checkpoint_fsync' seconds.

If three parameter are all zero, checkpoints behave as the same as now.
If checkpoint_write = checkpoint_timeout and other two are zero,
it is just like my proposal before.

As you might expect, I intend the above only for development purpose.
Additinal three parameters are hard to use for users. If we can pull out
some proper values from the tests, we'd better to set those values as
default. I assume we can derive them from existing checkpoint_timeout.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Re: Load distributed checkpoint

From

Ron Mayer

Date:

28 December 2006, 05:19:53

Gregory Stark wrote:
> "Bruce Momjian" <bruce@momjian.us> writes:
> 
>> I have a new idea.  ...the BSD kernel...similar issue...to smooth writes:
> Linux has a more complex solution to this (of course) which has undergone a
> few generations over time. Older kernels had a user space daemon called
> bdflush which called an undocumented syscall every 5s. More recent ones have a
> kernel thread called pdflush. I think both have various mostly undocumented
> tuning knobs but neither makes any sort of guarantee about the amount of time
> a dirty buffer might live before being synced.

Earlier in this thread (around the 7th) was a discussion of
/proc/sys/vm/dirty_expire_centisecs and /proc/vm/dirty_writeback_centisecs
which seem to be the tunables that matter here.  Googling suggests that
 dirty_expire_centisecs specifies that data which has been dirty in memory for longer than this interval will be
writtenout next time a pdflush daemon wakes up"

and
 dirty_writeback_centisecs "expresses the interval between those wakeups"

It seems to me that the sum of the two times does determine the maximum
time before the kernel will start syncing a dirtied page.

Bottom line, though is that it seems both postgresql and the OS's are
trying to delay writes in the hopes of collapsing them; and that the
actual delay is the sum of the OS's delay and postgresql's delay.  I
think Kevin Grittner's experimentation earlier in the thread did indeed
suggest that getting writes to the OS faster and let it handle the
collapsing of the writes was an effective method of reducing painful
checkpoints.

Re: Load distributed checkpoint

From

"Jim C. Nasby"

Date:

28 December 2006, 07:36:07

On Wed, Dec 27, 2006 at 10:54:57PM +0000, Simon Riggs wrote:
> On Wed, 2006-12-27 at 23:26 +0100, Martijn van Oosterhout wrote:
> > On Wed, Dec 27, 2006 at 09:24:06PM +0000, Simon Riggs wrote:
> > > On Fri, 2006-12-22 at 13:53 -0500, Bruce Momjian wrote:
> > > 
> > > > I assume other kernels have similar I/O smoothing, so that data sent to
> > > > the kernel via write() gets to disk within 30 seconds.  
> > > > 
> > > > I assume write() is not our checkpoint performance problem, but the
> > > > transfer to disk via fsync().  
> > > 
> > > Well, its correct to say that the transfer to disk is the source of the
> > > problem, but that doesn't only occur when we fsync(). There are actually
> > > two disk storms that occur, because of the way the fs cache works. [Ron
> > > referred to this effect uplist]
> > 
> > As someone looking from the outside:
> > 
> > fsync only works on one file, so presumably the checkpoint process is
> > opening each file one by one and fsyncing them. 
> 
> Yes
> 
> > Does that make any
> > difference here? Could you adjust the timing here?
> 
> Thats the hard bit with io storm 2. When you fsync a file you don't
> actually know how many blocks you're writing, plus there's no way to
> slow down those writes by putting delays between them (although its
> possible your controller might know how to do this, I've never heard of
> one that does).

Any controller that sophisticated would likely also have a BBU and write
caching, which should greatly reduce the impact of at least the fsync
storm... unless you fill the cache. I suspect we might need a way to
control how much data we try and push out at a time to avoid that...

As for settings, I really like the simplicity of the Oracle system...
"Just try to ensure recovery takes about X amount of seconds". I like
the idea of a creeping checkpoint even more; only writing a buffer out
when we need to checkpoint it makes a lot more sense to me than trying
to guess when we'll next dirty a buffer. Such a system would probably
also be a lot easier to tune than the current bgwriter, even if we
couldn't simplify it all the way to "seconds for recovery".
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: Load distributed checkpoint

From

Bruce Momjian

Date:

28 December 2006, 13:50:32

ITAGAKI Takahiro wrote:
> 
> Bruce Momjian <bruce@momjian.us> wrote:
> 
> > > 566.973777
> > > 327.158222 <- (1) write()
> > > 560.773868 <- (2) sleep
> > > 544.106645 <- (3) fsync()
> > 
> > OK, so you are saying that performance dropped only during the write(),
> > and not during the fsync()?  Interesting.
> 
> Almost yes, but there is a small drop in fsync. (560->540)
> 
> 
> > I would like to know the
> > results of a few tests just like you reported them above:
> >     
> >     1a) write spread out over 30 seconds
> >     1b) write with no delay
> >     
> >     2a) sleep(0)
> >     2b) sleep(30)
> >     
> >      3) fsync
> > 
> > I would like to know the performance at each stage for each combination,
> > e.g. when using 1b, 2a, 3, performance during the write() phase was X,
> > during the sleep it was Y, and during the fsync it was Z. (Of course,
> > sleep(0) has no stage timing.)
> 
> I'm thinking about generalizing your idea; Adding three parameters
> (checkpoint_write, checkpoint_naptime and checkpoint_fsync)
> to control sleeps in each stage.
> 
> 1) write() spread out over 'checkpoint_write' seconds.
> 2) sleep 'checkpoint_naptime' seconds between write() and fsync().
> 3) fsync() spread out over 'checkpoint_fsync' seconds.
> 
> If three parameter are all zero, checkpoints behave as the same as now.
> If checkpoint_write = checkpoint_timeout and other two are zero,
> it is just like my proposal before.
> 
> 
> As you might expect, I intend the above only for development purpose.
> Additinal three parameters are hard to use for users. If we can pull out
> some proper values from the tests, we'd better to set those values as
> default. I assume we can derive them from existing checkpoint_timeout.

Great idea, though I wouldn't bother with checkpoint_fsync.  I think
Simon's previous email spelled out the problems of trying to delay
fsyncs() --- in most cases, there will be one file with most of the I/O,
and that fsync is going to be the flood.  Basically, I think the
variability of table access is too great for the fsync delay to ever be
tunable by users.

To summarize, if we could have fsync() only write the dirty buffers that
happened as part of the checkpoint, we could delay the write() for the
entire time between checkpoints, but we can't do that, so we have to
make it user-tunable.

--  Bruce Momjian   bruce@momjian.us EnterpriseDB    http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +

Re: Load distributed checkpoint

From

"Jim C. Nasby"

Date:

28 December 2006, 13:55:15

On Thu, Dec 28, 2006 at 12:50:19PM -0500, Bruce Momjian wrote:
> To summarize, if we could have fsync() only write the dirty buffers that
> happened as part of the checkpoint, we could delay the write() for the
> entire time between checkpoints, but we can't do that, so we have to
> make it user-tunable.

What about the mmap/msync(?)/munmap idea someone mentioned?
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: Load distributed checkpoint

From

Bruce Momjian

Date:

28 December 2006, 14:03:59

Jim C. Nasby wrote:
> On Thu, Dec 28, 2006 at 12:50:19PM -0500, Bruce Momjian wrote:
> > To summarize, if we could have fsync() only write the dirty buffers that
> > happened as part of the checkpoint, we could delay the write() for the
> > entire time between checkpoints, but we can't do that, so we have to
> > make it user-tunable.
> 
> What about the mmap/msync(?)/munmap idea someone mentioned?

I see that as similar to using O_DIRECT during checkpoint, which had
poor performance.

--  Bruce Momjian   bruce@momjian.us EnterpriseDB    http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +

Re: Load distributed checkpoint

From

Tom Lane

Date:

28 December 2006, 14:49:29

Bruce Momjian <bruce@momjian.us> writes:
> Jim C. Nasby wrote:
>> What about the mmap/msync(?)/munmap idea someone mentioned?

> I see that as similar to using O_DIRECT during checkpoint, which had
> poor performance.

That's a complete nonstarter on portability grounds, even if msync gave
us the desired semantics, which it doesn't.  It's no better than fsync
for our purposes.

To my mind the problem with fsync is not that it gives us too little
control but that it gives too much: we have to specify a particular
order of writing out files.  What we'd really like is a version of
sync(2) that tells us when it's done but doesn't constrain the I/O
scheduler's choices at all.  Unfortunately there's no such API ...
        regards, tom lane

Re: Load distributed checkpoint

From

Bruce Momjian

Date:

28 December 2006, 16:56:10

Tom Lane wrote:
> Bruce Momjian <bruce@momjian.us> writes:
> > Jim C. Nasby wrote:
> >> What about the mmap/msync(?)/munmap idea someone mentioned?
> 
> > I see that as similar to using O_DIRECT during checkpoint, which had
> > poor performance.
> 
> That's a complete nonstarter on portability grounds, even if msync gave
> us the desired semantics, which it doesn't.  It's no better than fsync
> for our purposes.
> 
> To my mind the problem with fsync is not that it gives us too little
> control but that it gives too much: we have to specify a particular
> order of writing out files.  What we'd really like is a version of
> sync(2) that tells us when it's done but doesn't constrain the I/O
> scheduler's choices at all.  Unfortunately there's no such API ...

Yea, we used to use sync() but that did all files, not just the
PostgreSQL ones.

--  Bruce Momjian   bruce@momjian.us EnterpriseDB    http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +

Re: Load distributed checkpoint

From

Heikki Linnakangas

Date:

28 December 2006, 17:29:07

Tom Lane wrote:
> To my mind the problem with fsync is not that it gives us too little
> control but that it gives too much: we have to specify a particular
> order of writing out files.  What we'd really like is a version of
> sync(2) that tells us when it's done but doesn't constrain the I/O
> scheduler's choices at all.  Unfortunately there's no such API ...

The problem I see with fsync is that it causes an immediate I/O storm as 
the OS tries to flush everything out as quickly as possible. But we're 
not in a hurry. What we'd need is a lazy fsync, that would tell the 
operating system "let me know when all these dirty buffers are written 
to disk, but I'm not in a hurry, take your time". It wouldn't change the 
scheduling of the writes, just inform the caller when they're done.

If we wanted more precise control of the flushing, we could use 
sync_file_range on Linux, but that's not portable. Nevertheless, I think  it would be OK to have an ifdef and use it on
platformsthat support 

it, if it gave a benefit.

As a side note, with full_page_writes on, a checkpoint wouldn't actually 
need to fsync those pages that have been written to WAL after the 
checkpoint started. Doesn't make much difference in most cases, but we 
could take that into account if we start taking more control of the 
flushing.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com

Re: Load distributed checkpoint

From

"Jim C. Nasby"

Date:

29 December 2006, 16:28:48

On Thu, Dec 28, 2006 at 09:28:48PM +0000, Heikki Linnakangas wrote:
> Tom Lane wrote:
> >To my mind the problem with fsync is not that it gives us too little
> >control but that it gives too much: we have to specify a particular
> >order of writing out files.  What we'd really like is a version of
> >sync(2) that tells us when it's done but doesn't constrain the I/O
> >scheduler's choices at all.  Unfortunately there's no such API ...
> 
> The problem I see with fsync is that it causes an immediate I/O storm as 
> the OS tries to flush everything out as quickly as possible. But we're 
> not in a hurry. What we'd need is a lazy fsync, that would tell the 
> operating system "let me know when all these dirty buffers are written 
> to disk, but I'm not in a hurry, take your time". It wouldn't change the 
> scheduling of the writes, just inform the caller when they're done.
> 
> If we wanted more precise control of the flushing, we could use 
> sync_file_range on Linux, but that's not portable. Nevertheless, I think 
>  it would be OK to have an ifdef and use it on platforms that support 
> it, if it gave a benefit.
I believe there's something similar for OS X as well. The question is:
would it be better to do that, or to just delay calling fsync until the
OS has had a chance to write things out.

> As a side note, with full_page_writes on, a checkpoint wouldn't actually 
> need to fsync those pages that have been written to WAL after the 
> checkpoint started. Doesn't make much difference in most cases, but we 
> could take that into account if we start taking more control of the 
> flushing.

Hrm, interesting point, but I suspect the window involved there is too
small to be worth worrying about.
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: Load distributed checkpoint

From

Tom Lane

Date:

29 December 2006, 17:43:32

"Jim C. Nasby" <jim@nasby.net> writes:
> I believe there's something similar for OS X as well. The question is:
> would it be better to do that, or to just delay calling fsync until the
> OS has had a chance to write things out.

A delay is not going to help unless you can suppress additional writes
to the file, which I don't think you can unless there's very little
going on in the database --- dirty buffers have to get written to make
room for other pages, checkpoint in progress or no.
        regards, tom lane

Re: Load distributed checkpoint

From

Bruce Momjian

Date:

29 December 2006, 22:02:22

Tom Lane wrote:
> "Jim C. Nasby" <jim@nasby.net> writes:
> > I believe there's something similar for OS X as well. The question is:
> > would it be better to do that, or to just delay calling fsync until the
> > OS has had a chance to write things out.
> 
> A delay is not going to help unless you can suppress additional writes
> to the file, which I don't think you can unless there's very little
> going on in the database --- dirty buffers have to get written to make
> room for other pages, checkpoint in progress or no.

I am afraid a delay between write and fsync is the only portable option
we have right now --- there is hope that since the check point write, we
will not have a huge number of dirty buffers at the start of the
checkpoint that need to be written out.

--  Bruce Momjian   bruce@momjian.us EnterpriseDB    http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +

Re: Load distributed checkpoint

From

"Jim C. Nasby"

Date:

30 December 2006, 00:08:20

On Fri, Dec 29, 2006 at 09:02:11PM -0500, Bruce Momjian wrote:
> Tom Lane wrote:
> > "Jim C. Nasby" <jim@nasby.net> writes:
> > > I believe there's something similar for OS X as well. The question is:
> > > would it be better to do that, or to just delay calling fsync until the
> > > OS has had a chance to write things out.
> > 
> > A delay is not going to help unless you can suppress additional writes
> > to the file, which I don't think you can unless there's very little
> > going on in the database --- dirty buffers have to get written to make
> > room for other pages, checkpoint in progress or no.
> 
> I am afraid a delay between write and fsync is the only portable option
> we have right now --- there is hope that since the check point write, we
> will not have a huge number of dirty buffers at the start of the
> checkpoint that need to be written out.

We could potentially control that behavior, too. For example, trying to
suppress writes to a given file when we're getting ready to checkpoint
it (of course, separating checkpointing into a file-by-file operation is
a non-trivial change, but it should be possible).
-- 
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: Load distributed checkpoint

From

"Zeugswetter Andreas ADI SD"

Date:

03 January 2007, 07:18:56

> > I believe there's something similar for OS X as well. The question
is:
> > would it be better to do that, or to just delay calling fsync until
the
> > OS has had a chance to write things out.
>
> A delay is not going to help unless you can suppress additional writes
> to the file, which I don't think you can unless there's very little
> going on in the database --- dirty buffers have to get written to make
> room for other pages, checkpoint in progress or no.

But checkpoint first writes all dirty pages, so we have more than
average
pages that can be replaced without a write. Thus we have a window where
we can wait until dirty pages have to be replaced again. Since the
bgwriter
is sleeping until fsync, only pages that have to be replaced will be
written.
Question is, how do we time that window.

Andreas

Re: Load distributed checkpoint

From

"Takayuki Tsunakawa"

Date:

08 January 2007, 22:03:51

Happy new year

From: "Simon Riggs" <simon@2ndquadrant.com>
> Have you tried setting deadline scheduler on the WAL device and CFQ
on
> the data device? That should allow the I/Os to move through
different
> queues and prevent interference.

No, I've not tried yet.  Inaam-san told me that Linux had a few I/O
schedulers but I'm not familiar with them.  I'll find information
about them (how to change the scheduler settings) and try the same
test.


----- Original Message ----- 
From: "Simon Riggs" <simon@2ndquadrant.com>
To: "Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com>
Cc: "ITAGAKI Takahiro" <itagaki.takahiro@oss.ntt.co.jp>;
<pgsql-hackers@postgresql.org>
Sent: Thursday, December 28, 2006 7:07 AM
Subject: Re: [HACKERS] Load distributed checkpoint


> On Mon, 2006-12-18 at 14:47 +0900, Takayuki Tsunakawa wrote:
>> Hello, Itagaki-san, all
>>
>> Sorry for my long mail.  I've had trouble in sending this mail
because
>> it's too long for pgsql-hackers to accept (I couldn't find how
large
>> mail is accepted.)  So I'm trying to send several times.
>> Please see the attachment for the content.
>
> Your results for fsync are interesting.
>
> I've noticed that a checkpoint seems to increase the activity on the
WAL
> drive as well as increasing I/O wait times. That doesn't correspond
to
> any real increase in WAL traffic I'm aware of.
>
> Have you tried setting deadline scheduler on the WAL device and CFQ
on
> the data device? That should allow the I/Os to move through
different
> queues and prevent interference.
>
> -- 
>  Simon Riggs
>  EnterpriseDB   http://www.enterprisedb.com
>
>
>

Re: Load distributed checkpoint

From

ITAGAKI Takahiro

Date:

10 January 2007, 00:51:39

I wrote:
> I'm thinking about generalizing your idea; Adding three parameters
> to control sleeps in each stage.

I sent a patch to -patches that adds 3+1 GUC parameters for checkpoints.
We can use three of them to control sleeps in each stage during checkpoints.
The last is an experimental approach to replace fsync() for fine control.

1. checkpoint_write_duration (default=0, in seconds)       Sets the duration of write() phase in checkpoints.
2. checkpoint_nap_duration (default=0, in seconds)       Sets the duration between write() and fsync() phases in
checkpoints.
3. checkpoint_sync_duration (default=0, in seconds)       Sets the duration of fsync() phase in checkpoints.

The 1st parameter spreads write(). If you set checkpoint_write_duration
to 90% of checkpoint_timeout, it's just same as the patch I sent before.

The 2nd is naptime between write() and fsync() phases. Kernel's writer might
work much if you set it to around 30-60s, that might be useful for some
traditional UNIXes, as you say. In contrast, the 1st was the most variable
in my machine somehow (Windows and Linux).

The 3rd spreads fsync(). This parameter only works when you have several
tables or a very large table (that consists of some 1GB of files), because
fsync() is on a file basis.

Bruce Momjian <bruce@momjian.us> wrote:
> To summarize, if we could have fsync() only write the dirty buffers that
> happened as part of the checkpoint, we could delay the write() for the
> entire time between checkpoints, but we can't do that, so we have to
> make it user-tunable.

The 3rd has the above limitation so that I added another parameter.

4. checkpoint_sync_size (default=0, in KB)       Sets the synchronization unit of data files in checkpoints.

It uses sync_file_range or mmap/msync/munmap to divide file-synchronization
into specified granularity. I think 16-64MB fits the machines in that
performance are restricted by fsync() in checkpoints.

The feature is uncompleted. For example, sync_file_range does not flush
metadata of files in fact (it's equivalent of fdatasync), so we may lose
data under the patch. It must be fixed, but I want to measure the advantage
before that.

I'm interested in which parameter is useful for each environment.
Any comments and testing reports will be appreciated.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Re: Load distributed checkpoint

From

"Inaam Rana"

Date:

11 January 2007, 14:04:17

No, I've not tried yet. Inaam-san told me that Linux had a few I/O
schedulers but I'm not familiar with them. I'll find information
about them (how to change the scheduler settings) and try the same
test.

I am sorry, your response just slipped by me. The docs for RHEL (I believe you are running RHEL which has 2.6.9 kernel) say that it does support selectable IO scheduler.

http://www.redhat.com/rhel/details/limits/

I am not sure where else to look for scheduler apart from /sys

regards,
inaam

Re: Load distributed checkpoint

From

Bruce Momjian

Date:

02 February 2007, 18:55:05

Thread added to TODO list:
* Reduce checkpoint performance degredation by forcing data to disk  more evenly
http://archives.postgresql.org/pgsql-hackers/2006-12/msg00337.php
http://archives.postgresql.org/pgsql-hackers/2007-01/msg00079.php


---------------------------------------------------------------------------

ITAGAKI Takahiro wrote:
> This is a proposal for load distributed checkpoint.
> (It is presented on postgresql anniversary summit in last summer.)
> 
> 
> We offen encounters performance gap during checkpoint. The reason is write
> bursts. Storage devices are too overworked in checkpoint, so they can not
> supply usual transaction processing.
> 
> Checkpoint consists of the following four steps, and the major performance
> problem is 2nd step. All dirty buffers are written without interval in it.
> 
>  1. Query information (REDO pointer, next XID etc.)
>  2. Write dirty pages in buffer pool
>  3. Flush all modified files
>  4. Update control file
> 
> I suggested to write pages with sleeping in 2nd step, using normal activity
> of the background writer. It is something like cost-based vacuum delay.
> Background writer has two pointers, 'ALL' and 'LRU', indicating where to 
> write out in buffer pool. We can wait for the ALL clock-hand going around
> to guarantee all pages to be written.
> 
> Here is pseudo-code for the proposed method. The internal loop is just the
> same as bgwriter's activity.
> 
>   PrepareCheckPoint();  -- do step 1
>   Reset num_of_scanned_pages by ALL activity;
>   do {
>       BgBufferSync();   -- do a part of step 2
>       sleep(bgwriter_delay);
>   } while (num_of_scanned_pages < shared_buffers);
>   CreateCheckPoint();   -- do step 3 and 4
> 
> 
> We may accelerate background writer to reduce works at checkpoint instead of
> the method, but it introduces another performance problem; Extra pressure
> is always put on the storage devices to keep the number of dirty pages low.
> 
> 
> I'm working about adjusting the progress of checkpoint to checkpoint timeout
> and wal segments limitation automatically to avoid overlap of two checkpoints.
> I'll post a patch sometime soon.
> 
> Comments and suggestions welcome.
> 
> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 6: explain analyze is your friend

--  Bruce Momjian   bruce@momjian.us EnterpriseDB    http://www.enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +

Re: Load distributed checkpoint

From

"Inaam Rana"

Date:

26 February 2007, 14:20:55

On 12/19/06, ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> wrote:

"Takayuki Tsunakawa" <tsunakawa.takay@jp.fujitsu.com> wrote:

> I performed some simple tests, and I'll show the results below.

> (1) The default case
> 235  80  226 77  240
> (2) No write case
> 242  250  244  253  280
> (3) No checkpoint case
> 229  252  256  292  276
> (4) No fsync() case
> 236  112  215  216  221
> (5) No write by PostgreSQL, but fsync() by another program case
> 9  223  260  283  292
> (6) case (5) + O_SYNC by write_fsync
> 97  114  126  112  125
> (7) O_SYNC case
> 182  103  41  50  74

I posted a patch to PATCHES. Please try out it.
It does write() smoothly, but fsync() at a burst.
I suppose the result will be between (3) and (5).

Itagaki,

Did you had a chance to look into this any further? We, at EnterpriseDB, have done some testing on this patch (dbt2 runs) and it looks like we are getting the desired results, particularly so when we spread out both sync and write phases.

--
Inaam Rana
EnterpriseDB http://www.enterprisedb.com

Re: Load distributed checkpoint

From

ITAGAKI Takahiro

Date:

26 February 2007, 20:40:59

"Inaam Rana" <inaamrana@gmail.com> wrote:

> Did you had a chance to look into this any further? We, at EnterpriseDB,
> have done some testing on this patch (dbt2 runs) and it looks like we are
> getting the desired results, particularly so when we spread out both sync
> and write phases.

Thank you for testing! Yes, I'm cleaning the patch. I changed configuration
parameters to delay each phase in checkpoints from setting absolute times
(checkpoint_xxx_duration) to setting relative to checkpoint_timeout
(checkpoint_xxx_percent). Delay factors strongly depend on total checkpoint
lengths, so I think the relative method is better.

I tested the patch on my machine, too. The belows are the results of
pgbench with/without the patch. As you see, the patch is not a complete
solution -- 12s of response time is not satisfiable yet -- but it was
better than other possible settings we could choose now.

pgbench -s100 -c16 -t100000 on the machine with 1GB ram and one SCSI drive
                         |   A    |   B    |   C    |   D    |
--------------------------+--------+--------+--------+--------+bgwriter_all_maxpages    | 5      | 5      | 60     |
120   |checkpoint_write_percent | 50.0   | 0      | 0      | 0      |checkpoint_nap_percent   | 10.0   | 0      | 0
| 0      |checkpoint_sync_percent  | 20.0   | 0      | 0      | 0      |

--------------------------+--------+--------+--------+--------+pgbench tps              | 612.23 | 517.64 | 488.90 |
378.39|response average         | 2.50ms | 2.89ms | 3.12ms | 3.99ms |response maximum         | 12.23s |123.66s |
55.09s| 36.72s |

--------------------------+--------+--------+--------+--------+

A: Proposed configurations with patch
B: Default configurations
C: Make bgwriter aggressive
D: Make bgwriter more aggressive

Other configurations:
- full_page_writes = off
- wal_buffers = 4MB
- checkpoint_segments = 32
- checkpoint_timeout = 15min

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Re: Load distributed checkpoint

From

Josh Berkus

Date:

26 February 2007, 21:46:34

Itagaki,

> Thank you for testing! Yes, I'm cleaning the patch. I changed
> configuration parameters to delay each phase in checkpoints from setting
> absolute times (checkpoint_xxx_duration) to setting relative to
> checkpoint_timeout (checkpoint_xxx_percent). Delay factors strongly
> depend on total checkpoint lengths, so I think the relative method is
> better.

Can I have a copy of the patch to add to the Sun testing queue?

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

Re: Load distributed checkpoint

From

ITAGAKI Takahiro

Date:

26 February 2007, 22:10:14

Josh Berkus <josh@agliodbs.com> wrote:

> Can I have a copy of the patch to add to the Sun testing queue?

This is the revised version of the patch. Delay factors in checkpoints
can be specified by checkpoint_write_percent, checkpoint_nap_percent
and checkpoint_sync_percent. They are relative to checkpoint_timeout.

Also, checking of archive_timeout during checkpoints and some error
handling routines were added.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Attachment

checkpoint_02-27.patch.gz

Re: [PATCHES] Load distributed checkpoint

From

"Inaam Rana"

Date:

27 February 2007, 02:37:47

On 2/26/07, ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> wrote:

Josh Berkus <josh@agliodbs.com> wrote:

> Can I have a copy of the patch to add to the Sun testing queue?

This is the revised version of the patch. Delay factors in checkpoints
can be specified by checkpoint_write_percent, checkpoint_nap_percent
and checkpoint_sync_percent. They are relative to checkpoint_timeout.

Also, checking of archive_timeout during checkpoints and some error
handling routines were added.

One of the issues we had during testing with original patch was db stop not working properly. I think you coded something to do a stop checkpoint in immediately but if a checkpoint is already in progress at that time, it would take its own time to complete.
Does this patch resolve that issue? Also, is it based on pg82stable or HEAD?

regards,
inaam

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

--
Inaam Rana
EnterpriseDB http://www.enterprisedb.com

Re: [PATCHES] Load distributed checkpoint

From

ITAGAKI Takahiro

Date:

27 February 2007, 04:08:47

"Inaam Rana" <inaamrana@gmail.com> wrote:

> One of the issues we had during testing with original patch was db stop not
> working properly. I think you coded something to do a stop checkpoint in
> immediately but if a checkpoint is already in progress at that time, it
> would take its own time to complete.
> Does this patch resolve that issue?

Yes, I fixed the problem. If a checkpoint by user SQL or shutdown is waiting
during an automatic checkpoint, the running checkpoint will be done without
any delays. At the worst case, you have to wait two checkpoints, (a running
automatic checkpoint and your explicit request) but nothing more of them.

> Also, is it based on pg82stable or HEAD?

It's based on HEAD.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center