Thread: Load distributed checkpoint V3

Load distributed checkpoint V3

From
ITAGAKI Takahiro
Date:
Folks,

Here is the latest version of Load distributed checkpoint patch.

I've fixed some bugs, including in cases of missing file errors
and overlapping of asynchronous checkpoint requests.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center

Attachment

Re: Load distributed checkpoint V3

From
Bruce Momjian
Date:
Your patch has been added to the PostgreSQL unapplied patches list at:

    http://momjian.postgresql.org/cgi-bin/pgpatches

It will be applied as soon as one of the PostgreSQL committers reviews
and approves it.

---------------------------------------------------------------------------


ITAGAKI Takahiro wrote:
> Folks,
>
> Here is the latest version of Load distributed checkpoint patch.
>
> I've fixed some bugs, including in cases of missing file errors
> and overlapping of asynchronous checkpoint requests.
>
> Regards,
> ---
> ITAGAKI Takahiro
> NTT Open Source Software Center

[ Attachment, skipping... ]


--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: Load distributed checkpoint V3

From
Greg Smith
Date:
On Fri, 23 Mar 2007, ITAGAKI Takahiro wrote:

> Here is the latest version of Load distributed checkpoint patch.

Couple of questions for you:

-Is it still possible to get the original behavior by adjusting your
tunables?  It would be nice to do a before/after without having to
recompile, and I know I'd be concerned about something so different
becoming the new default behavior.

-Can you suggest a current test case to demonstrate the performance
improvement here?  I've tried several variations on stretching out
checkpoints like you're doing here and they all made slow checkpoint
issues even worse on my Linux system.  I'm trying to evaluate this fairly.

-This code operates on the assumption you have a good value for the
checkpoint timeout.  Have you tested its behavior when checkpoints are
being triggered by checkpoint_segments being reached instead?

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Load distributed checkpoint V3

From
ITAGAKI Takahiro
Date:
Greg Smith <gsmith@gregsmith.com> wrote:

> > Here is the latest version of Load distributed checkpoint patch.
>
> Couple of questions for you:
>
> -Is it still possible to get the original behavior by adjusting your
> tunables?  It would be nice to do a before/after without having to
> recompile, and I know I'd be concerned about something so different
> becoming the new default behavior.

Yes, if you want the original behavior, please set all of
checkpoint_[write|nap|sync]_percent to zero. They can be changed
at SIGHUP timing (pg_ctl reload). The new default configuration
is write/nap/sync = 50%/10%/20%. There might be room for discussion
in the choice of those values.
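
Just to make it concrete, the relevant postgresql.conf lines would look
roughly like this (only a sketch; exact syntax per the patch):

    # new defaults with the patch
    checkpoint_write_percent = 50
    checkpoint_nap_percent = 10
    checkpoint_sync_percent = 20

    # restore the original (non-distributed) checkpoint behavior
    checkpoint_write_percent = 0
    checkpoint_nap_percent = 0
    checkpoint_sync_percent = 0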


> -Can you suggest a current test case to demonstrate the performance
> improvement here?  I've tried several variations on stretching out
> checkpoints like you're doing here and they all made slow checkpoint
> issues even worse on my Linux system.  I'm trying to evaluate this fairly.

You might need to increase checkpoint_segments and checkpoint_timeout.
Here are the results on my machine:
    http://archives.postgresql.org/pgsql-hackers/2007-02/msg01613.php
I set the values to 32 segments and 15 minutes to take advantage of it
for pgbench -s100 at the time.


> -This code operates on the assumption you have a good value for the
> checkpoint timeout.  Have you tested its behavior when checkpoints are
> being triggered by checkpoint_segments being reached instead?

This patch does not work fully when checkpoints are triggered by segments.
The write phase still works because it refers to consumption of segments,
but the nap and fsync phases only check elapsed time. I'm assuming
checkpoints are triggered by timeout in normal use -- and it's my
recommended configuration whether the patch is installed or not.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center



Re: Load distributed checkpoint V3

From
Greg Smith
Date:
On Mon, 26 Mar 2007, ITAGAKI Takahiro wrote:

> I'm assuming checkpoints are triggered by timeout in normal use -- and
> it's my recommended configuration whether the patch is installed or not.

I'm curious what other people running fairly serious hardware do in this
area for write-heavy loads, whether it's timeout or segment limits that
normally trigger their checkpoints.

I'm testing on a slightly different class of machine than your sample
results, something that is in the 1500 TPS range running the pgbench test
you describe.  Running that test, I always hit the checkpoint_segments
wall well before any reasonable timeout.  With 64 segments, I get a
checkpoint every two minutes or so.
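
(Rough arithmetic, assuming the standard 16MB WAL segments: filling 64
segments every two minutes is 64 * 16MB / 120s, or roughly 8.5MB/s of
sustained WAL generation, so a checkpoint_timeout measured in several
minutes never gets a chance to fire on this workload.)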

There's something I'm working on this week that may help out other people
trying to test your patch out.  I've put together some simple scripts that
graph (patched) pgbench results, which make it very easy to see what
changes when you alter the checkpoint behavior.  The edges are still rough,
but the scripts work for me; I'll be polishing and testing over the next few
days:

http://www.westnet.com/~gsmith/content/postgresql/pgbench.htm

(Note that the example graphs there aren't from the production system I
mentioned above, they're from my server at home, which is similar to the
system your results came from).

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Load distributed checkpoint V3

From
Heikki Linnakangas
Date:
ITAGAKI Takahiro wrote:
> Here is the latest version of Load distributed checkpoint patch.

Unfortunately because of the recent instrumentation and
CheckpointStartLock patches this patch doesn't apply cleanly to CVS HEAD
anymore. Could you fix the bitrot and send an updated patch, please?

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Load distributed checkpoint V3

From
Heikki Linnakangas
Date:
ITAGAKI Takahiro wrote:
> Here is the latest version of Load distributed checkpoint patch.

Bgwriter has two goals:
1. keep enough buffers clean that normal backends never need to do a write
2. smooth checkpoints by writing buffers ahead of time

Load distributed checkpoints will do 2. in a much better way than the
bgwriter_all_* guc options. I think we should remove that aspect of
bgwriter in favor of this patch.

The scheduling of bgwriter gets quite complicated with the patch. If I'm
reading it correctly, bgwriter will keep periodically writing buffers to
achieve 1. while the "write"-phase of checkpoint is in progress. That
makes sense; now that checkpoints take longer, we would miss goal 1.
otherwise. But we don't do that in the "sleep-between-write-and-fsync"-
and "fsync"-phases. We should, shouldn't we?

I'd suggest rearranging the code so that BgBufferSync and mdsync would
basically stay like they are without the patch; the signature wouldn't
change. To do the naps during a checkpoint, inject calls to new
functions like CheckpointWriteNap() and CheckpointFsyncNap() inside
BgBufferSync and mdsync. Those nap functions would check if enough
progress has been made since last call and sleep if so.

The piece of code that implements 1. would be refactored to a new
function, let's say BgWriteLRUBuffers(). The nap-functions would call
BgWriteLRUBuffers if more than bgwriter_delay milliseconds have passed
since last call to it.

This way the changes to CreateCheckpoint, BgBufferSync and mdsync would
be minimal, and bgwriter would keep cleaning buffers for normal backends
during the whole checkpoint.
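
A rough sketch of the nap function I have in mind (all helper names are
hypothetical, just to show the shape; BgWriterDelay is the existing
bgwriter_delay variable):

    /* Hypothetical sketch -- none of these helper names exist in the tree. */
    static void
    CheckpointWriteNap(void)
    {
        /*
         * While we're ahead of the schedule implied by
         * checkpoint_write_percent, sleep in bgwriter_delay sized slices.
         */
        while (CheckpointAheadOfSchedule())
        {
            /*
             * Keep goal 1 satisfied while the checkpoint is stretched out:
             * clean buffers just ahead of the clock sweep for the backends.
             */
            if (TimeSinceLastLRURunMs() >= BgWriterDelay)
                BgWriteLRUBuffers();

            pg_usleep(BgWriterDelay * 1000L);
        }
    }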

Another thought is to have a separate checkpointer-process so that the
bgwriter process can keep cleaning dirty buffers while the checkpoint is
running in a separate process. One problem with that is that we
currently collect all the fsync requests in bgwriter. If we had a
separate checkpointer process, we'd need to do that in the checkpointer
instead, and bgwriter would need to send a message to the checkpointer
every time it flushes a buffer, which would be a lot of chatter.
Alternatively, bgwriter could somehow pass the pendingOpsTable to the
checkpointer process at the beginning of the checkpoint, but that's not
exactly trivial either.

PS. Great that you're working on this. It's a serious problem under
heavy load.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Load distributed checkpoint V3

From
Greg Smith
Date:
On Thu, 5 Apr 2007, Heikki Linnakangas wrote:

> Unfortunately because of the recent instrumentation and CheckpointStartLock
> patches this patch doesn't apply cleanly to CVS HEAD anymore. Could you fix
> the bitrot and send an updated patch, please?

The "Logging checkpoints and other slowdown causes" patch I submitted
touches some of the same code as well, that's another possible merge
coming depending on what order this all gets committed in.  Running into
what I dubbed perpetual checkpoints was one of the reasons I started
logging timing information for the various portions of the checkpoint, to
tell when it was bogged down with slow writes versus being held up in sync
for various (possibly fixed with your CheckpointStartLock) issues.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Load distributed checkpoint V3

From
Greg Smith
Date:
On Thu, 5 Apr 2007, Heikki Linnakangas wrote:

> Bgwriter has two goals:
> 1. keep enough buffers clean that normal backends never need to do a write
> 2. smooth checkpoints by writing buffers ahead of time
> Load distributed checkpoints will do 2. in a much better way than the
> bgwriter_all_* guc options. I think we should remove that aspect of bgwriter
> in favor of this patch.

My first question about the LDC patch was whether I could turn it off and
return to the existing mechanism.  I would like to see a large pile of
data proving this new approach is better before the old one goes away.  I
think everyone needs to do some more research and measurement here before
assuming the problem can be knocked out so easily.

The reason I've been busy working on patches to gather statistics on this
area of code is that I've tried most of the simple answers to getting the
background writer to work better and made little progress, and I'd like to
see everyone else doing the same, at least collecting the right data.

Let me suggest a different way of looking at this problem.  At any moment,
some percentage of your buffer pool is dirty.  Whether it's 0% or 100%
dramatically changes what the background writer should be doing.  Whether
most of the data is usage_count>0 or not also makes a difference.  None of
the current code has any idea what type of buffer pool it's working
with, and therefore it doesn't have enough information to make a
well-informed prediction about what is going to happen in the near future.

I'll tell you what I did to the all-scan.  I ran a few hundred hours worth
of background writer tests to collect data on what it does wrong, then
wrote a prototype automatic background writer that resets the all-scan
parameters based on what I found.  It keeps a running estimate of how
dirty the pool at large is using a weighted average of the most recent
scan with the past history.  From there, I have a simple model that
predicts how much of the buffer we can scan in any interval, and intends
to enforce a maximum bound on the amount of physical I/O you're willing to
stream out.  The beta code is sitting at
http://www.westnet.com/~gsmith/content/postgresql/bufmgr.c if you want to
see what I've done so far.  The parts that are done work fine--as long as
you give it a reasonable % to scan by default, it will correct
all_max_pages and the interval in real-time to meet the scan rate
you requested given how much is currently dirty; the I/O rate is
computed but doesn't limit properly yet.
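
Stripped way down, the estimate itself is along these lines (simplified
sketch, not the exact code in the bufmgr.c linked above):

    /* Simplified sketch of the dirty-pool estimate; not the actual code. */
    #define DIRTY_SMOOTHING_WEIGHT  0.25    /* weight given to the newest scan */

    static double dirty_fraction_est = 0.0; /* running estimate, 0.0 - 1.0 */

    /* Fold the result of the latest all-scan into the running estimate. */
    static void
    update_dirty_estimate(int buffers_scanned, int buffers_found_dirty)
    {
        double  latest = (double) buffers_found_dirty / (double) buffers_scanned;

        dirty_fraction_est = DIRTY_SMOOTHING_WEIGHT * latest +
            (1.0 - DIRTY_SMOOTHING_WEIGHT) * dirty_fraction_est;
    }

    /*
     * Given a per-round I/O budget in pages, how many buffers can the
     * all-scan afford to examine this round?
     */
    static int
    buffers_to_scan(int io_budget_pages)
    {
        if (dirty_fraction_est < 0.01)
            return io_budget_pages * 100;   /* nearly clean pool, scan freely */
        return (int) (io_budget_pages / dirty_fraction_est);
    }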

Why haven't I brought this all up yet?  Two reasons.  The first is because
it doesn't work on my system; checkpoints and overall throughput get worse
when you try to shorten them by running the background writer at optimal
aggressiveness.  Under really heavy load, the writes slow down as all the
disk caches fill, the background writer fights with reads on the data that
isn't in the mostly dirty cache (introducing massive seek delays), it
stops cleaning effectively, and it's better for it to not even try.  My
next generation of code was going to start with the LRU flush and then
only move onto the all-scan if there's time leftover.

The second is that I just started to get useful results here in the last
few weeks, and I assumed it's too big of a topic to start suggesting major
redesigns to the background writer mechanism at that point (from me at
least!).  I was waiting for 8.3 to freeze before even trying.  If you want
to push through a redesign there, maybe you can get away with it at this
late moment.  But I ask that you please don't remove anything from the
current design until you have significant test results to back up that
change.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Load distributed checkpoint V3

From
Heikki Linnakangas
Date:
Greg Smith wrote:
> On Thu, 5 Apr 2007, Heikki Linnakangas wrote:
>
>> Bgwriter has two goals:
>> 1. keep enough buffers clean that normal backends never need to do a
>> write
>> 2. smooth checkpoints by writing buffers ahead of time
>> Load distributed checkpoints will do 2. in a much better way than the
>> bgwriter_all_* guc options. I think we should remove that aspect of
>> bgwriter in favor of this patch.
>
> ...
>
> Let me suggest a different way of looking at this problem.  At any
> moment, some percentage of your buffer pool is dirty.  Whether it's 0%
> or 100% dramatically changes what the background writer should be
> doing.  Whether most of the data is usage_count>0 or not also makes a
> difference.  None of the current code has any idea what type of buffer
> pool it's working with, and therefore it doesn't have enough
> information to make a well-informed prediction about what is going to
> happen in the near future.

The purpose of the bgwriter_all_* settings is to shorten the duration of
the eventual checkpoint. The reason to shorten the checkpoint duration
is to limit the damage to other I/O activity it causes. My thinking is
that assuming the LDC patch is effective (agreed, needs more testing) at
smoothening the checkpoint, the duration doesn't matter anymore. Do you
want to argue there's other reasons to shorten the checkpoint duration?

> I'll tell you what I did to the all-scan.  I ran a few hundred hours
> worth of background writer tests to collect data on what it does wrong,
> then wrote a prototype automatic background writer that resets the
> all-scan parameters based on what I found.  It keeps a running estimate
> of how dirty the pool at large is using a weighted average of the most
> recent scan with the past history.  From there, I have a simple model
> that predicts how much of the buffer we can scan in any interval, and
> intends to enforce a maximum bound on the amount of physical I/O you're
> willing to stream out.  The beta code is sitting at
> http://www.westnet.com/~gsmith/content/postgresql/bufmgr.c if you want
> to see what I've done so far.  The parts that are done work fine--as
> long as you give it a reasonable % to scan by default, it will correct
> all_max_pages and the interval in real-time to meet the scan rate
> you requested given how much is currently dirty; the I/O rate is
> computed but doesn't limit properly yet.

Nice. Enforcing a max bound on the I/O seems reasonable, if we accept
that shortening the checkpoint is a goal.

> Why haven't I brought this all up yet?  Two reasons.  The first is
> because it doesn't work on my system; checkpoints and overall throughput
> get worse when you try to shorten them by running the background writer
> at optimal aggressiveness.  Under really heavy load, the writes slow
> down as all the disk caches fill, the background writer fights with
> reads on the data that isn't in the mostly dirty cache (introducing
> massive seek delays), it stops cleaning effectively, and it's better for
> it to not even try.  My next generation of code was going to start with
> the LRU flush and then only move onto the all-scan if there's time
> leftover.
>
> The second is that I just started to get useful results here in the last
> few weeks, and I assumed it's too big of a topic to start suggesting
> major redesigns to the background writer mechanism at that point (from
> me at least!).  I was waiting for 8.3 to freeze before even trying.  If
> you want to push through a redesign there, maybe you can get away with
> it at this late moment.  But I ask that you please don't remove anything
> from the current design until you have significant test results to back
> up that change.

Point taken. I need to start testing the LDC patch.

Since we're discussing this, let me tell you what I've been thinking about
the LRU cleaning behavior of bgwriter. ISTM that that's more
straightforward to tune automatically. Bgwriter basically needs to
ensure that the next X buffers with usage_count=0 in the clock sweep are
clean. X is the predicted number of buffers backends will evict until
the next bgwriter round.

The number of buffers evicted by normal backends in a bgwriter_delay
period is simple to keep track of, just increase a counter in
StrategyGetBuffer and reset it when bgwriter wakes up. We can use that
as an estimate of X with some safety margin.
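
In rough pseudo-C (names made up; in reality the counter would live in
shared memory, e.g. in the StrategyControl struct, so backends can bump it):

    /*
     * Sketch only.  evictions_since_wakeup would be incremented in
     * StrategyGetBuffer() each time a normal backend has to run the clock
     * sweep and evict a buffer itself.
     */
    static int evictions_since_wakeup = 0;

    /* Called from the bgwriter loop, once per bgwriter_delay. */
    static int
    lru_cleaning_target(void)
    {
        int     evicted = evictions_since_wakeup;

        evictions_since_wakeup = 0;     /* reset for the next round */

        /*
         * X = predicted number of evictions before the next round; use the
         * last round's count with a 2x safety margin.
         */
        return evicted * 2;
    }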

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Load distributed checkpoint V3

From
Tom Lane
Date:
Heikki Linnakangas <heikki@enterprisedb.com> writes:
> The number of buffers evicted by normal backends in a bgwriter_delay
> period is simple to keep track of, just increase a counter in
> StrategyGetBuffer and reset it when bgwriter wakes up. We can use that
> as an estimate of X with some safety margin.

You'd want some kind of moving-average smoothing in there, probably with
a lot shorter ramp-up than ramp-down time constant, but this seems
reasonable enough to try.

            regards, tom lane

Re: Load distributed checkpoint V3

From
Heikki Linnakangas
Date:
Tom Lane wrote:
> Heikki Linnakangas <heikki@enterprisedb.com> writes:
>> The number of buffers evicted by normal backends in a bgwriter_delay
>> period is simple to keep track of, just increase a counter in
>> StrategyGetBuffer and reset it when bgwriter wakes up. We can use that
>> as an estimate of X with some safety margin.
>
> You'd want some kind of moving-average smoothing in there, probably with
> a lot shorter ramp-up than ramp-down time constant, but this seems
> reasonable enough to try.

Ironically, I just noticed that we already have a patch in the patch
queue that implements exactly that, again by Itagaki. I need to start
paying more attention :-). Keep up the good work!

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Load distributed checkpoint V3

From
"Takayuki Tsunakawa"
Date:
Hello, long time no see.

I'm sorry to interrupt your discussion. I'm afraid the code is getting
more complicated in order to keep using fsync(). Though I don't intend to
say the current approach is wrong, could anyone evaluate again the O_SYNC
approach that commercial databases use, and tell me if and why
PostgreSQL's fsync() approach is better than theirs?

This January, I got a good result with O_SYNC, which I haven't
reported here. I'll show it briefly. Please forgive me for my abrupt
email, because I don't have enough time.
# Personally, I want to work in the community, if I'm allowed.
And sorry again: I reported last year that O_SYNC resulted in very bad
performance, but that was wrong. The PC server I borrowed was configured
so that all the disks formed one RAID5 device, so the disks for data and
WAL (/dev/sdd and /dev/sde) came from the same RAID5 device, resulting
in I/O contention.

What I modified is md.c only. I just added O_SYNC to the open flag in
mdopen() and _mdfd_openseg(), if am_bgwriter is true. I didn't want
backends to use O_SYNC because mdextend() does not have to transfer
data to disk.
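
Roughly, the change looked like this (reconstructed sketch from memory,
not the exact diff; the helper function is just for illustration):

    /*
     * Sketch: how mdopen() and _mdfd_openseg() chose their open flags in
     * the experiment.
     */
    static int
    data_file_open_flags(void)
    {
        int     flags = O_RDWR | PG_BINARY;

        /*
         * Only bgwriter opens data files with O_SYNC; backends keep plain
         * buffered writes because mdextend() does not need to transfer
         * data to disk.
         */
        if (am_bgwriter)
            flags |= O_SYNC;

        return flags;
    }

    /* then in mdopen():  fd = PathNameOpenFile(path, data_file_open_flags(), 0600); */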

My evaluation environment was:

CPU: Intel Xeon 3.2GHz * 2 (HT on)
Memory: 4GB
Disk: Ultra320 SCSI (perhaps configured as write back)
OS: RHEL3.0 Update 6
Kernel: 2.4.21-37.ELsmp
PostgreSQL: 8.2.1

The relevant settings of PostgreSQL are:

shared_buffers = 2GB
wal_buffers = 1MB
wal_sync_method = open_sync
checkpoint_* and bgwriter_* parameters were left at their defaults.

I used pgbench with data at scaling factor 50.


[without O_SYNC, original behavior]
- pgbench -c1 -t16000
  best response: 1ms
  worst response: 6314ms
  10th worst response: 427ms
  tps: 318
- pgbench -c32 -t500
  best response: 1ms
  worst response: 8690ms
  10th worst response: 8668ms
  tps: 330

[with O_SYNC]
- pgbench -c1 -t16000
  best response: 1ms
  worst response: 350ms
  10th worst response: 91ms
  tps: 427
- pgbench -c32 -t500
  best response: 1ms
  worst response: 496ms
  10th worst response: 435ms
  tps: 1117

If the write-back cache were disabled, the difference would be
smaller. The Windows version showed similar improvements.


However, this approach has two big problems.


(1) Slows down bulk updates

Updates of large amounts of data get much slower because bgwriter seeks
and writes dirty buffers synchronously, page by page. For example:

- COPY of accounts (5m records) and CHECKPOINT command after COPY
  without O_SYNC: 100sec
  with O_SYNC: 1046sec
- UPDATE of all records of accounts
  without O_SYNC: 139sec
  with O_SYNC: 639sec
- CHECKPOINT command for flushing 1.6GB of dirty buffers
  without O_SYNC: 24sec
  with O_SYNC: 126sec

To mitigate this problem, I sorted dirty buffers by their relfilenode
and block numbers and wrote multiple pages that are adjacent both in
memory and on disk (a sketch of the sort follows the results below).
The result was:

- COPY of accounts (5m records) and CHECKPOINT command after COPY
  227sec
- UPDATE of all records of accounts
  569sec
- CHECKPOINT command for flushing 1.6GB of dirty buffers
  71sec

Still bad...
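
For reference, the sort itself was simple, roughly like this (sketch,
not the actual code; types are the usual RelFileNode/BlockNumber):

    /* Sort dirty buffers so pages adjacent on disk are written together. */
    typedef struct DirtyPage
    {
        RelFileNode rnode;      /* which relation file */
        BlockNumber blocknum;   /* which block within it */
        int         buf_id;     /* shared buffer holding the page */
    } DirtyPage;

    static int
    dirty_page_cmp(const void *a, const void *b)
    {
        const DirtyPage *x = (const DirtyPage *) a;
        const DirtyPage *y = (const DirtyPage *) b;
        int             c;

        c = memcmp(&x->rnode, &y->rnode, sizeof(RelFileNode));
        if (c != 0)
            return c;
        if (x->blocknum != y->blocknum)
            return (x->blocknum < y->blocknum) ? -1 : 1;
        return 0;
    }

    /*
     * qsort(pages, npages, sizeof(DirtyPage), dirty_page_cmp), then issue
     * one write per run of consecutive blocks within the same file.
     */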


(2) Can't utilize tablespaces

Though I didn't evaluate it, update activity would be much less efficient
with O_SYNC than with fsync() when using multiple tablespaces, because
there is only one bgwriter.


Can anyone solve these problems?
One of my ideas is to use scattered I/O. I hear that readv()/writev()
have been able to do real scattered I/O since kernel 2.6 (RHEL4.0); with
kernels before 2.6, readv()/writev() just performed the I/Os sequentially.
Windows has provided reliable scattered I/O for years.
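
For example, something like this could write several pages that are
scattered in shared memory but consecutive in the file with one system
call (just a sketch; error handling and short writes are ignored):

    #include <sys/uio.h>
    #include <unistd.h>

    #ifndef BLCKSZ
    #define BLCKSZ 8192             /* PostgreSQL's default page size */
    #endif

    #define PAGES_PER_WRITEV 16     /* arbitrary limit for this sketch */

    /*
     * Write n pages that are consecutive on disk starting at start_offset,
     * but scattered in memory, using a single writev() call.
     */
    static ssize_t
    write_adjacent_pages(int fd, off_t start_offset, char *pages[], int n)
    {
        struct iovec    iov[PAGES_PER_WRITEV];
        int             i;

        if (n > PAGES_PER_WRITEV)
            n = PAGES_PER_WRITEV;
        for (i = 0; i < n; i++)
        {
            iov[i].iov_base = pages[i];
            iov[i].iov_len = BLCKSZ;
        }
        if (lseek(fd, start_offset, SEEK_SET) < 0)
            return -1;
        return writev(fd, iov, n);
    }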

Another idea is to use async I/O, possibly combined with a
multiple-bgwriter approach on platforms where async I/O is not available.
How about the opportunity Josh-san has brought?



Re: Load distributed checkpoint V3

From
Greg Smith
Date:
On Thu, 5 Apr 2007, Heikki Linnakangas wrote:

> The purpose of the bgwriter_all_* settings is to shorten the duration of
> the eventual checkpoint. The reason to shorten the checkpoint duration
> is to limit the damage to other I/O activity it causes. My thinking is
> that assuming the LDC patch is effective (agreed, needs more testing) at
> smoothening the checkpoint, the duration doesn't matter anymore. Do you
> want to argue there's other reasons to shorten the checkpoint duration?

My testing results suggest that LDC doesn't smooth the checkpoint usefully
when under a high (>30 client here) load, because (on Linux at least) the
way the OS caches writes clashes badly with how buffers end up being
evicted if the buffer pool fills back up before the checkpoint is done.
In that context, anything that stretches out the checkpoint is going
to make the problem worse rather than better, because it makes it more
likely that the tail end of the checkpoint will have to fight with the
clients for write bandwidth, at which point they both suffer.  If you just
get the checkpoint done fast, the clients can't fill the pool as fast as
the BufferSync is writing it out, and things are as happy as they can be
without a major rewrite to all this code.  I can get a tiny improvement in
some respects by delaying 2-5 seconds between finishing the writes and
calling fsync, because that gives Linux a moment to usefully spool some of
the data to the disk controller's cache; beyond that any additional delay
is a problem.

Since it's only the high load cases I'm having trouble dealing with, this
basically makes it a non-starter for me.  The focus on checkpoint_timeout
and ignoring checkpoint_segments in the patch is also a big issue for me.
At the same time, I recognize that the approach taken in LDC probably is a
big improvement for many systems, it's just a step backwards for my
highest throughput one.  I'd really enjoy hearing some results from
someone else.

> The number of buffers evicted by normal backends in a bgwriter_delay period
> is simple to keep track of, just increase a counter in StrategyGetBuffer and
> reset it when bgwriter wakes up.

I see you've already found the other helpful Itagaki patch in this area.
I know I would like to see his code for tracking evictions committed, then
I'd like that to be added as another counter in pg_stat_bgwriter (I
mentioned that to Magnus in passing when he was setting the stats up but
didn't press it because of the patch dependency).  Ideally, and this idea
was also in Itagaki's patch with the writtenByBgWriter/ByBackEnds debug
hook, I think it's important that you know how every buffer written to
disk got there--was it a background writer, a checkpoint, or an eviction
that wrote it out?  Track all those and you can really learn something
about your write performance, data that's impossible to collect right now.

However, as Itagaki himself points out, doing something useful with
bgwriter_lru_maxpages is only one piece of automatically tuning the
background writer.  I hate to join in on chopping his patches up, but
without some additional work I don't think the exact auto-tuning logic he
then applies will work in all cases, which could make it more of a problem
than the current crude yet predictable method.  The whole way
bgwriter_lru_maxpages and num_to_clean play off each other in his code
currently has a number of failure modes I'm concerned about.  I'm not sure
if a re-write using a moving-average approach (as I did in my auto-tuning
writer prototype and as Tom just suggested here) will be sufficient to fix
all of them.  It was already on my to-do list to investigate that further.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Load distributed checkpoint V3

From
Greg Smith
Date:
On Fri, 6 Apr 2007, Takayuki Tsunakawa wrote:

> could anyone evaluate O_SYNC approach again that commercial databases
> use and tell me if and why PostgreSQL's fsync() approach is better than
> theirs?

I noticed a big improvement switching the WAL to use O_SYNC (+O_DIRECT)
instead of fsync on my big and my little servers with battery-backed
cache, so I know sync writes perform reasonably well on my hardware.
Since I've had problems with the fsync at checkpoint time, I did a similar
test to yours recently, adding O_SYNC to the open calls and pulling the
fsyncs out to get a rough idea how things would work.

Performance was reasonable most of the time, but when I hit a checkpoint
with a lot of the buffer cache dirty it was incredibly bad.  It took
minutes to write everything out, compared with a few seconds for the
current case, and the background writer was too sluggish as well to help.
This appears to match your data.

If you compare how Oracle handles their writes and checkpoints to the
Postgres code, it's obvious they have a different architecture that
enables them to support sync writing usefully.  I'd recommend the Database
Writer Process section of
http://www.lc.leidenuniv.nl/awcourse/oracle/server.920/a96524/c09procs.htm
as an introduction for those not familiar with that; it's interesting
reading for anyone tinkering with background writer code.

It would be great to compare performance of the current PostgreSQL code
with a fancy multiple background writer version using the latest sync
methods or AIO; there have actually been multiple updates to improve
O_SYNC writes within Linux during the 2.6 kernel series that make this
more practical than ever on that platform.  But as you've already seen,
the performance hurdle to overcome is significant, and it would have to be
optional as a result.  When you add all this up--have to keep the current
non-sync writes around as well, need to redesign the whole background
writer/checkpoint approach around the idea of sync writes, and the
OS-specific parts that would come from things like AIO--it gets real
messy.  Good luck drumming up support for all that when the initial
benchmarks suggest it's going to be a big step back.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Load distributed checkpoint V3

From
"Takayuki Tsunakawa"
Date:
From: "Greg Smith" <gsmith@gregsmith.com>
> If you compare how Oracle handles their writes and checkpoints to
the
> Postgres code, it's obvious they have a different architecture that
> enables them to support sync writing usefully.  I'd recommend the
Database
> Writer Process section of
>
http://www.lc.leidenuniv.nl/awcourse/oracle/server.920/a96524/c09procs.htm
> as an introduction for those not familiar with that; it's
interesting
> reading for anyone tinking with background writer code.

Hmm... what makes you think that sync writes are useful for Oracle and
not for PostgreSQL? The process architecture is similar; bgwriter
performs most of the writes in PostgreSQL, while DBWn performs all writes
in Oracle. The difference is that Oracle can guarantee crash recovery
time by writing dirty buffers periodically in the order of their LSN.


> It would be great to compare performance of the current PostgreSQL code
> with a fancy multiple background writer version using the latest sync
> methods or AIO; there have actually been multiple updates to improve
> O_SYNC writes within Linux during the 2.6 kernel series that make this
> more practical than ever on that platform.  But as you've already seen,
> the performance hurdle to overcome is significant, and it would have to be
> optional as a result.  When you add all this up--have to keep the current
> non-sync writes around as well, need to redesign the whole background
> writer/checkpoint approach around the idea of sync writes, and the
> OS-specific parts that would come from things like AIO--it gets real
> messy.  Good luck drumming up support for all that when the initial
> benchmarks suggest it's going to be a big step back.

I agree with you that the write method has to be optional until there's
enough data from the field to help determine which is better.

... It's a pity not to utilize async I/O and Josh-san's offer. I hope
it will be used some day. I think OS developers have evolved async I/O
for databases.





Re: Load distributed checkpoint V3

From
"Simon Riggs"
Date:
On Fri, 2007-04-06 at 02:53 -0400, Greg Smith wrote:
> If you compare how Oracle handles their writes and checkpoints to the
> Postgres code, it's obvious they have a different architecture that
> enables them to support sync writing usefully.  I'd recommend the Database
> Writer Process section of
> http://www.lc.leidenuniv.nl/awcourse/oracle/server.920/a96524/c09procs.htm
> as an introduction for those not familiar with that; it's interesting
> reading for anyone tinking with background writer code.

Oracle does have a different checkpointing technique and we know it is
patented, so we need to go carefully there, especially when directly
referencing documentation.

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: Load distributed checkpoint V3

From
Greg Smith
Date:
On Fri, 6 Apr 2007, Takayuki Tsunakawa wrote:

> Hmm... what makes you think that sync writes are useful for Oracle and
> not for PostgreSQL?

They do more to push checkpoint-time work in advance, batch writes up more
efficiently, and never let clients do the writing.  All of which make for
a different type of checkpoint.

Like Simon points out, even if it were conceivable to mimic their design
it might not even be legally feasible.  The point I was trying to make is
this:  you've been saying that Oracle's writing technology has better
performance in this area, which is probably true, and suggesting the cause
of that was their using O_SYNC writes.  I wanted to believe that and even
tested out a prototype.  The reality here appears to be that their
checkpoints go smoother *despite* using the slower sync writes because
they've built their design around the limitations of that write method.

I suspect it would take a similar scale of redesign to move Postgres in
that direction; the issues you identified (the same ones I ran into) are
not so easy to resolve.  You're certainly not going to move anybody in
that direction by throwing a random comment into a discussion on the
patches list about a feature useful *right now* in this area.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Load distributed checkpoint V4

From
ITAGAKI Takahiro
Date:
Here is an updated version of LDC patch (V4).

- Refactored the code to minimize the impact of the changes.
- Progress of checkpoint is controlled not only based on checkpoint_timeout
  but also checkpoint_segments. -- Now it works better with large
  checkpoint_timeout and small checkpoint_segments.

We can control the delay of checkpoints using three parameters:
checkpoint_write_percent, checkpoint_nap_percent and checkpoint_sync_percent.
If we set all of the values to zero, checkpoints behave as they did before.


Heikki Linnakangas <heikki@enterprisedb.com> wrote:

> I'd suggest rearranging the code so that BgBufferSync and mdsync would
> basically stay like they are without the patch; the signature wouldn't
> change. To do the naps during a checkpoint, inject calls to new
> functions like CheckpointWriteNap() and CheckpointFsyncNap() inside
> BgBufferSync and mdsync. Those nap functions would check if enough
> progress has been made since last call and sleep if so.

Yeah, it makes LDC less intrusive. Now the code flow in checkpoints stays
as it was, and the nap-functions are called periodically in BufferSync()
and smgrsync(). But the signatures of some functions needed small changes;
the argument 'immediate' was added.

> The nap-functions would call
> BgWriteLRUBuffers if more than bgwriter_delay milliseconds have passed
> since last call to it.

Only LRU buffers are written in the nap and sync phases in the new patch.
The ALL activity of bgwriter was primarily designed to write dirty buffers
ahead of checkpoints, so those writes are not needed *during* checkpoints.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center


Attachment

Re: Load distributed checkpoint V4

From
Heikki Linnakangas
Date:
ITAGAKI Takahiro wrote:
> Here is an updated version of LDC patch (V4).

Thanks! I'll start testing.

> - Progress of checkpoint is controlled not only based on checkpoint_timeout
>   but also checkpoint_segments. -- Now it works better with large
>   checkpoint_timeout and small checkpoint_segments.

Great, much better now. I like the concept of "progress" used in the
calculations. We might want to call GetCheckpointProgress something
else, though. It doesn't return the amount of progress made, but rather
the amount of progress we should've made up to that point or we're in
danger of not completing the checkpoint in time.
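
Something like "target progress" would describe it better; in essence
(sketch only, the helper names are invented):

    /*
     * How far along (0.0 - 1.0) the checkpoint ought to be by now,
     * according to whichever of the two triggers is closer to firing.
     */
    static double
    CheckpointTargetProgress(void)
    {
        double  time_part;
        double  segs_part;

        /* fraction of checkpoint_timeout already used (invented helper) */
        time_part = SecondsSinceCheckpointStart() / (double) CheckPointTimeout;

        /* fraction of checkpoint_segments already consumed (invented helper) */
        segs_part = SegmentsConsumedSinceCheckpointStart() /
            (double) CheckPointSegments;

        return (time_part > segs_part) ? time_part : segs_part;
    }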

> We can control the delay of checkpoints using three parameters:
> checkpoint_write_percent, checkpoint_nap_percent and checkpoint_sync_percent.
> If we set all of the values to zero, checkpoints behave as they did before.

The nap and sync phases are pretty straightforward. The write phase,
however, behaves a bit differently.

In the nap phase, we just sleep until enough time/segments have passed,
where enough is defined by checkpoint_nap_percent. However, if we're
already past checkpoint_write_percent at the beginning of the nap, I
think we should clamp the nap time so that we don't run out of time
before the next checkpoint because of sleeping.

In the sync phase, we sleep between each fsync until enough
time/segments have passed, assuming that the time to fsync is
proportional to the file length. I'm not sure that's a very good
assumption. We might have one huge file with only very little changed
data, for example a logging table that is just occasionally appended to.
If we begin by fsyncing that, it'll take a very short time to finish,
and we'll then sleep for a long time. If we then have another large file
to fsync, but that one has all pages dirty, we risk running out of time
because of the unnecessarily long sleep. The segmentation of relations
limits the risk of that, though, by limiting the max. file size, and I
don't really have any better suggestions.

In the write phase, bgwriter_all_maxpages is also factored into the
sleeps. On each iteration, we write bgwriter_all_maxpages pages and then
we sleep for bgwriter_delay msecs. checkpoint_write_percent only
controls the maximum amount of time we try to spend in the write phase; we
skip the sleeps if we're exceeding checkpoint_write_percent, but it can
very well finish earlier. IOW, bgwriter_all_maxpages is the *minimum*
number of pages to write between sleeps. If it's not set, we use
WRITERS_PER_ABSORB, which is hardcoded to 1000.

The approach of writing min. N pages per iteration seems sound to me. By
setting N we can control the maximum impact of a checkpoint under normal
circumstances. If there's very little work to do, it doesn't make sense
to stretch the write of say 10 buffers across a 15 min period; it's
indeed better to finish the checkpoint earlier. It's similar to
vacuum_cost_limit in that sense. But using bgwriter_all_maxpages for it
doesn't feel right; we should at least name it differently. The default
of 1000 is a bit high as well; with the default bgwriter_delay, that adds
up to 39MB/s. That's OK for a decent I/O subsystem, but the default
really should be something that will still leave room for other I/O on a
small single-disk server.
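
(That 39MB/s figure is just the arithmetic of the defaults: 1000 pages *
8kB per page every 200ms of bgwriter_delay works out to about 41MB, or
roughly 39MiB, per second.)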

Should we try doing something similar for the sync phase? If there's
only 2 small files to fsync, there's no point sleeping for 5 minutes
between them just to use up the checkpoint_sync_percent budget.

Should we give a warning if you set the *_percent settings so that they
exceed 100%?

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Re: Load distributed checkpoint V4

From
Greg Smith
Date:
On Thu, 19 Apr 2007, Heikki Linnakangas wrote:

> In the sync phase, we sleep between each fsync until enough time/segments
> have passed, assuming that the time to fsync is proportional to the file
> length. I'm not sure that's a very good assumption.

I've been making scatter plots of fsync time vs. amount written to the
database for a couple of months now, and while there's a trend there it's
not a linear one based on data written.  Under Linux, to make a useful
prediction about how long an fsync will take, you first need to consider how
much dirty data is already in the OS cache (the "Dirty:" figure in
/proc/meminfo) before the write begins, relative to the kernel parameters
that control write behavior.  Combine that with some knowledge of the
caching behavior of the controller/disk combination you're using, and it's
just barely possible to make a reasonable estimate.  Any less information
than all that and you really have very little basis on which to guess how
long it's going to take.
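
Grabbing that figure for instrumentation is easy enough; a quick sketch
(Linux-only, obviously, and nothing PostgreSQL-specific about it):

    #include <stdio.h>
    #include <string.h>

    /* Return the "Dirty:" value from /proc/meminfo in kB, or -1 on failure. */
    static long
    linux_dirty_kb(void)
    {
        FILE   *f = fopen("/proc/meminfo", "r");
        char    line[128];
        long    kb = -1;

        if (f == NULL)
            return -1;
        while (fgets(line, sizeof(line), f) != NULL)
        {
            if (strncmp(line, "Dirty:", 6) == 0)
            {
                sscanf(line + 6, "%ld", &kb);
                break;
            }
        }
        fclose(f);
        return kb;
    }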

Other operating systems are going to give completely different behavior
here, which of course makes the problem even worse.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD