
Re: Checkpoint cost, looks like it is WAL/CRC - Mailing list pgsql-hackers

From	Bruce Momjian
Subject	Re: Checkpoint cost, looks like it is WAL/CRC
Date	July 6, 2005 19:23:09
Msg-id	200507062222.j66MMcm04984@candle.pha.pa.us
In response to	Re: Checkpoint cost, looks like it is WAL/CRC (Simon Riggs <simon@2ndquadrant.com>)
Responses	Re: Checkpoint cost, looks like it is WAL/CRC Re: Checkpoint cost, looks like it is WAL/CRC Re: Checkpoint cost, looks like it is WAL/CRC
List	pgsql-hackers


Simon Riggs wrote:
> On Wed, 2005-06-29 at 23:23 -0400, Tom Lane wrote:
> > Josh Berkus <josh@agliodbs.com> writes:
> > >> Uh, what exactly did you cut out?  I suggested dropping the dumping of
> > >> full page images, but not removing CRCs altogether ...
> > 
> > > Attached is the patch I used.
> > 
> > OK, thanks for the clarification.  So it does seem that dumping full
> > page images is a pretty big hit these days.  
> 
> Yes the performance results are fairly damning. That's a shame, I
> convinced myself that the CRC32 and block-hole compression was enough.
> 
> The 50% performance gain isn't the main thing for me. The 10 sec drop in
> response time immediately after checkpoint is the real issue. Most sites
> are looking for good response as an imperative, rather than throughput.

Yep.

> No defense required. As you say, it was the best idea at the time.
> 
> > It seems like we have two basic alternatives:
> > 
> > 1. Offer a GUC to turn off full-page-image dumping, which you'd use only
> > if you really trust your hardware :-(
> > 
> > 2. Think of a better defense against partial-page writes.
> > 
> > I like #2, or would if I could think of a better defense.  Ideas anyone?
> 
> Well, I'm all for #2 if we can think of one that will work. I can't.
> 
> Option #1 seems like the way forward, but I don't think it is
> sufficiently safe just to have the option to turn things off.

Well, I added #1 yesterday as 'full_page_writes', and it has the same
warnings as fsync (namely, on crash, be prepared to recover or check
your system thoroughly).

As far as #2, my posted proposal was to write the full pages to WAL when
they are written to the file system, and not when they are first
modified in the shared buffers --- the goal being that it will even out
the load, and it will happen in a non-critical path, hopefully by the
background writer or at checkpoint time.
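
A toy sketch of the timing difference, for illustration only. This is not
PostgreSQL code: the class ToyBufferPool, its methods, and the rule of one
image per flush of a dirty page are all assumptions made up for the example.
It only records in which path (foreground modify or background flush) a
full-page image would be emitted, and makes no claim about recovery
correctness.

```python
class ToyBufferPool:
    """Toy model: records when full-page images (FPIs) would go to WAL."""

    def __init__(self, fpi_on_flush):
        # False: log an FPI at the first modification after a checkpoint.
        # True:  log an FPI when the dirty page is written out (assumed rule).
        self.fpi_on_flush = fpi_on_flush
        self.dirty = set()    # pages modified but not yet flushed
        self.imaged = set()   # pages imaged since the last checkpoint
        self.wal = []         # (path, kind, page) entries, in order

    def modify(self, page):
        # Foreground path: a backend changes a page in shared buffers.
        if not self.fpi_on_flush and page not in self.imaged:
            self.wal.append(("modify", "FPI", page))
            self.imaged.add(page)
        self.wal.append(("modify", "delta", page))
        self.dirty.add(page)

    def flush(self, page):
        # Background path: background writer or checkpoint writes the page.
        if page not in self.dirty:
            return
        if self.fpi_on_flush:
            self.wal.append(("flush", "FPI", page))
        self.dirty.discard(page)

    def checkpoint(self):
        for page in sorted(self.dirty):
            self.flush(page)
        self.imaged.clear()


def fpis_by_path(pool):
    """Count FPIs emitted in the modify path and in the flush path."""
    counts = {"modify": 0, "flush": 0}
    for path, kind, _page in pool.wal:
        if kind == "FPI":
            counts[path] += 1
    return counts


def run(fpi_on_flush):
    """Modify three pages (two of them twice), checkpoint, report FPI counts."""
    pool = ToyBufferPool(fpi_on_flush)
    for page in (1, 2, 3, 1, 2):
        pool.modify(page)
    pool.checkpoint()
    return fpis_by_path(pool)
```

In this toy, run(False) puts all three images in the modify path and
run(True) puts all three in the flush path. The volume is the same in this
workload; only the place where the cost lands changes.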

> With wal_changed_pages= off *any* crash would possibly require an
> archive recovery, or a replication rebuild. It's good that we now have
> PITR, but we do also have other options for availability. Users of
> replication could well be amongst the first to try out this option. 

Seems it is similar to fsync in risk, which is not a new option.

> The problem is that you just wouldn't *know* whether the possibly was
> yes or no. The temptation would be to assume "no" and just continue,
> which could lead to data loss. And that would lead to a lack of trust in
> PostgreSQL and eventual reputational loss. Would I do an archive
> recovery, or would I trust that RAID array had written everything
> properly? With an irate Web Site Manager saying "you think? it might?
> maybe? You mean you don't know???"

That is a serious problem, but the same problem we have in turning off
fsync.

> During recovery, if a full page image is not available, we would read
> the page from the database and check that the first and last LSNs match.
> If they do, then the page is not torn and recovery can be successful. If
> they do not match, then we attempt to continue recovery, but issue a
> warning that torn page has been detected and a full archive recovery is
> recommended. It is likely that the recovery itself will fail almost
> immediately following this, since changes will try to be made to a page
> in the wrong state to receive it, but there's no harm in trying....

I like the idea of checking the page during recovery so we don't have to
check all the pages, just certain pages.
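
A minimal sketch of that check, under loudly stated assumptions: the page
layout here is hypothetical, with the same 8-byte LSN stamped at both the
start and the end of an 8192-byte page. The names make_page, is_torn, and
simulate_torn_write are invented for the example and are not PostgreSQL
functions.

```python
import struct

PAGE_SIZE = 8192   # assumed block size for the toy layout
LSN_FMT = "<Q"     # assumed: 8-byte little-endian LSN stamp
LSN_LEN = struct.calcsize(LSN_FMT)


def make_page(lsn, payload=b""):
    """Build a toy page with the same LSN stamped at both ends."""
    body_len = PAGE_SIZE - 2 * LSN_LEN
    if len(payload) > body_len:
        raise ValueError("payload too large for page")
    stamp = struct.pack(LSN_FMT, lsn)
    return stamp + payload.ljust(body_len, b"\x00") + stamp


def is_torn(page):
    """Return True if the leading and trailing LSN stamps disagree."""
    if len(page) != PAGE_SIZE:
        raise ValueError("unexpected page size")
    (first,) = struct.unpack_from(LSN_FMT, page, 0)
    (last,) = struct.unpack_from(LSN_FMT, page, PAGE_SIZE - LSN_LEN)
    return first != last


def simulate_torn_write(old_page, new_page, bytes_written):
    """Model a crash after only the first bytes_written bytes reached disk."""
    return new_page[:bytes_written] + old_page[bytes_written:]
```

Usage: with old = make_page(100) and new = make_page(200),
is_torn(simulate_torn_write(old, new, 4096)) is true, while a write that
completed (or never started) is reported as intact. Matching stamps only
show that the first and last bytes came from the same write; this sketch
would not notice a stale sector in the middle of the page if the device
writes sectors out of order.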

> Like this specific idea or not, I'm saying that we need a tell-tale: a
> way of knowing whether we have a torn page, or not. That way we can
> safely continue to rely upon crash recovery.
> 
> Tom, I think you're the only person that could or would be trusted to
> make such a change. Even past the 8.1 freeze, I say we need to do
> something now on this issue.

I think if we document full_page_writes as similar to fsync in risk, we
are OK for 8.1, but if something can be done easily, it sounds good.

Now that we have a GUC we can experiment with the full page write load
and see how it can be improved.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
