Re: Avoiding adjacent checkpoint records - Mailing list pgsql-hackers

From Robert Haas
Subject Re: Avoiding adjacent checkpoint records
Date
Msg-id CA+TgmoZoLnowRpi0CgTmghfmwE4uquj38ZvJO2MswAR21JX6Xg@mail.gmail.com
In response to Avoiding adjacent checkpoint records  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Avoiding adjacent checkpoint records
List pgsql-hackers
On Wed, Jun 6, 2012 at 3:08 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> In commit 18fb9d8d21a28caddb72c7ffbdd7b96d52ff9724, Simon modified the
> rule for when to skip checkpoints on the grounds that not enough
> activity has happened since the last one.  However, that commit left the
> comment block about it in a nonsensical state:
>
>    * If this isn't a shutdown or forced checkpoint, and we have not switched
>    * to the next WAL file since the start of the last checkpoint, skip the
>    * checkpoint.  The idea here is to avoid inserting duplicate checkpoints
>    * when the system is idle. That wastes log space, and more importantly it
>    * exposes us to possible loss of both current and previous checkpoint
>    * records if the machine crashes just as we're writing the update.
>    * (Perhaps it'd make even more sense to checkpoint only when the previous
>    * checkpoint record is in a different xlog page?)

IIRC, the inspiration for the change was that we were getting a
never-ending series of checkpoints even when nothing was happening at
all:

http://archives.postgresql.org/pgsql-hackers/2011-10/msg00207.php

I felt (and still feel) that this was misguided.  I understand why
people don't want a completely idle system to checkpoint; but I can't
recall a single complaint about a checkpoint on a system with low but
not zero activity.  Checkpoints are pretty cheap when there isn't much
data to flush.  The flip side is that I know of real customers who
would have suffered real data loss had this code been present in the
server version they were using.  Checkpoints are the *only* mechanism
by which SLRU pages get flushed to disk on a mostly-idle system.  That
means if something happens to your pg_xlog directory, and you haven't
had a checkpoint, you're screwed.  Letting data sit in memory for
hours, days, weeks, or months because we haven't filled up a WAL
segment is just terrible.  The first user who loses a transaction that
was committed a month ago after running pg_resetxlog is going to hit
the ceiling, and I don't blame them.

It wouldn't be so bad if we had background writing for SLRU pages,
because then you could figure that the OS would eventually have a
chance to write the page out... but we don't.  It'll just sit there in
shared memory, dirty, forever.  CLOG data in particular is FAR too
precious to take that kind of chance with.

I don't think there's much sense in doing push-ups to keep the
current and previous checkpoint records off the same XLOG page.  If
the system is so nearly idle that you get two checkpoint records in
the same 8k block, and that block gets corrupted, it is extremely
likely that you can run pg_resetxlog and be OK.  If not, that means
there were more XLOG records after the corrupted page, and you're not
going to be able to replay those anyway, whether the checkpoint
records are in the same 8k block or not.  So I'm not seeing how your
proposal is buying us any additional measure of safety that we don't
already have.  Of course, if we had a way to skip over the corrupted
portion of WAL and pick up replaying records after that, that would be
very useful (even though you'd have to view the resulting database
with extreme suspicion), but without that I don't see that finding the
previous checkpoint record is doing much for us.  Either way, you've
potentially lost changes that were covered by WAL records emitted
after the most recent checkpoint.  The only thing we can really do is
make sure that there aren't likely to be too many more unreplayed
records after the last checkpoint segment, which goes back to my
previous complaint.
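The scenario above, two checkpoint records landing in the same 8k block, comes down to simple arithmetic on WAL byte positions. A minimal sketch follows; the function name and the flat 64-bit LSN representation are assumptions for illustration, not PostgreSQL's actual API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* PostgreSQL's WAL page size (default XLOG_BLCKSZ). */
#define XLOG_BLCKSZ 8192

/*
 * Two WAL locations share an 8k WAL page exactly when integer division
 * by the page size yields the same page number.  If both checkpoint
 * records satisfy this and that one page is corrupted, both are lost
 * together -- the case discussed above.
 */
static bool
same_xlog_page(uint64_t lsn_a, uint64_t lsn_b)
{
    return (lsn_a / XLOG_BLCKSZ) == (lsn_b / XLOG_BLCKSZ);
}
```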

As a side point, another reason not to make the checkpoint record
consume the rest of the page is that, for scalability reasons, we want
to minimize the amount of calculation that has to be done while
holding WALInsertLock, and have as much of the computation as possible
get done before acquiring it.  XLogInsert() is already way more
complicated than anything anyone ought to be doing while holding a
heavily-contended LWLock.
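The scalability pattern described here, doing per-record computation before acquiring the contended lock so the critical section is as short as possible, can be sketched roughly as below. This is a toy illustration, not XLogInsert(): the buffer layout, the checksum stand-in, and the use of a pthread mutex in place of an LWLock are all assumptions.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define BUF_SIZE 65536

static pthread_mutex_t wal_insert_lock = PTHREAD_MUTEX_INITIALIZER;
static uint8_t wal_buffer[BUF_SIZE];
static size_t insert_pos = 0;

/* Cheap stand-in for the record checksum (the real code uses CRC-32). */
static uint32_t
cheap_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = sum * 31 + data[i];
    return sum;
}

/*
 * Compute everything we can (here, the checksum) before taking the
 * lock; the critical section is then only a copy and a pointer bump.
 * Returns the position the record was inserted at.
 */
static size_t
wal_insert(const uint8_t *rec, size_t len, uint32_t *crc_out)
{
    uint32_t crc = cheap_checksum(rec, len);    /* outside the lock */

    pthread_mutex_lock(&wal_insert_lock);
    size_t pos = insert_pos;
    memcpy(wal_buffer + pos, rec, len);
    insert_pos += len;
    pthread_mutex_unlock(&wal_insert_lock);

    *crc_out = crc;
    return pos;
}
```

Padding a checkpoint record out to the end of the page would mean computing the pad length from the current insert position, which is only known while the lock is held, working against this separation.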

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

