Re: Hot Backup with rsync fails at pg_clog if under load - Mailing list pgsql-hackers

From	Florian Pflug
Subject	Re: Hot Backup with rsync fails at pg_clog if under load
Date	October 26, 2011 12:49:19
Msg-id	12E895D5-C6D0-4E71-B9C3-85E1BC14E6B8@phlo.org
In response to	Re: Hot Backup with rsync fails at pg_clog if under load (Florian Pflug <fgp@phlo.org>)
Responses	Re: Hot Backup with rsync fails at pg_clog if under load
List	pgsql-hackers

On Oct26, 2011, at 15:57 , Florian Pflug wrote:
> As you said, the CLOG page corresponding to nextXid
> *should* always be accessible at the start of recovery (unless the whole file
> has been removed by VACUUM, that is). So we shouldn't need to extend CLOG.
> Yet the error suggests that the CLOG is, in fact, too short. What I said
> is that we shouldn't apply any fix (for the CLOG problem) before we understand
> the reason for that apparent contradiction.

Ha! I think I've got a working theory.

In CreateCheckPoint(), we determine the nextXid that'll go into the checkpoint
record, and then call CheckPointGuts(), which does the actual writing and fsyncing.
So far, that's fine. If a transaction ID is assigned before we compute the
checkpoint's nextXid, we'll extend the CLOG accordingly, and CheckPointGuts() will
make sure the new CLOG page goes to disk.
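
To make "the CLOG page corresponding to nextXid" concrete, here is a small
Python sketch of the XID-to-page arithmetic. It assumes the default 8 kB block
size and two status bits per transaction, and it ignores XID wraparound. The
function names are made up for illustration and are not PostgreSQL's.

```python
# Sketch only: assumes the default block size and 2 status bits per transaction.
BLCKSZ = 8192                                      # bytes per page (assumed default)
CLOG_BITS_PER_XACT = 2
CLOG_XACTS_PER_BYTE = 8 // CLOG_BITS_PER_XACT      # 4 transactions per byte
CLOG_XACTS_PER_PAGE = BLCKSZ * CLOG_XACTS_PER_BYTE


def clog_page(xid: int) -> int:
    """Return the number of the CLOG page that holds the status of `xid`."""
    return xid // CLOG_XACTS_PER_PAGE


def starts_new_clog_page(xid: int) -> bool:
    """True if `xid` is the first transaction on its CLOG page,
    i.e. assigning it requires the CLOG to be extended by one page."""
    return xid % CLOG_XACTS_PER_PAGE == 0
```

Under these assumptions a page covers 32768 transactions, so the CLOG only
needs extending once every 32768 XIDs, which would explain why the problem
shows up only under load.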

But, if wal_level = hot_standby, we also call LogStandbySnapshot() in
CreateCheckPoint(), and we do that *after* CheckPointGuts(). Which would be
fine too, except that LogStandbySnapshot() re-assigns the *current* value of
ShmemVariableCache->nextXid to the checkpoint's nextXid field.

Thus, if the CLOG is extended after (or in the middle of) CheckPointGuts(), but
before LogStandbySnapshot(), then we end up with a nextXid in the checkpoint
whose CLOG page hasn't necessarily made it to disk yet. The longer CheckPointGuts()
takes to finish its work, the more likely this becomes (assuming that CLOG writing
and syncing doesn't happen at the very end). This fits the OP's observation of the
problem vanishing when pg_start_backup() does an immediate checkpoint.
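
The ordering problem can be sketched with a toy model in Python. This is not
PostgreSQL code: ToyServer, checkpoint() and clog_page_on_disk() are
hypothetical stand-ins, the 32768 transactions per page assume the default
8 kB block size, and the concurrent XID assignments are simply replayed
between the flush and the snapshot step.

```python
CLOG_XACTS_PER_PAGE = 32768  # assumes 8 kB pages, 2 status bits per transaction


def clog_page(xid):
    """CLOG page number holding the status of `xid` (wraparound ignored)."""
    return xid // CLOG_XACTS_PER_PAGE


class ToyServer:
    """Hypothetical stand-in for the shared nextXid counter plus the CLOG."""

    def __init__(self, next_xid):
        self.next_xid = next_xid
        page = clog_page(next_xid)
        self.pages_in_memory = {page}
        self.pages_on_disk = {page}

    def assign_xid(self):
        xid = self.next_xid
        self.pages_in_memory.add(clog_page(xid))  # "extend" the CLOG if needed
        self.next_xid = xid + 1
        return xid

    def flush_clog(self):
        self.pages_on_disk |= self.pages_in_memory


def checkpoint(server, xids_during_flush, standby_snapshot):
    """Return the nextXid that ends up in the (toy) checkpoint record."""
    recorded_next_xid = server.next_xid      # 1. nextXid is determined
    server.flush_clog()                      # 2. stand-in for CheckPointGuts()
    for _ in range(xids_during_flush):       #    XIDs assigned after the CLOG flush
        server.assign_xid()
    if standby_snapshot:                     # 3. stand-in for LogStandbySnapshot()
        recorded_next_xid = server.next_xid  #    overwrites with the current value
    return recorded_next_xid


def clog_page_on_disk(server, xid):
    return clog_page(xid) in server.pages_on_disk
```

Starting ten XIDs before a page boundary, for example
checkpoint(ToyServer(CLOG_XACTS_PER_PAGE - 10), 20, standby_snapshot=True),
records a nextXid whose CLOG page exists only in memory. With
standby_snapshot=False the recorded nextXid stays on the page that was flushed.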

I dunno how to fix this, though, since I don't really understand why
LogStandbySnapshot() needs to modify the checkpoint's nextXid. Simon, is there some
documentation on what assumptions the hot standby code makes about the various XID
fields included in a checkpoint?

best regards,
Florian Pflug
