Thread: Missing clog, PITR

Missing clog, PITR

From
Patryk Sidzina
Date:
Hello everyone,

my company has been using pg_standby as a replication solution for a
while and it has been working great for our needs. Unfortunately, about
once a month we get the following error on the standby bases:

vacuumdb: vacuuming of database "xxxx" failed: ERROR:  could not access
status of transaction 3625953267
DETAIL:  Could not open file "pg_clog/0D81": No such file or directory.

I've read your solutions to similar problems posted on this mailing list
but our situation is a bit different. First, we cannot lose any data so
creating a zero-filled clog is out of the question (we did try that
though, but the error came up again with older clogs missing). Second, I
don't believe this is a memory problem (as suggested in other posts)
because this happens on two different machines (we have one master and
two slaves). The important part is that the master database never had
this problem, it only happens on the slaves.

So my questions are:
1) how do the clogs relate to wal shipping based replication? Clearly
the master doesn't need that clog but the slave does.

2) could 'pg_clearxlogtail' in archive_command be a cause of this? This
is our archive_command:
'cat %p | pg_clearxlogtail | gzip -c |
ssh slavehost "cat > /var/lib/postgresql/replication/in/%f"'

3) is there a faster way to debug this problem? Clogs fill slowly. It
takes about a month on a very busy production server for a clog to be
removed by master DB.


More info:
PostgreSQL 8.2.14 64-bit (though this happened in older versions also)
pg_standby from PostgreSQL 8.3.6


I'd be grateful for any ideas on this problem. If needed I can provide
more setup info.

--
Patryk Sidzina


Re: Missing clog, PITR

From
Richard Huxton
Date:
On 22/02/10 09:57, Patryk Sidzina wrote:
> Hello everyone,
>
> my company has been using pg_standby as a replication solution for a
> while and it has been working great for our needs. Unfortunately, about
> once a month we get the following error on the standby bases:
>
> vacuumdb: vacuuming of database "xxxx" failed: ERROR:  could not access
> status of transaction 3625953267
> DETAIL:  Could not open file "pg_clog/0D81": No such file or directory.

Is vacuum running on the standby? It's been a long day here, but I'm not
sure that makes sense.

--
   Richard Huxton
   Archonet Ltd

Re: Missing clog, PITR

From
Patryk Sidzina
Date:
> >
> > my company has been using pg_standby as a replication solution for a
> > while and it has been working great for our needs. Unfortunately, about
> > once a month we get the following error on the standby bases:
> >
> > vacuumdb: vacuuming of database "xxxx" failed: ERROR:  could not access
> > status of transaction 3625953267
> > DETAIL:  Could not open file "pg_clog/0D81": No such file or directory.
>
> Is vacuum running on the standby? It's been a long day here, but I'm not
> sure that makes sense.
>

You cannot vacuum the warm standby in PG 8.2.

--
Patryk Sidzina


Re: Missing clog, PITR

From
Greg Smith
Date:
Patryk Sidzina wrote:
> 1) how do the clogs relate to wal shipping based replication? Clearly
> the master doesn't need that clog but the slave does.
>
They should just be kept in sync.  There's some useful background on
this topic at
http://old.nabble.com/control-the-number-of-clog-files-and-xlog-files-td19173165.html

> 2) could 'pg_clearxlogtail' in archive_command be a cause of this? This
> is our archive_command:
> 'cat %p | pg_clearxlogtail | gzip -c |
> ssh slavehost "cat > /var/lib/postgresql/replication/in/%f"'
>

Kind of doubt that.  If pg_clearxlogtail were mangling your data, I'd
expect a more dramatic failure to restore.

> 3) is there a faster way to debug this problem? Clogs fill slowly. It
> takes about a month on a very busy production server for a clog to be
> removed by master DB.
>

You could create a bunch of transactions and then freeze things,
following the ideas in the reference I suggested above.

> More info:
> PostgreSQL 8.2.14 64-bit (though this happened in older versions also)
> pg_standby from PostgreSQL 8.3.6
>

There was a bug in this area fixed in 8.2.10:
http://www.postgresql.org/docs/8.2/static/release-8-2-10.html

"Fix potential miscalculation of datfrozenxid (Alvaro)

    *

      This error may explain some recent reports of failure to remove
      old pg_clog data."

If you were running this database with a version before that, I wonder
if maybe there's still some junk left behind from that old, buggy
version that's causing your issues.  You might try doing some manual
VACUUM or VACUUM FREEZE work to remove any lingering issues and then
re-create your standby systems afterwards.  I'm not quite familiar
enough with this specific bug to suggest a clearer resolution path, or
if in fact this is the same issue you're seeing.  It sure seems possible
they're related though.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us


Re: Missing clog, PITR

From
Patryk Sidzina
Date:
> > 1) how do the clogs relate to wal shipping based replication? Clearly
> > the master doesn't need that clog but the slave does.
> >
> They should just be kept in sync.  There's some useful background on
> this topic at
> http://old.nabble.com/control-the-number-of-clog-files-and-xlog-files-td19173165.html
>

The thing is they are in sync but somehow the slave needs more clogs
than the master, otherwise it won't start up.


> > 3) is there a faster way to debug this problem? Clogs fill slowly. It
> > takes about a month on a very busy production server for a clog to be
> > removed by master DB.
> >
>
> You could create a bunch of transactions and then freeze things,
> following the ideas in the reference I suggested above.

Thanks for the info, I will try it as soon as I can.

> > More info:
> > PostgreSQL 8.2.14 64-bit (though this happened in older versions also)
> > pg_standby from PostgreSQL 8.3.6
> >
>
> There was a bug in this area fixed in 8.2.10:
> http://www.postgresql.org/docs/8.2/static/release-8-2-10.html
>
> "Fix potential miscalculation of datfrozenxid (Alvaro)
>
>     *
>
>       This error may explain some recent reports of failure to remove
>       old pg_clog data."
>
> If you were running this database with a version before that, I wonder
> if maybe there's still some junk left behind from that old, buggy
> version that's causing your issues.  You might try doing some manual
> VACUUM or VACUUM FREEZE work to remove any lingering issues and then
> re-create your standby systems afterwards.  I'm not quite familiar
> enough with this specific bug to suggest a clearer resolution path, or
> if in fact this is the same issue you're seeing.  It sure seems possible
> they're related though.


We run VACUUM on the master db every night (and VACUUM FULL on weekends)
and test the standby db using LVM snapshots. If our test generates the
missing clog error we have to recreate the standby from scratch. So I
doubt there is any junk left over, but I will look into this bug more
carefully anyway. Again, thank you for your suggestions.


--
Patryk Sidzina