Thread: Where art thou pg_clog?

Where art thou pg_clog?

From
Casey Duncan
Date:
We have a production system with multiple identical database
instances on the same hardware, with the same configuration, running
databases with the exact same schema. They each have different data,
but the database sizes and load patterns are almost exactly the same.

We are running pg 8.1.5 (upgraded the day before 8.1.6 came out, oh
well ;^) and since then we have noticed the following error on two of
the servers:

2007-02-15 00:35:03.324 PST ERROR:  could not access status of
transaction 2565134864
2007-02-15 00:35:03.325 PST DETAIL:  could not open file "pg_clog/
098E": No such file or directory

The first time this happened, I chalked it up to some kind of disk
corruption based on the mailing list archives. So I dumped the
databases, did a fresh initdb, forced an fsck (these run with a jfs
data partition and an ext2 wal partition) which found no problems and
then reloaded the databases.

Now about a week later I see the same problem on different server. We
never saw this problem running 8.1.3 on these same machines over many
months, so I'm beginning to get suspect that something we changed
since running 8.1.3 is to blame. Before the upgrade these systems ran
postgres 8.1.3 and slony 1.1.5. Now they run postgres 8.1.5 and slony
1.2.6 (I don't know that the slony version is important, I add it
here for completeness). Nothing else important has changed on these
boxes. I see the 8.1.8 is out now, though nothing I see in the
release notes seems relevant to this issue.

Here are some specific things I'd like to know:

1. Is it possible to "fix" this problem without an dumpall/initdb/
restore. That takes many hours and can only be done when I'm supposed
to be at home relaxing (yeah right) ;^) FWIW, the system is
functioning fine right now from what I can tell, save the above
errors in the log every few minutes.

2. What more info can I give to figure out the "cause" of this. Are
there files I can inspect to find out more?

3. Is it possible that this is a side-affect of the upgrade to 8.1.5?

Thanks for any insights,

-Casey

Re: Where art thou pg_clog?

From
Alvaro Herrera
Date:
Casey Duncan wrote:
> We have a production system with multiple identical database
> instances on the same hardware, with the same configuration, running
> databases with the exact same schema. They each have different data,
> but the database sizes and load patterns are almost exactly the same.
>
> We are running pg 8.1.5 (upgraded the day before 8.1.6 came out, oh
> well ;^) and since then we have noticed the following error on two of
> the servers:
>
> 2007-02-15 00:35:03.324 PST ERROR:  could not access status of
> transaction 2565134864
> 2007-02-15 00:35:03.325 PST DETAIL:  could not open file "pg_clog/
> 098E": No such file or directory

Can you relate it to autovacuum?

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Where art thou pg_clog?

From
Peter Eisentraut
Date:
Casey Duncan wrote:
> 2007-02-15 00:35:03.324 PST ERROR:  could not access status of
> transaction 2565134864
> 2007-02-15 00:35:03.325 PST DETAIL:  could not open file "pg_clog/
> 098E": No such file or directory
>
> The first time this happened, I chalked it up to some kind of disk
> corruption based on the mailing list archives. So I dumped the
> databases, did a fresh initdb, forced an fsck (these run with a jfs
> data partition and an ext2 wal partition) which found no problems and
> then reloaded the databases.
>
> Now about a week later

Unless you actually executed 2565134864 transactions in that one week,
this is still data corruption.

Check for faulty memory.

--
Peter Eisentraut
http://developer.postgresql.org/~petere/

Re: Where art thou pg_clog?

From
Casey Duncan
Date:
On Feb 15, 2007, at 1:50 PM, Peter Eisentraut wrote:

> Casey Duncan wrote:
>> 2007-02-15 00:35:03.324 PST ERROR:  could not access status of
>> transaction 2565134864
>> 2007-02-15 00:35:03.325 PST DETAIL:  could not open file "pg_clog/
>> 098E": No such file or directory
>>
>> The first time this happened, I chalked it up to some kind of disk
>> corruption based on the mailing list archives. So I dumped the
>> databases, did a fresh initdb, forced an fsck (these run with a jfs
>> data partition and an ext2 wal partition) which found no problems and
>> then reloaded the databases.
>>
>> Now about a week later
>
> Unless you actually executed 2565134864 transactions in that one week,
> this is still data corruption.
>
> Check for faulty memory.

I'd be more inclined to agree with you if it happened on only one
server machine. But this has now happened on two different machines
in the space of a week.

My understanding is that the transaction id logged is garbage because
the bookkeeping fields have been clobbered for some tuple(s). The one
last week was really low (like < 1000).

-Casey

Re: Where art thou pg_clog?

From
Casey Duncan
Date:
On Feb 15, 2007, at 1:46 PM, Alvaro Herrera wrote:

> Casey Duncan wrote:
>> We have a production system with multiple identical database
>> instances on the same hardware, with the same configuration, running
>> databases with the exact same schema. They each have different data,
>> but the database sizes and load patterns are almost exactly the same.
>>
>> We are running pg 8.1.5 (upgraded the day before 8.1.6 came out, oh
>> well ;^) and since then we have noticed the following error on two of
>> the servers:
>>
>> 2007-02-15 00:35:03.324 PST ERROR:  could not access status of
>> transaction 2565134864
>> 2007-02-15 00:35:03.325 PST DETAIL:  could not open file "pg_clog/
>> 098E": No such file or directory
>
> Can you relate it to autovacuum?

Maybe. Here's what I get when I crank up the logging to debug4:

2007-02-15 14:20:48.771 PST DEBUG:  StartTransaction
2007-02-15 14:20:48.771 PST DEBUG:  name: unnamed; blockState:
DEFAULT; state: INPROGR, xid/subid/cid: 3429052708/1/0, nestlvl: 1,
children: <>
2007-02-15 14:20:48.771 PST DEBUG:  vacuuming "pg_catalog.pg_statistic"
2007-02-15 14:20:48.771 PST ERROR:  could not access status of
transaction 2565134864
2007-02-15 14:20:48.772 PST DETAIL:  could not open file "pg_clog/
098E": No such file or directory
2007-02-15 14:20:48.772 PST DEBUG:  proc_exit(0)
2007-02-15 14:20:48.772 PST DEBUG:  shmem_exit(0)
2007-02-15 14:20:48.773 PST DEBUG:  exit(0)
2007-02-15 14:20:48.775 PST DEBUG:  reaping dead processes

does that imply that it is the pg_statistic table that is hosed?

Interestingly I can manually vacuum that table in all of the
databases on this machine without provoking the error.

-Casey


Re: Where art thou pg_clog?

From
Alvaro Herrera
Date:
Casey Duncan wrote:
>
> On Feb 15, 2007, at 1:46 PM, Alvaro Herrera wrote:
>
> >Casey Duncan wrote:
> >>We have a production system with multiple identical database
> >>instances on the same hardware, with the same configuration, running
> >>databases with the exact same schema. They each have different data,
> >>but the database sizes and load patterns are almost exactly the same.
> >>
> >>We are running pg 8.1.5 (upgraded the day before 8.1.6 came out, oh
> >>well ;^) and since then we have noticed the following error on two of
> >>the servers:
> >>
> >>2007-02-15 00:35:03.324 PST ERROR:  could not access status of
> >>transaction 2565134864
> >>2007-02-15 00:35:03.325 PST DETAIL:  could not open file "pg_clog/
> >>098E": No such file or directory
> >
> >Can you relate it to autovacuum?
>
> Maybe. Here's what I get when I crank up the logging to debug4:
>
> 2007-02-15 14:20:48.771 PST DEBUG:  StartTransaction
> 2007-02-15 14:20:48.771 PST DEBUG:  name: unnamed; blockState:
> DEFAULT; state: INPROGR, xid/subid/cid: 3429052708/1/0, nestlvl: 1,
> children: <>
> 2007-02-15 14:20:48.771 PST DEBUG:  vacuuming "pg_catalog.pg_statistic"
> 2007-02-15 14:20:48.771 PST ERROR:  could not access status of
> transaction 2565134864
> 2007-02-15 14:20:48.772 PST DETAIL:  could not open file "pg_clog/
> 098E": No such file or directory
> 2007-02-15 14:20:48.772 PST DEBUG:  proc_exit(0)
> 2007-02-15 14:20:48.772 PST DEBUG:  shmem_exit(0)
> 2007-02-15 14:20:48.773 PST DEBUG:  exit(0)
> 2007-02-15 14:20:48.775 PST DEBUG:  reaping dead processes
>
> does that imply that it is the pg_statistic table that is hosed?
>
> Interestingly I can manually vacuum that table in all of the
> databases on this machine without provoking the error.

Except template0 I presume?  Is this autovacuum running in template0
perchance?  I note that 800 million transactions have passed since the
Xid in the error message was current.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Where art thou pg_clog?

From
Casey Duncan
Date:
On Feb 15, 2007, at 2:44 PM, Alvaro Herrera wrote:

> Casey Duncan wrote:
>>
>> On Feb 15, 2007, at 1:46 PM, Alvaro Herrera wrote:
>>> [..]
>>> Can you relate it to autovacuum?
>>
>> Maybe. Here's what I get when I crank up the logging to debug4:
>>
>> 2007-02-15 14:20:48.771 PST DEBUG:  StartTransaction
>> 2007-02-15 14:20:48.771 PST DEBUG:  name: unnamed; blockState:
>> DEFAULT; state: INPROGR, xid/subid/cid: 3429052708/1/0, nestlvl: 1,
>> children: <>
>> 2007-02-15 14:20:48.771 PST DEBUG:  vacuuming
>> "pg_catalog.pg_statistic"
>> 2007-02-15 14:20:48.771 PST ERROR:  could not access status of
>> transaction 2565134864
>> 2007-02-15 14:20:48.772 PST DETAIL:  could not open file "pg_clog/
>> 098E": No such file or directory
>> 2007-02-15 14:20:48.772 PST DEBUG:  proc_exit(0)
>> 2007-02-15 14:20:48.772 PST DEBUG:  shmem_exit(0)
>> 2007-02-15 14:20:48.773 PST DEBUG:  exit(0)
>> 2007-02-15 14:20:48.775 PST DEBUG:  reaping dead processes
>>
>> does that imply that it is the pg_statistic table that is hosed?
>>
>> Interestingly I can manually vacuum that table in all of the
>> databases on this machine without provoking the error.
>
> Except template0 I presume?  Is this autovacuum running in template0
> perchance?  I note that 800 million transactions have passed since the
> Xid in the error message was current.

Wouldn't you know it! A little farther back up in the log file:

2007-02-15 14:20:48.480 PST LOG:  autovacuum: processing database
"template0"
2007-02-15 14:20:48.480 PST DEBUG:  StartTransaction
2007-02-15 14:20:48.480 PST DEBUG:  name: unnamed; blockState:
DEFAULT; state: INPROGR, xid/subid/cid: 3429052629/1/0, nestlvl: 1,
children: <>
2007-02-15 14:20:48.481 PST DEBUG:  autovacuum: VACUUM FREEZE whole
database
2007-02-15 14:20:48.481 PST DEBUG:  CommitTransaction
2007-02-15 14:20:48.481 PST DEBUG:  name: unnamed; blockState:
STARTED; state: INPROGR, xid/subid/cid: 3429052629/1/0, nestlvl: 1,
children: <>

fwiw, I did a cluster-wide vacuum on 1/20/2007. Not sure if that has
any impact on anything, just thought I'd throw it out there.

I'm curious how template0 got stomped on. Certainly nothing's been
changing it. Of course it might just be some random bug so the fact
it landed on a file for template0 could be completely arbitrary.
Anyhow it does seem curious to me.

-Casey

Re: Where art thou pg_clog?

From
Alvaro Herrera
Date:
Casey Duncan wrote:

> >>Interestingly I can manually vacuum that table in all of the
> >>databases on this machine without provoking the error.
> >
> >Except template0 I presume?  Is this autovacuum running in template0
> >perchance?  I note that 800 million transactions have passed since the
> >Xid in the error message was current.
>
> Wouldn't you know it! A little farther back up in the log file:
>
> 2007-02-15 14:20:48.480 PST LOG:  autovacuum: processing database
> "template0"
> 2007-02-15 14:20:48.480 PST DEBUG:  StartTransaction
> 2007-02-15 14:20:48.480 PST DEBUG:  name: unnamed; blockState:
> DEFAULT; state: INPROGR, xid/subid/cid: 3429052629/1/0, nestlvl: 1,
> children: <>
> 2007-02-15 14:20:48.481 PST DEBUG:  autovacuum: VACUUM FREEZE whole
> database
> 2007-02-15 14:20:48.481 PST DEBUG:  CommitTransaction
> 2007-02-15 14:20:48.481 PST DEBUG:  name: unnamed; blockState:
> STARTED; state: INPROGR, xid/subid/cid: 3429052629/1/0, nestlvl: 1,
> children: <>

This is a bug we fixed in 8.1.7.  I suggest you update to the latest of
the 8.1 series, to get that fix among others.

To fix the problem, set pg_database.datallowconn=true for template0,
then connect to it and do a VACUUM FREEZE.  Then set datallowconn=false
again.


> I'm curious how template0 got stomped on. Certainly nothing's been
> changing it. Of course it might just be some random bug so the fact
> it landed on a file for template0 could be completely arbitrary.

The problem is that all databases are vacuumed every so many
transactions, to avoid Xid wraparound problems; even non connectable
databases.  The problem is that a bug in autovacuum caused that vacuum
operation to neglect using the FREEZE flag; this negligence makes it
leave non-permanent Xids in the tables, leading to the problem you're
seeing.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Where art thou pg_clog?

From
Casey Duncan
Date:
On Feb 15, 2007, at 5:21 PM, Alvaro Herrera wrote:

> Casey Duncan wrote:
>
>>>> Interestingly I can manually vacuum that table in all of the
>>>> databases on this machine without provoking the error.
>>>
>>> Except template0 I presume?  Is this autovacuum running in template0
>>> perchance?  I note that 800 million transactions have passed
>>> since the
>>> Xid in the error message was current.
>>
>> Wouldn't you know it! A little farther back up in the log file:
>>
>> 2007-02-15 14:20:48.480 PST LOG:  autovacuum: processing database
>> "template0"
>> 2007-02-15 14:20:48.480 PST DEBUG:  StartTransaction
>> 2007-02-15 14:20:48.480 PST DEBUG:  name: unnamed; blockState:
>> DEFAULT; state: INPROGR, xid/subid/cid: 3429052629/1/0, nestlvl: 1,
>> children: <>
>> 2007-02-15 14:20:48.481 PST DEBUG:  autovacuum: VACUUM FREEZE whole
>> database
>> 2007-02-15 14:20:48.481 PST DEBUG:  CommitTransaction
>> 2007-02-15 14:20:48.481 PST DEBUG:  name: unnamed; blockState:
>> STARTED; state: INPROGR, xid/subid/cid: 3429052629/1/0, nestlvl: 1,
>> children: <>
>
> This is a bug we fixed in 8.1.7.  I suggest you update to the
> latest of
> the 8.1 series, to get that fix among others.

ok, great.

> To fix the problem, set pg_database.datallowconn=true for template0,
> then connect to it and do a VACUUM FREEZE.  Then set
> datallowconn=false
> again.

Do you mean to do this after upgrading to 8.1.8? If I try than in
8.1.5, I get (unsurprisingly):

% psql -U postgres template0 -c "vacuum freeze"
ERROR:  could not access status of transaction 2565134864
DETAIL:  could not open file "pg_clog/098E": No such file or directory

>
>> I'm curious how template0 got stomped on. Certainly nothing's been
>> changing it. Of course it might just be some random bug so the fact
>> it landed on a file for template0 could be completely arbitrary.
>
> The problem is that all databases are vacuumed every so many
> transactions, to avoid Xid wraparound problems; even non connectable
> databases.  The problem is that a bug in autovacuum caused that vacuum
> operation to neglect using the FREEZE flag; this negligence makes it
> leave non-permanent Xids in the tables, leading to the problem you're
> seeing.

Ironically we were earlier bitten by the bug that autovacuum didn't
do the cluster-wide vacuum until too late. Now we got bitten by the
fact that did do the cluster-wide vacuum. Talk about damned-if-you-do-
and-damned-if-you-don't! 8^)

ok, this is a much better sounding explanation than "random data
corruption" ;^)

Thanks!

-Casey

Re: Where art thou pg_clog?

From
Alvaro Herrera
Date:
Casey Duncan wrote:
>
> On Feb 15, 2007, at 5:21 PM, Alvaro Herrera wrote:
>
> >Casey Duncan wrote:

> >To fix the problem, set pg_database.datallowconn=true for template0,
> >then connect to it and do a VACUUM FREEZE.  Then set
> >datallowconn=false
> >again.
>
> Do you mean to do this after upgrading to 8.1.8? If I try than in
> 8.1.5, I get (unsurprisingly):
>
> % psql -U postgres template0 -c "vacuum freeze"
> ERROR:  could not access status of transaction 2565134864
> DETAIL:  could not open file "pg_clog/098E": No such file or directory

Hum, yeah, I forgot to mention that you need to create the 098E pg_clog
segment for that to work at all :-)  Fill it with byte 0x55 till the
needed position, which is the bit pattern for "all transactions
committed".  I'd make sure to remove it manually after the freeze is
done, just in case!  (I think the system would remove it at next
checkpoint, but anyway.)

You can do it either after or before upgrading; it's the same.  The only
thing that changes in 8.1.7 is that an upcoming vacuum would not forget
the FREEZE.


> >>I'm curious how template0 got stomped on. Certainly nothing's been
> >>changing it. Of course it might just be some random bug so the fact
> >>it landed on a file for template0 could be completely arbitrary.
> >
> >The problem is that all databases are vacuumed every so many
> >transactions, to avoid Xid wraparound problems; even non connectable
> >databases.  The problem is that a bug in autovacuum caused that vacuum
> >operation to neglect using the FREEZE flag; this negligence makes it
> >leave non-permanent Xids in the tables, leading to the problem you're
> >seeing.
>
> Ironically we were earlier bitten by the bug that autovacuum didn't
> do the cluster-wide vacuum until too late. Now we got bitten by the
> fact that did do the cluster-wide vacuum. Talk about damned-if-you-do-
> and-damned-if-you-don't! 8^)

Heh :-)  Sorry, they are all my bugs.  I guess you should be throwing
stones at me or something.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

Re: Where art thou pg_clog?

From
Tom Lane
Date:
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Casey Duncan wrote:
>> I'm curious how template0 got stomped on.

> Heh :-)  Sorry, they are all my bugs.  I guess you should be throwing
> stones at me or something.

The pre-8.1 theory was that template0 is (supposed to be) cleanly frozen
and hence never needs vacuumed at all.  The post-8.1 theory is that
template0 gets autovacuumed when necessary to prevent wraparound, just
like every other database.  8.1 unfortunately is somewhere in the
middle, because under circumstances-I-don't-remember-at-the-moment,
autovacuum might decide to process template0 and then leave non-frozen
XIDs therein.  Which is a problem because the clog-truncation logic
didn't think it needed to consider template0 when deciding if old clog
segments could be thrown away.  We live and learn.

The real bottom line here, and one I'll reiterate every chance I get,
is that we don't make updates to back branches because we're too bored
to have anything else to do.  If you're on 8.1.5, and the current
release in that branch is 8.1.8, then you're missing some bug fixes
that are probably significant.

            regards, tom lane

Re: Where art thou pg_clog?

From
Casey Duncan
Date:
On Feb 15, 2007, at 5:50 PM, Alvaro Herrera wrote:

> Casey Duncan wrote:
>>
>> On Feb 15, 2007, at 5:21 PM, Alvaro Herrera wrote:
>>
>>> Casey Duncan wrote:
>
>>> To fix the problem, set pg_database.datallowconn=true for template0,
>>> then connect to it and do a VACUUM FREEZE.  Then set
>>> datallowconn=false
>>> again.
>>
>> Do you mean to do this after upgrading to 8.1.8? If I try than in
>> 8.1.5, I get (unsurprisingly):
>>
>> % psql -U postgres template0 -c "vacuum freeze"
>> ERROR:  could not access status of transaction 2565134864
>> DETAIL:  could not open file "pg_clog/098E": No such file or
>> directory
>
> Hum, yeah, I forgot to mention that you need to create the 098E
> pg_clog
> segment for that to work at all :-)  Fill it with byte 0x55 till the
> needed position, which is the bit pattern for "all transactions
> committed".  I'd make sure to remove it manually after the freeze is
> done, just in case!  (I think the system would remove it at next
> checkpoint, but anyway.)

That seems a bit scary to do on a running production server. Could I
get away with dropping the template0 database and loading one from
another identical pg instance (or a new one) or will that freak
things out?

-Casey

Re: Where art thou pg_clog?

From
Alvaro Herrera
Date:
Casey Duncan wrote:
>
> On Feb 15, 2007, at 5:50 PM, Alvaro Herrera wrote:

> >Hum, yeah, I forgot to mention that you need to create the 098E
> >pg_clog
> >segment for that to work at all :-)  Fill it with byte 0x55 till the
> >needed position, which is the bit pattern for "all transactions
> >committed".  I'd make sure to remove it manually after the freeze is
> >done, just in case!  (I think the system would remove it at next
> >checkpoint, but anyway.)
>
> That seems a bit scary to do on a running production server. Could I
> get away with dropping the template0 database and loading one from
> another identical pg instance (or a new one) or will that freak
> things out?

If you haven't modified template1 since initdb, you can recreate
template0 by just dropping it and copying from template1; then do a
VACUUM FREEZE and set datallowconn to false.  That's what initdb does.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: Where art thou pg_clog?

From
"Greg Sabino Mullane"
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: RIPEMD160


> The real bottom line here, and one I'll reiterate every chance I get,
> is that we don't make updates to back branches because we're too bored
> to have anything else to do.  If you're on 8.1.5, and the current
> release in that branch is 8.1.8, then you're missing some bug fixes
> that are probably significant.

Just as a data point, I came across this very same problem (corrupted
tuple header, invalid xlog file) on an 8.1.3 system that was NOT
running autovacuum (and never had). Amazingly, this occured on the
very same day as the original poster. The server was upgraded to 8.2.3,
after some creating-bogus-xlog-file pain to extract all the data, and
all is well again.

- --
Greg Sabino Mullane greg@turnstep.com
PGP Key: 0x14964AC8 200702221021
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8

-----BEGIN PGP SIGNATURE-----

iD8DBQFF3bW8vJuQZxSWSsgRA45cAKCQYcPdmqvNh9KRBGNsm/YjycmqFQCgzIil
nUXZs7wIJvkxs6RaBTW5cKA=
=V0YI
-----END PGP SIGNATURE-----