Thread: could not access status of transaction

could not access status of transaction

From
Tom Duffey
Date:
Hi All,

One of our databases suffered a problem yesterday during a normal update, something we have been doing for years.  Near the end of the process a foreign key constraint is rebuilt on a table containing several hundred million rows.  Rebuilding the constraint failed with the following message:

ERROR:  could not access status of transaction 4294918145
DETAIL:  Could not open file "pg_clog/0FFF": No such file or directory.

Here's the table and constraint definitions:

CREATE TABLE point_history (
    point_id integer NOT NULL,
    value real NOT NULL,
    status integer NOT NULL,
    "timestamp" timestamp without time zone NOT NULL
);

ALTER TABLE point_history ADD CONSTRAINT point_history_pkey PRIMARY KEY (point_id, "timestamp");
ALTER TABLE point_history ADD CONSTRAINT "$1" FOREIGN KEY (point_id) REFERENCES point(id);

I read about this and and created the pg_clog/0FFF file, filling it with 256K of zeroes and then vacuumed the database.  Then I tried rebuilding the constraint and received a foreign key violation:

DETAIL:  Key (point_id)=(2) is not present in table "point".

The crappy thing about this is that there was no record in the point_history table with point_id = 2:

db=> select * from point_history where point_id = 2;
 point_id | value | status | timestamp 
----------+-------+--------+-----------
(0 rows)

I scratched my head for a while and decided to reload the database from a backup, which I'm still working on now.  I'm wondering if anyone has any thoughts or ideas about this?  I found references to similar problems but they were all for older versions of PostgreSQL.  When the problem occurred we were running 8.3.6 and are now running 8.3.7.

Tom

--
Technology by Design :: http://techbydesign.com/
p: 414.431.0800

Re: could not access status of transaction

From
Tom Lane
Date:
Tom Duffey <tduffey@techbydesign.com> writes:
> One of our databases suffered a problem yesterday during a normal
> update, something we have been doing for years.  Near the end of the
> process a foreign key constraint is rebuilt on a table containing
> several hundred million rows.  Rebuilding the constraint failed with
> the following message:

> ERROR:  could not access status of transaction 4294918145
> DETAIL:  Could not open file "pg_clog/0FFF": No such file or directory.

This looks like a garden-variety data corruption problem to me.
Trashed rows tend to yield this type of error because the "xmin"
transaction ID is the first field that the server can check with
any amount of finesse.  4294918145 is FFFF4001 in hex, saith my
calculator, so it looks like a bunch of bits went to ones --- or
perhaps more likely, the row offset in the page header got clobbered
and we're looking at some bytes that never were a transaction ID
at all.

So I'd try looking around for flaky RAM, failing disks, loose cables,
that sort of thing ...

            regards, tom lane

Re: could not access status of transaction

From
Tom Duffey
Date:
Hi Tom,

On Mar 25, 2009, at 9:02 PM, Tom Lane wrote:

Tom Duffey <tduffey@techbydesign.com> writes:
One of our databases suffered a problem yesterday during a normal  
update, something we have been doing for years.  Near the end of the  
process a foreign key constraint is rebuilt on a table containing  
several hundred million rows.  Rebuilding the constraint failed with  
the following message:

ERROR:  could not access status of transaction 4294918145
DETAIL:  Could not open file "pg_clog/0FFF": No such file or directory.

This looks like a garden-variety data corruption problem to me.
Trashed rows tend to yield this type of error because the "xmin"
transaction ID is the first field that the server can check with
any amount of finesse.  4294918145 is FFFF4001 in hex, saith my
calculator, so it looks like a bunch of bits went to ones --- or
perhaps more likely, the row offset in the page header got clobbered
and we're looking at some bytes that never were a transaction ID
at all.

So I'd try looking around for flaky RAM, failing disks, loose cables,
that sort of thing ...

Are you aware of any issues like this related to VMWare ESX?  Our PostgreSQL server is running in such an environment and I asked the guys to review your email and they thought maybe this type of corruption could happen when the virtual machine was moved from one physical server to another, which we have done once or twice in the past few months.

Tom

--
Technology by Design :: http://techbydesign.com/
p: 414.431.0800