One of our databases suffered a problem yesterday during a normal
update, something we have been doing for years. Near the end of the
process a foreign key constraint is rebuilt on a table containing
several hundred million rows. Rebuilding the constraint failed with
the following message:
ERROR: could not access status of transaction 4294918145
DETAIL: Could not open file "pg_clog/0FFF": No such file or directory.
This looks like a garden-variety data corruption problem to me. Trashed rows tend to yield this type of error because the "xmin" transaction ID is the first field that the server can check with any amount of finesse. 4294918145 is FFFF4001 in hex, saith my calculator, so it looks like a bunch of bits went to ones --- or perhaps more likely, the row offset in the page header got clobbered and we're looking at some bytes that never were a transaction ID at all.
So I'd try looking around for flaky RAM, failing disks, loose cables, that sort of thing ...
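
For anyone who wants to check the arithmetic above, here is a minimal Python sketch. It converts the reported transaction ID to hex and derives which pg_clog segment file the server would go looking for. The constants are assumptions based on a default build (8 kB blocks, 2 status bits per transaction, 32 pages per segment file), and the helper name clog_segment_name is purely illustrative, not a PostgreSQL API.

```python
# Assumed defaults for a stock PostgreSQL build; adjust if yours differs.
BLCKSZ = 8192                  # bytes per page (assumed default)
XACTS_PER_BYTE = 4             # 2 status bits per transaction
PAGES_PER_SEGMENT = 32         # pages per pg_clog segment file (assumed)
XACTS_PER_SEGMENT = BLCKSZ * XACTS_PER_BYTE * PAGES_PER_SEGMENT


def clog_segment_name(xid: int) -> str:
    """Return the pg_clog file name expected to hold this xid's status."""
    return "%04X" % (xid // XACTS_PER_SEGMENT)


xid = 4294918145                          # value from the error message
xid_hex = format(xid, "08X")              # hex form of the transaction ID
bits_set = bin(xid).count("1")            # how many of the 32 bits are ones

print(xid_hex)                  # FFFF4001
print(bits_set)                 # 18
print(clog_segment_name(xid))   # 0FFF
```

Under those assumptions the segment name comes out as 0FFF, which matches the file named in the DETAIL line. That is consistent with the server simply trusting a bogus xmin value rather than a pg_clog file having gone missing.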
Are you aware of any issues like this related to VMware ESX? Our PostgreSQL server runs in such an environment. I asked the guys to review your email, and they thought this type of corruption might happen when the virtual machine is moved from one physical server to another, which we have done once or twice in the past few months.