recovering from "found xmin ... from before relfrozenxid ..." - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | recovering from "found xmin ... from before relfrozenxid ..." |
Date | |
Msg-id | CA+TgmoZW1fsU-QUNCRUQMGUygBDPVeOTLCqRdQZch=EYZnctSA@mail.gmail.com Whole thread Raw |
Responses |
Re: recovering from "found xmin ... from before relfrozenxid ..."
Re: recovering from "found xmin ... from before relfrozenxid ..." Re: recovering from "found xmin ... from before relfrozenxid ..." Re: recovering from "found xmin ... from before relfrozenxid ..." Re: recovering from "found xmin ... from before relfrozenxid ..." |
List | pgsql-hackers |
Hi, A number of EDB customers have had this error crop on their tables for reasons that we have usually not been able to determine. In many cases, it's probably down to things like running buggy old releases for a long time before upgrading, or bad backup and recovery procedures. It's more than possible that there are still-unfixed server bugs, but I do not have any compelling evidence of such bugs at this time. Unfortunately, once you're in this situation, it's kind of hard to find your way out of it. There are a few problems: 1. There's nothing to identify the tuple that has the problem, and no way to know how many more of them there might be. Back-patching b61d161c146328ae6ba9ed937862d66e5c8b035a would help with the first part of this. 2. In some other, similar situations, e.g. where the tuple data is garbled, it's often possible to get out from under the problem by deleting the tuple at issue. But I think that doesn't necessarily fix anything in this case. 3. We've had some success with using a PL/plgsql loop with an EXCEPTION block to extract all the accessible tuples from the table. Then you can truncate the original table and reinsert the data. But this is slow, so it stinks if the table is big, and it's not a viable approach if the table in question is a system catalog table -- at least if it's not if it's something critical like pg_class. I realize somebody's probably going to say "well, you shouldn't try to repair a database that's in this state, you shouldn't let it happen in the first place, and if it does happen, you should track the root cause to the ends of the earth." But I think that's a completely impractical approach. I at least have no idea how I'm supposed to figure out when and how a bad relfrozenxid ended up in the table, and by the time the problem is discovered after an upgrade the problem that caused it may be quite old. Moreover, not everyone is as interested in an extended debugging exercise as they are in getting the system working again, and VACUUM failing repeatedly is a pretty serious problem. Therefore, one of my colleagues has - at my request - created a couple of functions called heap_force_kill() and heap_force_freeze() which take an array of TIDs. The former truncates them all to dead line pointers. The latter resets the infomask and xmin to make the xmin frozen. (It should probably handle the xmax too; not sure that the current version does that, but it's easily fixed if not.) The intention is that you can use these to get either get rid of, or get access to, tuples whose visibility information is corrupted for whatever reason. These are pretty sharp tools; you could corrupt a perfectly-good table by incautious use of them, or destroy a large amount of data. You could, for example, force-freeze a tuple created by a transaction which added a column, inserted data, and rolled back; that would likely be disastrous. However, in the cases that I'm thinking about, disaster has already struck, and something that you can use to get things back to a saner state is better than just leaving the table perpetually broken. Without something like this, the backup plan is probably to shut down the server and try to edit the pages using a perl script or something, but that seems clearly worse. So I have these questions: - Do people think it would me smart/good/useful to include something like this in PostgreSQL? - If so, how? I would propose a new contrib module that we back-patch all the way, because the VACUUM errors were back-patched all the way, and there seems to be no advantage in making people wait 5 years for a new version that has some kind of tooling in this area. - Any ideas for additional things we should include, or improvements on the sketch above? Thanks, -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: