Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1 - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1 |
Date | |
Msg-id | CA+Tgmob84GcyXG5Hfzi55GG91AG49X3uLjU0dhd+5ju41UfiGQ@mail.gmail.com |
In response to | Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1 (Robert Haas <robertmhaas@gmail.com>) |
Responses |
Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of
transaction 1
Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1 Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1 |
List | pgsql-hackers |
On Thu, Jun 4, 2015 at 12:57 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jun 4, 2015 at 9:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Thanks for the review.
>
> Here's a new version.  I've fixed the things Alvaro and Noah noted,
> and some compiler warnings about set but unused variables.
>
> I also tested it, and it doesn't quite work as hoped.  If started on a
> cluster where oldestMultiXid is incorrectly set to 1, it starts up and
> indicates that the member wraparound guards are disabled.  But even
> after everything is fixed, they don't get enabled until after the next
> full restart.  I think that's because TruncateMultiXact() bails out
> too early, without calling DetermineSafeOldestOffset.
>
> My attempt at a quick fix for that problem didn't work out, so I'm
> posting this version for now to facilitate further review and testing.

Here's a new version with some more fixes and improvements:

- SetOffsetVacuumLimit was failing to set MultiXactState->oldestOffset
when the oldest offset became known if the now-known value happened to
be zero.  Fixed.

- SetOffsetVacuumLimit now logs useful information at the DEBUG1
level, so that you can see that it's doing what it's supposed to.

- TruncateMultiXact now calls DetermineSafeOldestOffset to adjust the
offsetStopLimit even if it can't truncate anything.  This seems
useless, but it's not, because it may be that the last checkpoint
advanced lastCheckpointedOldest from a bogus value (i.e. 1) to a real
value, and now we can actually set offsetStopLimit properly.

- TruncateMultiXact no longer calls find_multixact_start when there
are no remaining multixacts.  This is actually a completely separate
bug that goes all the way back to 9.3.0 and can potentially cause
TruncateMultiXact to remove every file in pg_multixact/offsets.
Restarting the cluster becomes impossible because TrimMultiXact barfs.

- TruncateMultiXact now logs a message if the oldest multixact does
not precede the earliest one on disk and is not equal to the next
multixact and yet does not exist.  The value of the log message is
that it discovered the bug mentioned in the previous line, so I think
it's earning its keep.

With this version, I'm able to see that when you start up a
9.3.latest+this patch with a cluster that has a bogus value of 1 in
relminmxid, datminmxid, and the control file, autovacuum vacuums
everything in sight, all the values get set back to the right thing,
and the next checkpoint enables the member-wraparound guards.  This
works with both autovacuum=on and autovacuum=off; the emergency
mechanism kicks in as intended.  We'll want to warn people with big
databases who upgrade to 9.3.0 - 9.3.4 via pg_upgrade that they may
want to pre-vacuum those tables before upgrading to avoid a vacuum
storm.  But generally I'm pretty happy with this: forcing those values
to get fixed so that we can guard against member-space wraparound
seems like the right thing to do.

So, to summarize, this patch does the following:

- Fixes the failure-to-start problems introduced in 9.4.2 in
complicated pg_upgrade scenarios.

- Prevents the new calls to find_multixact_start we added in 9.4.2
from happening during recovery, where they can only create failure
scenarios.  The call in TruncateMultiXact that has been there all
along is not eliminated, but now handles failure more gracefully.

- Fixes possible incorrect removal of every single
pg_multixact/offsets file when no multixacts exist; one file should be
kept.

- Forces aggressive autovacuuming when the control file's
oldestMultiXid doesn't point to a valid MultiXact and enables member
wraparound at the next checkpoint following the correction of that
problem.

Thanks,

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
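For readers following the TruncateMultiXact changes described above, here is a rough, self-contained sketch of the decision order the message implies. It is not code from the patch. Every name in it (the toy_ functions, the ToyAction enum, the TOY_MULTIS_PER_SEGMENT constant) is hypothetical. The segment size assumes the default 8 kB block size with 4-byte offsets and 32 pages per segment, and the comparisons ignore multixact ID wraparound, which the real code has to handle.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t MultiXactId;

/* Assumption: 2048 offsets per 8 kB page, 32 pages per SLRU segment. */
#define TOY_MULTIS_PER_SEGMENT (2048u * 32u)

typedef enum
{
    TOY_NOTHING_TO_TRUNCATE,    /* oldest precedes the earliest file on disk */
    TOY_NO_MULTIXACTS,          /* oldest == next: skip the lookup entirely */
    TOY_OLDEST_MISSING,         /* should exist but does not: log and skip */
    TOY_TRUNCATE                /* safe to remove segments before oldest */
} ToyAction;

/* Offsets segment that holds a given multixact (no wraparound handling). */
static uint32_t
toy_segment_of(MultiXactId multi)
{
    return multi / TOY_MULTIS_PER_SEGMENT;
}

/*
 * Decide what a truncation pass should do.  The order mirrors the
 * description in the message: bail out if oldest precedes what is on
 * disk, avoid looking up oldest when no multixacts remain, and report
 * (rather than fail on) an oldest multixact that cannot be found.
 */
static ToyAction
toy_plan_truncation(MultiXactId oldest, MultiXactId next,
                    uint32_t earliest_segment_on_disk,
                    bool oldest_exists_on_disk)
{
    if (toy_segment_of(oldest) < earliest_segment_on_disk)
        return TOY_NOTHING_TO_TRUNCATE;
    if (oldest == next)
        return TOY_NO_MULTIXACTS;
    if (!oldest_exists_on_disk)
        return TOY_OLDEST_MISSING;
    return TOY_TRUNCATE;
}

/*
 * Number of segments removed when truncating up to "oldest".  The
 * segment containing oldest is always kept, so at least one file
 * survives even when oldest == next.
 */
static uint32_t
toy_segments_removed(MultiXactId oldest, uint32_t earliest_segment_on_disk)
{
    uint32_t    keep = toy_segment_of(oldest);

    return keep > earliest_segment_on_disk ? keep - earliest_segment_on_disk : 0;
}
```

In this toy model a bogus oldest value of 1 lands in segment 0, so on a cluster whose earliest file is a later segment the pass reports nothing to truncate, which is the case where the message says DetermineSafeOldestOffset should still be called. The "no multixacts" branch is what keeps the lookup from failing and keeps the last offsets file in place.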
Attachment