Thread: pg_upgrade < 9.3 -> >=9.3 misses a step around multixacts
Hi,

When upgrading a < 9.3 cluster pg_upgrade doesn't bother to keep the old
multixacts around because they won't be read after the upgrade (and
aren't compatible). It just resets the new cluster's nextMulti to the
old + 1.

Unfortunately that means that there'll be an offsets/0000 file, created
by initdb, left around. Sounds harmless enough, but that'll actually
cause problems if the old cluster had a nextMulti that's bigger than
that page.

When vac_truncate_clog() calls TruncateMultiXact() that'll scan
pg_multixact/offsets to find the earliest existing segment. That'll be
0000. If the to-be-truncated data is older than the last existing
segment it returns. Then it'll try to determine the last required data
in members/ by accessing the oldest data in offsets/.

Unfortunately, due to the existing 0000 segment, that means it'll
sometimes try to access a nonexistent offsets/ file, preventing vacuum
et al. from succeeding.

It seems to me the fix for this is to a) rmtree("pg_multixact/members",
false) in copy_clog_xlog_xid() in the old-cluster < 9.3 case, and b) add
a warning to the release notes that everyone who used pg_upgrade and has
a 0000 file lying around should report to the mailing list.

b) is a bit unsatisfactory, but I don't want to suggest removing the
file. In too many situations it might actually still be needed.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Hi,

On 2014-05-30 14:16:31 +0200, Andres Freund wrote:
> When upgrading a < 9.3 cluster pg_upgrade doesn't bother to keep the old
> multixacts around because they won't be read after the upgrade (and
> aren't compatible). It just resets the new cluster's nextMulti to the
> old + 1.
> Unfortunately that means that there'll be an offsets/0000 file created by
> initdb around. Sounds harmless enough, but that'll actually cause
> problems if the old cluster had a nextMulti that's bigger than that
> page.
>
> When vac_truncate_clog() calls TruncateMultiXact() that'll scan
> pg_multixact/offsets to find the earliest existing segment. That'll be
> 0000. If the to-be-truncated data is older than the last existing
> segment it returns. Then it'll try to determine the last required data
> in members/ by accessing the oldest data in offsets/.
>
> Unfortunately, due to the existing 0000 segment, that means it'll
> sometimes try to access a nonexistent offsets/ file, preventing vacuum
> et al. from succeeding.

For the sake of search engines, the error that triggers is:

    ERROR: could not access status of transaction 2072053907
    DETAIL: Could not open file "pg_multixact/offsets/7B81": No such file or directory.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Fri, May 30, 2014 at 02:16:31PM +0200, Andres Freund wrote:
> Hi,
>
> When upgrading a < 9.3 cluster pg_upgrade doesn't bother to keep the old
> multixacts around because they won't be read after the upgrade (and
> aren't compatible). It just resets the new cluster's nextMulti to the
> old + 1.
> Unfortunately that means that there'll be an offsets/0000 file created by
> initdb around. Sounds harmless enough, but that'll actually cause
> problems if the old cluster had a nextMulti that's bigger than that
> page.
>
> When vac_truncate_clog() calls TruncateMultiXact() that'll scan
> pg_multixact/offsets to find the earliest existing segment. That'll be
> 0000. If the to-be-truncated data is older than the last existing
> segment it returns. Then it'll try to determine the last required data
> in members/ by accessing the oldest data in offsets/.
>
> Unfortunately, due to the existing 0000 segment, that means it'll
> sometimes try to access a nonexistent offsets/ file, preventing vacuum
> et al. from succeeding.
>
> It seems to me the fix for this is to a) rmtree("pg_multixact/members",
> false) in copy_clog_xlog_xid() in the old-cluster < 9.3 case, and b) add
> a warning to the release notes that everyone who used pg_upgrade and has
> a 0000 file lying around should report to the mailing list.
>
> b) is a bit unsatisfactory, but I don't want to suggest removing the
> file. In too many situations it might actually still be needed.

This is a bug in 9.3 pg_upgrade as well?  Why has no one reported it
before?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +
On 2014-05-30 09:29:16 -0400, Bruce Momjian wrote:
> This is a bug in 9.3 pg_upgrade as well?

Yes.

> Why has no one reported it before?

My guess is that it wasn't attributed to pg_upgrade in the past.
Typically the error will only occur a fair amount of time later. You'll
just see vacuums randomly erroring out with slru.c errors about
nonexistent files :(.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Fri, May 30, 2014 at 03:32:44PM +0200, Andres Freund wrote:
> On 2014-05-30 09:29:16 -0400, Bruce Momjian wrote:
> > This is a bug in 9.3 pg_upgrade as well?
>
> Yes.
>
> > Why has no one reported it before?
>
> My guess is that it wasn't attributed to pg_upgrade in the
> past. Typically the error will only occur a fair amount of time
> later. You'll just see vacuums randomly erroring out with slru.c errors
> about nonexistent files :(.

But how much later?  pg_upgrade is pretty popular now but I am just not
seeing the number of errors as I would expect:

    ERROR: could not access status of transaction 2072053907
    DETAIL: Could not open file "pg_multixact/offsets/7B81": No such file or directory.

I am not saying there is no bug, but from your analysis it would seem to
be 100% of pg_upgrade'ed clusters that use multi-xacts.  Is that true?
If so, it seems we would need to tell everyone to remove the 0000 files
if there are higher-numbered ones with numbering gaps.

Is this something our next minor release should fix in the multi-xacts
code?  Fixing pg_upgrade seems like the easy part.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +
On 2014-05-30 11:00:06 -0400, Bruce Momjian wrote:
> On Fri, May 30, 2014 at 03:32:44PM +0200, Andres Freund wrote:
> > On 2014-05-30 09:29:16 -0400, Bruce Momjian wrote:
> > > This is a bug in 9.3 pg_upgrade as well?
> >
> > Yes.
> >
> > > Why has no one reported it before?
> >
> > My guess is that it wasn't attributed to pg_upgrade in the
> > past. Typically the error will only occur a fair amount of time
> > later. You'll just see vacuums randomly erroring out with slru.c errors
> > about nonexistent files :(.
>
> But how much later?  pg_upgrade is pretty popular now but I am just not
> seeing the number of errors as I would expect:
>
>     ERROR: could not access status of transaction 2072053907
>     DETAIL: Could not open file "pg_multixact/offsets/7B81": No such file or directory.
>
> I am not saying there is no bug, but from your analysis it would seem to
> be 100% of pg_upgrade'ed clusters that use multi-xacts.

It'd need to be clusters that used more multixacts than fit onto two
slru offsets/ segments in < 9.2. Otherwise things will probably just
continue to work because there's no hole. Also, the new cluster needs to
have used more than vacuum_multixact_freeze_min_age multis after the
upgrade to make the problem visible. I think.

> If so, it seems we would need to tell everyone to remove the 0000 files
> if there are higher-numbered ones with numbering gaps.

The problem is that it's not actually that easy to define whether
there's a gap - after a multixact id wraparound the 0000 file might
exist again. So I'd rather not give a simple instruction that might
delete critical data.

> Is this something our next minor release should fix in the multi-xacts
> code?

I wondered whether we could detect that case and deal with it
transparently, but haven't come up with something smart. Alvaro?

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Fri, May 30, 2014 at 05:13:17PM +0200, Andres Freund wrote:
> On 2014-05-30 11:00:06 -0400, Bruce Momjian wrote:
> > On Fri, May 30, 2014 at 03:32:44PM +0200, Andres Freund wrote:
> > > On 2014-05-30 09:29:16 -0400, Bruce Momjian wrote:
> > > > This is a bug in 9.3 pg_upgrade as well?
> > >
> > > Yes.
> > >
> > > > Why has no one reported it before?
> > >
> > > My guess is that it wasn't attributed to pg_upgrade in the
> > > past. Typically the error will only occur a fair amount of time
> > > later. You'll just see vacuums randomly erroring out with slru.c errors
> > > about nonexistent files :(.
> >
> > But how much later?  pg_upgrade is pretty popular now but I am just not
> > seeing the number of errors as I would expect:
> >
> >     ERROR: could not access status of transaction 2072053907
> >     DETAIL: Could not open file "pg_multixact/offsets/7B81": No such file or directory.
> >
> > I am not saying there is no bug, but from your analysis it would seem to
> > be 100% of pg_upgrade'ed clusters that use multi-xacts.
>
> It'd need to be clusters that used more multixacts than fit onto two
> slru offsets/ segments in < 9.2. Otherwise things will probably just
> continue to work because there's no hole.

Do you mean <= 9.2 here for the old cluster?  Multi-xacts were added in
9.3:

    commit 0ac5ad5134f2769ccbaefec73844f8504c4d6182
    Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
    Date:   Wed Jan 23 12:04:59 2013 -0300

        Improve concurrency of foreign key locking

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +
On 2014-05-30 16:35:15 -0400, Bruce Momjian wrote:
> On Fri, May 30, 2014 at 05:13:17PM +0200, Andres Freund wrote:
> > It'd need to be clusters that used more multixacts than fit onto two
> > slru offsets/ segments in < 9.2. Otherwise things will probably just
> > continue to work because there's no hole.
>
> Do you mean <= 9.2 here for the old cluster?

Yes.

> Multi-xacts were added in 9.3:

That's not really correct. They were added in 8.2 or something...

>     commit 0ac5ad5134f2769ccbaefec73844f8504c4d6182
>     Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
>     Date:   Wed Jan 23 12:04:59 2013 -0300
>
>         Improve concurrency of foreign key locking

That just expanded the usage to also be able to track updated rows.

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Fri, May 30, 2014 at 11:03:53PM +0200, Andres Freund wrote:
> On 2014-05-30 16:35:15 -0400, Bruce Momjian wrote:
> > On Fri, May 30, 2014 at 05:13:17PM +0200, Andres Freund wrote:
> > > It'd need to be clusters that used more multixacts than fit onto two
> > > slru offsets/ segments in < 9.2. Otherwise things will probably just
> > > continue to work because there's no hole.
> >
> > Do you mean <= 9.2 here for the old cluster?
>
> Yes.
>
> > Multi-xacts were added in 9.3:
>
> That's not really correct. They were added in 8.2 or something...
>
> >     commit 0ac5ad5134f2769ccbaefec73844f8504c4d6182
> >     Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
> >     Date:   Wed Jan 23 12:04:59 2013 -0300
> >
> >         Improve concurrency of foreign key locking
>
> That just expanded the usage to also be able to track updated rows.

I guess I meant the new multixacts file format was in 9.3.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +
On Fri, May 30, 2014 at 05:25:26PM -0400, Bruce Momjian wrote:
> On Fri, May 30, 2014 at 11:03:53PM +0200, Andres Freund wrote:
> > On 2014-05-30 16:35:15 -0400, Bruce Momjian wrote:
> > > Do you mean <= 9.2 here for the old cluster?
> >
> > Yes.
> >
> > > Multi-xacts were added in 9.3:
> >
> > That's not really correct. They were added in 8.2 or something...
> >
> > That just expanded the usage to also be able to track updated rows.
>
> I guess I meant the new multixacts file format was in 9.3.

It appears this item is on hold until Alvaro can comment.  I can
probably find the impact and correct the problem for previous upgrades,
but it might be 1-2 months until I can get to it.  The fix for future
upgrades is simple.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +
Andres Freund wrote:
> Hi,
>
> When upgrading a < 9.3 cluster pg_upgrade doesn't bother to keep the old
> multixacts around because they won't be read after the upgrade (and
> aren't compatible). It just resets the new cluster's nextMulti to the
> old + 1.
> Unfortunately that means that there'll be an offsets/0000 file created by
> initdb around. Sounds harmless enough, but that'll actually cause
> problems if the old cluster had a nextMulti that's bigger than that
> page.
>
> When vac_truncate_clog() calls TruncateMultiXact() that'll scan
> pg_multixact/offsets to find the earliest existing segment. That'll be
> 0000. If the to-be-truncated data is older than the last existing
> segment it returns. Then it'll try to determine the last required data
> in members/ by accessing the oldest data in offsets/.

I'm trying to understand the mechanism of this bug, and I'm not
succeeding.  If the offsets/0000 was created by initdb, how come we try
to delete a file that's not also members/0000?  I mean, surely the file
as created by initdb is empty (zeroed).  In your sample error message
downthread,

    ERROR: could not access status of transaction 2072053907
    DETAIL: Could not open file "pg_multixact/offsets/7B81": No such file or directory.

what prompted the status of that multixid to be sought?  I see one
possible path to this error message, which is SlruPhysicalReadPage().
(There are other paths that lead to similar errors, but they use
"transaction 0" instead, so we can rule those out; and we can rule out
anything that uses MultiXactMemberCtl because of the path given in
DETAIL.)  There are four callsites that lead to that:

    RecordNewMultiXact
    GetMultiXactIdMembers (2x)
    TrimMultiXact

Of those, only GetMultiXactIdMembers is likely to be called from vacuum
(actually RecordNewMultiXact can too, in a few cases, if it happens to
freeze a multi by creating another multi; should be pretty rare).

But you were talking about vacuum truncating pg_multixact -- and I don't
see how that's related to these functions.  Is it possible that you
pasted the wrong error message?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2014-06-13 13:51:51 -0400, Alvaro Herrera wrote:
> Andres Freund wrote:
> > When upgrading a < 9.3 cluster pg_upgrade doesn't bother to keep the old
> > multixacts around because they won't be read after the upgrade (and
> > aren't compatible). It just resets the new cluster's nextMulti to the
> > old + 1.
> > Unfortunately that means that there'll be an offsets/0000 file created by
> > initdb around. Sounds harmless enough, but that'll actually cause
> > problems if the old cluster had a nextMulti that's bigger than that
> > page.
> >
> > When vac_truncate_clog() calls TruncateMultiXact() that'll scan
> > pg_multixact/offsets to find the earliest existing segment. That'll be
> > 0000. If the to-be-truncated data is older than the last existing
> > segment it returns. Then it'll try to determine the last required data
> > in members/ by accessing the oldest data in offsets/.
>
> I'm trying to understand the mechanism of this bug, and I'm not
> succeeding.  If the offsets/0000 was created by initdb, how come we try
> to delete a file that's not also members/0000?  I mean, surely the file
> as created by initdb is empty (zeroed).  In your sample error message
> downthread,

The bit you're missing is essentially the following in
TruncateMultiXact():

	/*
	 * Note we can't just plow ahead with the truncation; it's possible that
	 * there are no segments to truncate, which is a problem because we are
	 * going to attempt to read the offsets page to determine where to
	 * truncate the members SLRU.  So we first scan the directory to
	 * determine the earliest offsets page number that we can read without
	 * error.
	 */

That check is thwarted due to the 0000 segment. So the segment
containing the oldestMXact has already been removed, but we don't notice
that that's the case.

>     ERROR: could not access status of transaction 2072053907
>     DETAIL: Could not open file "pg_multixact/offsets/7B81": No such file or directory.
>
> what prompted the status of that multixid to be sought?  I see one
> possible path to this error message, which is SlruPhysicalReadPage().
> (There are other paths that lead to similar errors, but they use
> "transaction 0" instead, so we can rule those out; and we can rule out
> anything that uses MultiXactMemberCtl because of the path given in
> DETAIL.)

Same function:

	/*
	 * First, compute the safe truncation point for MultiXactMember. This is
	 * the starting offset of the multixact we were passed as MultiXactOffset
	 * cutoff.
	 */
	{
		int			pageno;
		int			slotno;
		int			entryno;
		MultiXactOffset *offptr;

		/* lock is acquired by SimpleLruReadPage_ReadOnly */

		pageno = MultiXactIdToOffsetPage(oldestMXact);
		entryno = MultiXactIdToOffsetEntry(oldestMXact);

		slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno,
											oldestMXact);
		offptr = (MultiXactOffset *)
			MultiXactOffsetCtl->shared->page_buffer[slotno];
		offptr += entryno;
		oldestOffset = *offptr;

		LWLockRelease(MultiXactOffsetControlLock);
	}

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Bruce Momjian wrote:
> It appears this item is on hold until Alvaro can comment.  I can
> probably find the impact and correct the problem for previous upgrades,
> but it might be 1-2 months until I can get to it.  The fix for future
> upgrades is simple.

TBH I think future upgrades is what we need to fix now, so that people
don't do broken upgrades from 9.2 to 9.4 once the latter is out.  (I
would assume that upgrades from 9.3 will not have a problem, unless the
9.3 setup itself is already broken due to a prior pg_upgrade).  Also, we
need to fix pg_upgrade to protect people upgrading to 9.3.5 from <= 9.2.

As for detecting the case in existing installs, I think we can supply a
query in the next minor release notes that uses pg_ls_dir() to see
whether there is a 0000 file outside the working range of
multixact/offset, whose result if nonempty would indicate the need to
delete (or more likely rename) a file.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Bruce Momjian wrote:
> On Fri, May 30, 2014 at 02:16:31PM +0200, Andres Freund wrote:
> > Hi,
> >
> > When upgrading a < 9.3 cluster pg_upgrade doesn't bother to keep the old
> > multixacts around because they won't be read after the upgrade (and
> > aren't compatible). It just resets the new cluster's nextMulti to the
> > old + 1.
> > Unfortunately that means that there'll be an offsets/0000 file created by
> > initdb around. Sounds harmless enough, but that'll actually cause
> > problems if the old cluster had a nextMulti that's bigger than that
> > page.

I've been playing with this a bit.  Here's a patch that just does
rmtree() of the problematic file during pg_upgrade, as proposed by
Andres, which solves the problem.  Note that this patch removes the 0000
file in both cases: when upgrading from 9.2, and when upgrading from
9.3+.  The former case is the bug that Andres reported.  In the second
case, we overwrite the files with the ones from the old cluster; if
there's a lingering 0000 file in the new cluster, it would cause
problems as well.  (Really, I don't see any reason to think that these
two cases are any different.)

> This is a bug in 9.3 pg_upgrade as well?  Why has no one reported it
> before?

I think one reason is that not all upgrades see an issue here; for old
clusters that haven't gone beyond the 0000 offset file, there is no
problem.  For clusters that have gone beyond 0000 but not by much, the
file will be deleted during the first truncation.  It only becomes a
problem if the cluster is close enough to 2^31.  Another thing to keep
in consideration is that initdb initializes all databases' datminmxid to
1.  If the old cluster was past the 2^31 point, it means the datminmxid
doesn't move from 1 until the actual wraparound.

I noticed another problem while trying to reproduce it.  It only happens
in assert-enabled builds: when FreezeMultiXactId() sees an old multi, it
asserts that it mustn't be running anymore, which is fine except that if
the multi value is one that survived a pg_upgrade cycle from 9.2 or
older (i.e. it was a share-locked tuple in the old cluster), an error is
raised when the assertion is checked.  Since it doesn't make sense to
examine its running status at that point, the attached patch simply
disables the assertion in that case.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On Wed, Jun 18, 2014 at 06:51:31PM -0400, Alvaro Herrera wrote:
> Bruce Momjian wrote:
> > On Fri, May 30, 2014 at 02:16:31PM +0200, Andres Freund wrote:
> > > Hi,
> > >
> > > When upgrading a < 9.3 cluster pg_upgrade doesn't bother to keep the old
> > > multixacts around because they won't be read after the upgrade (and
> > > aren't compatible). It just resets the new cluster's nextMulti to the
> > > old + 1.
> > > Unfortunately that means that there'll be an offsets/0000 file created by
> > > initdb around. Sounds harmless enough, but that'll actually cause
> > > problems if the old cluster had a nextMulti that's bigger than that
> > > page.
>
> I've been playing with this a bit.  Here's a patch that just does
> rmtree() of the problematic file during pg_upgrade, as proposed by
> Andres, which solves the problem.  Note that this patch removes the 0000
> file in both cases: when upgrading from 9.2, and when upgrading from
> 9.3+.  The former case is the bug that Andres reported.  In the second
> case, we overwrite the files with the ones from the old cluster; if
> there's a lingering 0000 file in the new cluster, it would cause
> problems as well.  (Really, I don't see any reason to think that these
> two cases are any different.)

I wasn't happy with having that delete code added there when we do
directory delete in the function above.  I instead broke apart the
delete and copy code and called the delete code where needed, in the
attached patch.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +
Attachment
On Wed, Jun 18, 2014 at 06:51:31PM -0400, Alvaro Herrera wrote:
> > This is a bug in 9.3 pg_upgrade as well?  Why has no one reported it
> > before?
>
> I think one reason is that not all upgrades see an issue here; for old
> clusters that haven't gone beyond the 0000 offset file, there is no
> problem.  For clusters that have gone beyond 0000 but not by much, the
> file will be deleted during the first truncation.  It only becomes a
> problem if the cluster is close enough to 2^31.  Another thing to keep
> in consideration is that initdb initializes all databases' datminmxid to
> 1.  If the old cluster was past the 2^31 point, it means the datminmxid
> doesn't move from 1 until the actual wraparound.

OK, so the xid has to be beyond 2^31 during pg_upgrade to trigger a
problem?  That might explain the rare reporting of this bug.  What would
the test query look like so we can tell people when to remove the '0000'
files?  Would we need to see the existence of '0000' and high-numbered
files?  How high?  What does a 2^31 file look like?

Also, is there a reason you didn't remove the 'members/0000' file in your
patch?  I have removed it in my version.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +
On Wed, Jun 18, 2014 at 09:52:05PM -0400, Bruce Momjian wrote:
> On Wed, Jun 18, 2014 at 06:51:31PM -0400, Alvaro Herrera wrote:
> > > This is a bug in 9.3 pg_upgrade as well?  Why has no one reported it
> > > before?
> >
> > I think one reason is that not all upgrades see an issue here; for old
> > clusters that haven't gone beyond the 0000 offset file, there is no
> > problem.  For clusters that have gone beyond 0000 but not by much, the
> > file will be deleted during the first truncation.  It only becomes a
> > problem if the cluster is close enough to 2^31.  Another thing to keep
> > in consideration is that initdb initializes all databases' datminmxid to
> > 1.  If the old cluster was past the 2^31 point, it means the datminmxid
> > doesn't move from 1 until the actual wraparound.
>
> OK, so the xid has to be beyond 2^31 during pg_upgrade to trigger a
> problem?  That might explain the rare reporting of this bug.  What would
> the test query look like so we can tell people when to remove the '0000'
> files?  Would we need to see the existence of '0000' and high-numbered
> files?  How high?  What does a 2^31 file look like?

Also, what would a legitimate 0000 file at wrap-around time look like?
Would there have to be an 'ffff' or 'ffffff' file?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +
Bruce Momjian wrote:
> I wasn't happy with having that delete code added there when we do
> directory delete in the function above.  I instead broke apart the
> delete and copy code and called the delete code where needed, in the
> attached patch.

Makes sense, yeah.  I didn't look closely enough to realize that the
function that does the copying also does the rmtree part.

I also now realize why the other case (upgrade from 9.3 to 9.4) does not
have a bug: we are already deleting the files in that path.

Bruce Momjian wrote:
> > I think one reason is that not all upgrades see an issue here; for old
> > clusters that haven't gone beyond the 0000 offset file, there is no
> > problem.  For clusters that have gone beyond 0000 but not by much, the
> > file will be deleted during the first truncation.  It only becomes a
> > problem if the cluster is close enough to 2^31.  Another thing to keep
> > in consideration is that initdb initializes all databases' datminmxid to
> > 1.  If the old cluster was past the 2^31 point, it means the datminmxid
> > doesn't move from 1 until the actual wraparound.
>
> OK, so the xid has to be beyond 2^31 during pg_upgrade to trigger a
> problem?  That might explain the rare reporting of this bug.  What would
> the test query look like so we can tell people when to remove the '0000'
> files?  Would we need to see the existence of '0000' and high-numbered
> files?  How high?  What does a 2^31 file look like?

I misspoke.

I ran a few more upgrades, and then tried vacuuming all databases, which
is when the truncate code is run.  Say the original cluster had an
oldestmulti of 10 million.  If you just run VACUUM in the new cluster
after the upgrade, the 0000 file is not deleted: it's not yet old enough
in terms of multixact age.  An error is not thrown, because we're still
not attempting a truncate.  But if you lower the
vacuum_multixact_freeze_table_age to 10 million minus one, then we will
try the deletion and that will raise the error.

I think (didn't actually try) if you just let 150 million multixacts be
generated, that's the first time you will get the error.

Now if you run a VACUUM FREEZE after the upgrade, the file will be
deleted with no error.

I now think that the reason most people haven't hit the problem is that
they don't generate enough multis after upgrading a database that had
enough multis in the old database.  This seems a bit curious.

> Also, is there a reason you didn't remove the 'members/0000' file in your
> patch?  I have removed it in my version.

There's no point.  That file is the starting point for new multis
anyway, and it's compatible with the new format (because it's all
zeroes).  I guess you could get in trouble if you initdb the 9.3
database, generate a few multis there, and then drop all tables and use
that modified cluster as a "new" cluster for pg_upgrade.  I would
question the sanity of a person doing something like that, however.

Bruce Momjian wrote:
> > OK, so the xid has to be beyond 2^31 during pg_upgrade to trigger a
> > problem?  That might explain the rare reporting of this bug.  What would
> > the test query look like so we can tell people when to remove the '0000'
> > files?  Would we need to see the existence of '0000' and high-numbered
> > files?  How high?  What does a 2^31 file look like?
>
> Also, what would a legitimate 0000 file at wrap-around time look like?
> Would there have to be an 'ffff' or 'ffffff' file?

Since I was wrong, there is no point in further research here.  Anyway
the last file before wrapping around in pg_multixact/members is FFFF.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
BTW I hacked up pg_resetxlog a bit to make it generate the necessary
pg_multixact/offset file when -m is given.  This is not acceptable for
commit because it duplicates the #defines from pg_multixact.c, but maybe
we want this functionality enough that we're interested in a more
complete version of this patch; also it unconditionally writes one zero
byte to the file, which is of course wrong if the file exists and
already contains data.

It'd be nice to be able to write this in a way that lets it work for all
existing SLRU users (pg_clog is the most common, but pg_multixact/members
and the predicate locking stuff might also be useful).  Not sure what
that would look like.

Another consideration is that it doesn't remove existing files that are
outside of the new valid range according to freezing parameters and
such, but I'm not sure this is really doable considering that we might
need to change the relminmxid and datminmxid values, etc.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Alvaro Herrera wrote:
> Bruce Momjian wrote:
> > > OK, so the xid has to be beyond 2^31 during pg_upgrade to trigger a
> > > problem?  That might explain the rare reporting of this bug.  What would
> > > the test query look like so we can tell people when to remove the '0000'
> > > files?  Would we need to see the existence of '0000' and high-numbered
> > > files?  How high?  What does a 2^31 file look like?
> >
> > Also, what would a legitimate 0000 file at wrap-around time look like?
> > Would there have to be an 'ffff' or 'ffffff' file?
>
> Since I was wrong, there is no point in further research here.  Anyway
> the last file before wrapping around in pg_multixact/members is FFFF.

Oops, I meant the last file before wrap in pg_multixact/offsets is FFFF,
which is what we're talking about in this thread.  For members it's
14078, but it's not relevant here.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jun 19, 2014 at 06:04:25PM -0400, Alvaro Herrera wrote: > Bruce Momjian wrote: > > > I wasn't happy with having that delete code added there when we do > > directory delete in the function above. I instead broke apart the > > delete and copy code and called the delete code where needed, in the > > attached patch. > > Makes sense, yeah. I didn't look closely enough to realize that the > function that does the copying also does the rmtree part. OK. Should I apply my patch so at least pg_upgrade is good going forward? > I also now realize why the other case (upgrade from 9.3 to 9.4) does not > have a bug: we are already deleting the files in that path. Right, and I think the patch makes it clearer why we need those 'rm' function calls because they mirror the 'copy' ones. > > OK, so the xid has to be beyond 2^31 during pg_upgrade to trigger a > > problem? That might explain the rare reporting of this bug. What would > > the test query look like so we can tell people when to remove the '0000' > > files? Would we need to see the existence of '0000' and high-numbered > > files? How high? What does a 2^31 file look like? > > I misspoke. > > I ran a few more upgrades, and then tried vacuuming all databases, which > is when the truncate code is run. Say the original cluster had an > oldestmulti of 10 million. If you just run VACUUM in the new cluster > after the upgrade, the 0000 file is not deleted: it's not yet old enough > in terms of multixact age. An error is not thrown, because we're still > not attempting a truncate. But if you lower the > vacuum_multixact_freeze_table_age to 10 million minus one, then we will > try the deletion and that will raise the error. > > I think (didn't actually try) if you just let 150 million multixacts be > generated, that's the first time you will get the error. > > Now if you run a VACUUM FREEZE after the upgrade, the file will be > deleted with no error. 
> > I now think that the reason most people haven't hit the problem is that > they don't generate enough multis after upgrading a database that had > enough multis in the old database. OK, that does make more sense. A user would need to have generated enough multis in the old cluster for segment 0000 to have been removed, and _then_ run far enough past the freeze horizon in the new cluster to again require file removal. That explains why we are only seeing the bug now, and a quick minor release with a query to fix this should get us out of the problem with minimal impact. > > Also, is there a reason you didn't remove the 'members/0000' file in your > > patch? I have removed it in my version. > > There's no point. That file is the starting point for new multis > anyway, and it's compatible with the new format (because it's all > zeroes). I think it should be done for consistency with the 'copy' function calls above. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On Thu, Jun 19, 2014 at 06:12:41PM -0400, Alvaro Herrera wrote: > BTW I hacked up pg_resetxlog a bit to make it generate the necessary > pg_multixact/offset file when -m is given. This is not acceptable for > commit because it duplicates the #defines from pg_multixact.c, but maybe > we want this functionality enough that we're interested in a more > complete version of this patch; also it unconditionally writes one zero > byte to the file, which is of course wrong if the file exists and > already contains data. Why would we want this if the system functions fine without those files being created? I don't think we want pg_resetxlog to be doing anything except changing pg_controldata. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Bruce Momjian wrote: > On Thu, Jun 19, 2014 at 06:12:41PM -0400, Alvaro Herrera wrote: > > BTW I hacked up pg_resetxlog a bit to make it generate the necessary > > pg_multixact/offset file when -m is given. This is not acceptable for > > commit because it duplicates the #defines from pg_multixact.c, but maybe > > we want this functionality enough that we're interested in a more > > complete version of this patch; also it unconditionally writes one zero > > byte to the file, which is of course wrong if the file exists and > > already contains data. > > Why would we want this if the system functions fine without those files > being created? I don't think we want pg_resetxlog to be doing anything > except changing pg_controldata. I don't understand why you say the system functions fine. If you move the multixactid with pg_resetxlog to a point which doesn't have the necessary file and page, the system will not start. Only pg_upgrade knows to create the file appropriately, but normal operation doesn't. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > Bruce Momjian wrote: >> Why would we want this if the system functions fine without those files >> being created? I don't think we want pg_resetxlog to be doing anything >> except changing pg_controldata. > I don't understand why you say the system functions fine. If you move > the multixactid with pg_resetxlog to a point which doesn't have the > necessary file and page, the system will not start. Only pg_upgrade > knows to create the file appropriately, but normal operation doesn't. Yeah. In my mind, at least, there's been a TODO for some time for pg_resetxlog to make sure that pg_clog and the other SLRU-like files get populated appropriately when you tell it to change those counters. You have to have a segment file at the right place or startup fails. I'm not sure why Alvaro chose to fix this only for pg_multixact, but fixing that case is better than not fixing anything. regards, tom lane
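For reference, the mapping from a MultiXactId to the offsets/ segment file that must exist, which is what the server trips over when the segment is missing, can be sketched as follows (same assumptions as the real code paths in multixact.c/slru.c: 8 kB pages, 2048 4-byte offset entries per page, 32 pages per segment; this is an illustration, not the actual C code):

```python
# Sketch: which pg_multixact/offsets segment file holds a given MultiXactId.
# Assumes BLCKSZ = 8192, 4-byte MultiXactOffset entries, and 32 pages per
# SLRU segment; see multixact.c / slru.c for the real definitions.

MULTIXACT_OFFSETS_PER_PAGE = 8192 // 4   # 2048 entries per page
SLRU_PAGES_PER_SEGMENT = 32

def offsets_segment_for(multi):
    page = multi // MULTIXACT_OFFSETS_PER_PAGE
    return "%04X" % (page // SLRU_PAGES_PER_SEGMENT)

# The multixact from the error quoted at the top of this thread lands in
# segment 7B81 ("Could not open file pg_multixact/offsets/7B81"):
print(offsets_segment_for(2072053907))   # 7B81
```

So moving the counter with pg_resetxlog -m without creating the corresponding segment file leaves exactly this lookup with nothing to open.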
On Thu, Jun 19, 2014 at 09:06:54PM -0400, Alvaro Herrera wrote: > Alvaro Herrera wrote: > > Bruce Momjian wrote: > > > Bruce Momjian wrote: > > > > > > OK, so the xid has to be beyond 2^31 during pg_upgrade to trigger a > > > > problem? That might explain the rare reporting of this bug. What would > > > > the test query look like so we can tell people when to remove the '0000' > > > > files? Would we need to see the existence of '0000' and high-numbered > > > > files? How high? What does a 2^31 file look like? > > > > > > Also, what would a legitimate 0000 file at wrap-around time look like? > > > Would there have to be an 'ffff' or 'ffffff' file? > > > > Since I was wrong, there is no point in further research here. Anyway > > the last file before wrapping around in pg_multixact/members is FFFF. > > Oops, I meant the last file before wrap in pg_multixact/offsets is FFFF, > which is what we're talking about in this thread. > > For members it's 14078, but it's not relevant here. OK, so the next question is, what will the minor-release-note query we give users to test this look like? Do we expect no gaps in numbering? Is it enough to test for the existence of '0000' and lack of '0001' and 'FFFF'? Basically, if we expect no gaps in normal numbering, then a '0000' with no number after it and no wrap-around number before it means the '0000' is left over from initdb and can be removed. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On Fri, Jun 20, 2014 at 01:33:52PM -0400, Alvaro Herrera wrote: > Bruce Momjian wrote: > > On Thu, Jun 19, 2014 at 06:12:41PM -0400, Alvaro Herrera wrote: > > > BTW I hacked up pg_resetxlog a bit to make it generate the necessary > > > pg_multixact/offset file when -m is given. This is not acceptable for > > > commit because it duplicates the #defines from pg_multixact.c, but maybe > > > we want this functionality enough that we're interested in a more > > > complete version of this patch; also it unconditionally writes one zero > > > byte to the file, which is of course wrong if the file exists and > > > already contains data. > > > > Why would we want this if the system functions fine without those files > > being created? I don't think we want pg_resetxlog to be doing anything > > except changing pg_controldata. > > I don't understand why you say the system functions fine. If you move > the multixactid with pg_resetxlog to a point which doesn't have the > necessary file and page, the system will not start. Only pg_upgrade > knows to create the file appropriately, but normal operation doesn't. True, but who would change the multi-xact except pg_upgrade, and would you want pg_resetxlog to be changing other files in the file system, perhaps as part of some disaster recovery operation? The larger question is whether the backend code should more cleanly handle such cases. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On Fri, Jun 20, 2014 at 01:41:42PM -0400, Tom Lane wrote: > Alvaro Herrera <alvherre@2ndquadrant.com> writes: > > Bruce Momjian wrote: > >> Why would we want this if the system functions fine without those files > >> being created? I don't think we want pg_resetxlog to be doing anything > >> except changing pg_controldata. > > > I don't understand why you say the system functions fine. If you move > > the multixactid with pg_resetxlog to a point which doesn't have the > > necessary file and page, the system will not start. Only pg_upgrade > > knows to create the file appropriately, but normal operation doesn't. > > Yeah. In my mind, at least, there's been a TODO for some time for > pg_resetxlog to make sure that pg_clog and the other SLRU-like files > get populated appropriately when you tell it to change those counters. > You have to have a segment file at the right place or startup fails. > > I'm not sure why Alvaro chose to fix this only for pg_multixact, but > fixing that case is better than not fixing anything. Well, if we are going to do this, we had better do all cases or the behavior will be quite surprising, and when doing disaster recovery, surprises are not good. I am a little worried that removing files as part of pg_resetxlog might cause data to be lost as users try to figure things out. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Bruce Momjian <bruce@momjian.us> writes: > On Fri, Jun 20, 2014 at 01:41:42PM -0400, Tom Lane wrote: >> Yeah. In my mind, at least, there's been a TODO for some time for >> pg_resetxlog to make sure that pg_clog and the other SLRU-like files >> get populated appropriately when you tell it to change those counters. >> You have to have a segment file at the right place or startup fails. > Well, if we are going to do this, we had better do all cases or the > behavior will be quite surprising, and when doing disaster recovery, > surprises are not good. I am a little worried that removing files as > part of pg_resetxlog might cause data to be lost as users try to figure > things out. Huh? The basic charter of pg_resetxlog is to throw away files, ie, all of your WAL. But in this case what we're talking about is a secondary function, namely the ability to move the XID, MXID, etc counter values. Up to now, if you did that, it was on your head to manually create appropriate SLRU segment files. It's certainly less error-prone, not more so, if the code were to take care of that for you. regards, tom lane
On Fri, Jun 20, 2014 at 02:24:55PM -0400, Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > On Fri, Jun 20, 2014 at 01:41:42PM -0400, Tom Lane wrote: > >> Yeah. In my mind, at least, there's been a TODO for some time for > >> pg_resetxlog to make sure that pg_clog and the other SLRU-like files > >> get populated appropriately when you tell it to change those counters. > >> You have to have a segment file at the right place or startup fails. > > > Well, if we are going to do this, we had better do all cases or the > > behavior will be quite surprising, and when doing disaster recovery, > > surprises are not good. I am a little worried that removing files as > > part of pg_resetxlog might cause data to be lost as users try to figure > > things out. > > Huh? The basic charter of pg_resetxlog is to throw away files, ie, all > of your WAL. > > But in this case what we're talking about is a secondary function, namely > the ability to move the XID, MXID, etc counter values. Up to now, if you > did that, it was on your head to manually create appropriate SLRU segment > files. It's certainly less error-prone, not more so, if the code were > to take care of that for you. I am fine with that, but let's do a complete job so the user API is not confusing. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On Thu, Jun 19, 2014 at 06:04:25PM -0400, Alvaro Herrera wrote: > Bruce Momjian wrote: > > > I wasn't happy with having that delete code added there when we do > > directory delete in the function above. I instead broke apart the > > delete and copy code and called the delete code where needed, in the > > attached patch. > > Makes sense, yeah. I didn't look closely enough to realize that the > function that does the copying also does the rmtree part. Hearing nothing, patch applied back through 9.3. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On Fri, Jun 20, 2014 at 02:07:25PM -0400, Bruce Momjian wrote: > On Thu, Jun 19, 2014 at 09:06:54PM -0400, Alvaro Herrera wrote: > > Alvaro Herrera wrote: > > > Bruce Momjian wrote: > > > > > Bruce Momjian wrote: > > > > > > > > OK, so the xid has to be beyond 2^31 during pg_upgrade to trigger a > > > > > problem? That might explain the rare reporting of this bug. What would > > > > > the test query look like so we can tell people when to remove the '0000' > > > > > files? Would we need to see the existence of '0000' and high-numbered > > > > > files? How high? What does a 2^31 file look like? > > > > > > > > Also, what would a legitimate 0000 file at wrap-around time look like? > > > > Would there have to be an 'ffff' or 'ffffff' file? > > > > > > Since I was wrong, there is no point in further research here. Anyway > > > the last file before wrapping around in pg_multixact/members is FFFF. > > > > Oops, I meant the last file before wrap in pg_multixact/offsets is FFFF, > > which is what we're talking about in this thread. > > > > For members it's 14078, but it's not relevant here. > > OK, so the next questions is, what will the minor-release-note query we > give users to test this look like? Do we expect no gaps in numbering? > Is it enough to test for the existance of '0000' and lack of '0001' and > 'FFFF'? Basically, if we expect no gaps in normal numbering, then a > '0000' with no number after it and no wrap-around number before it means > the '0000' is left over from initdb and can be removed. 
Assuming this is true, here is a query we can put in the next 9.3 minor release notes to tell users if they need to remove the '0000' file:

        WITH list(file) AS
        (
                SELECT * FROM pg_ls_dir('pg_multixact/offsets')
        )
        SELECT EXISTS (SELECT * FROM list WHERE file = '0000') AND
               NOT EXISTS (SELECT * FROM list WHERE file = '0001') AND
               NOT EXISTS (SELECT * FROM list WHERE file = 'FFFF') AND
               EXISTS (SELECT * FROM list WHERE file != '0000')
               AS file_removal_needed;

Do they need to remove the members/0000 file too?

-- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Bruce Momjian wrote: > Assuming this is true, here is a query we can put in the next 9.3 minor > release notes to tell users if they need to remove the '0000' file: > > WITH list(file) AS > ( > SELECT * FROM pg_ls_dir('pg_multixact/offsets') > ) > SELECT EXISTS (SELECT * FROM list WHERE file = '0000') AND > NOT EXISTS (SELECT * FROM list WHERE file = '0001') AND > NOT EXISTS (SELECT * FROM list WHERE file = 'FFFF') AND > EXISTS (SELECT * FROM list WHERE file != '0000') > AS file_removal_needed; > > Do they need to remove the members/0000 file too? No, the members file shouldn't be removed; if they have used multixacts at all since upgrading, the members/0000 file contains the members of the first bunch of multis. Removing it would remove valid data. (Also I still think that removing the members/0000 file that initdb creates is a mistake.) I think the query above is correct. One possible quibble is a FFFF segment that isn't allocated in full. I don't think this is a problem, because although the 0000 file would survive deletion, it would be overwritten with zeroes as soon as the counter wraps around. Since 0000 is seen as "in the future" as soon as file 8000 starts being used, it shouldn't be a problem there. Another thing probably worth doing is have pg_upgrade set pg_database.datminmxid to the minimum of pg_class.relminmxid on each database. Right now these values are all initialized to 1, which is a problem. We have set_frozenxids() which appears to be related, although I don't readily understand what that is doing. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
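To make the query's intent explicit, the same predicate can be sketched outside SQL (the file lists below are made-up examples, not taken from a real cluster):

```python
# Sketch of the release-note query's logic: the initdb-leftover offsets/0000
# should be removed only when other segments exist but neither 0001 (meaning
# 0000 is still in use) nor FFFF (meaning wraparound is imminent) is present.

def stray_0000_removal_needed(files):
    return ('0000' in files
            and '0001' not in files
            and 'FFFF' not in files
            and any(f != '0000' for f in files))

print(stray_0000_removal_needed(['0000', '7B80', '7B81']))  # True: leftover
print(stray_0000_removal_needed(['0000']))                  # False: fresh initdb
print(stray_0000_removal_needed(['0000', '0001']))          # False: 0000 is live
```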
Alvaro Herrera wrote: > I think the query above is correct. One possible quibble is a FFFF > segment that isn't allocated in full. I don't think this is a problem, > because although the 0000 file would survive deletion, it would be > overwritten with zeroes as soon as the counter wraps around. Since 0000 > is seen as "in the future" as soon as file 8000 starts being used, it > shouldn't be a problem there. ... that said, there is a wraparound check in SimpleLruTruncate: if you are using file 8000 and above, slru.c refuses to remove the old files because it believes you may have already suffered wraparound, and you would get a WARNING message every time the truncation was attempted. The warnings would probably be pretty much invisible because of their low frequency. This might be improved by my commit this afternoon, which should cause multixact truncations to be attempted much more often. In any case, just removing the offending offsets/0000 file should clear it up. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
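The "seen as in the future" behavior is the usual circular-counter comparison. A rough sketch, assuming 32-bit IDs as in MultiXactIdPrecedes() (not the actual slru.c code, which does the equivalent comparison on page numbers):

```python
# Sketch of the circular comparison behind "0000 is in the future": with
# 32-bit counters, a precedes b when the signed 32-bit difference (a - b)
# is negative, i.e. anything more than 2^31 behind counts as being ahead.

def precedes(a, b):
    diff = (a - b) & 0xFFFFFFFF
    if diff >= 0x80000000:          # reinterpret as signed 32-bit
        diff -= 0x100000000
    return diff < 0

# Each offsets segment covers 32 pages * 2048 entries = 65536 multis, so
# segment 8000 starts at 0x8000 * 65536 = 2^31. Once the counter is there,
# multi 0 (segment 0000) compares as being in the future:
print(precedes(2**31, 0))   # True: the current counter "precedes" multi 0
```

That is why slru.c refuses to truncate in this state: from its point of view, removing 0000 would be removing data that hasn't been written yet.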
On Fri, Jun 27, 2014 at 05:05:21PM -0400, Alvaro Herrera wrote: > Bruce Momjian wrote: > > > Assuming this is true, here is a query we can put in the next 9.3 minor > > release notes to tell users if they need to remove the '0000' file: > > > > WITH list(file) AS > > ( > > SELECT * FROM pg_ls_dir('pg_multixact/offsets') > > ) > > SELECT EXISTS (SELECT * FROM list WHERE file = '0000') AND > > NOT EXISTS (SELECT * FROM list WHERE file = '0001') AND > > NOT EXISTS (SELECT * FROM list WHERE file = 'FFFF') AND > > EXISTS (SELECT * FROM list WHERE file != '0000') > > AS file_removal_needed; > > > > Do they need to remove the members/0000 file too? > > No, the members file shouldn't be removed; if they have used multixacts > at all since upgrading, the members/0000 file contains the members of > the first bunch of multis. Removing it would remove valid data. (Also > I still think that removing the members/0000 file that initdb creates is > a mistake.) We never remove offset or members files with user data; we are only discussing the removal of empty files created by initdb. Can you explain why we would remove the offsets file but retain the members file created by initdb when we are _not_ preserving files from the old cluster? We are certainly setting the offset and member value from the old cluster:

        exec_prog(UTILITY_LOG_FILE, NULL, true,
                  "\"%s/pg_resetxlog\" -m %u,%u \"%s\"",
                  new_cluster.bindir,
                  old_cluster.controldata.chkpnt_nxtmulti + 1,
                  old_cluster.controldata.chkpnt_nxtmulti,
                  new_cluster.pgdata);

How are we not creating a gap in members files by setting the members value? Why will that not cause problems? > I think the query above is correct. One possible quibble is a FFFF > segment that isn't allocated in full. I don't think this is a problem, > because although the 0000 file would survive deletion, it would be > overwritten with zeroes as soon as the counter wraps around. 
> Since 0000 is seen as "in the future" as soon as file 8000 starts being used, it shouldn't be a problem there.

Understood. Here is an updated query with a better column label --- I think this is ready for the 9.3.X release notes, along with some text:

        WITH list(file) AS
        (
                SELECT * FROM pg_ls_dir('pg_multixact/offsets')
        )
        SELECT EXISTS (SELECT * FROM list WHERE file = '0000') AND
               NOT EXISTS (SELECT * FROM list WHERE file = '0001') AND
               NOT EXISTS (SELECT * FROM list WHERE file = 'FFFF') AND
               EXISTS (SELECT * FROM list WHERE file != '0000')
               AS file_0000_removal_required;

> Another thing probably worth doing is have pg_upgrade set pg_database.datminmxid to the minimum of pg_class.relminmxid on each database. Right now these values are all initialized to 1, which is a problem. We have set_frozenxids() which appears to be related, although I don't readily understand what that is doing.

I will address this in a separate email.

-- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On Fri, Jun 27, 2014 at 05:05:21PM -0400, Alvaro Herrera wrote: > Another thing probably worth doing is have pg_upgrade set > pg_database.datminmxid to the minimum of pg_class.relminmxid on each > database. Right now these values are all initialized to 1, which is a > problem. We have set_frozenxids() which appears to be related, although > I don't readily understand what that is doing. Yikes, OK. What pg_upgrade's set_frozenxids() does is to set the database and relation frozen xids to be the current xid, particularly template0 and the system tables, then pg_dump --binary-upgrade adjusts many of them to match the old cluster. What isn't happening, as you pointed out, is that we are not dealing with the new database and relation-level minmxid values in pg_upgrade or pg_dump --binary-upgrade. The attached 1k-line patch fixes that. What I am not sure about is how to set values from pre-9.3 clusters, and whether 9.3 pg_upgrade upgrades from pre-9.3 clusters are a problem. Are users who used pg_upgrade to go to 9.4 beta in trouble? I also have no way to know what value to use for pre-9.3 clusters --- I used controldata.chkpnt_nxtmulti in pg_upgrade (because the value was accessible), but 0 in pg_dump/pg_dumpall, like we already do for frozen xid values, but that usage is for major versions that pg_upgrade doesn't support, so it might be the wrong default. I am thinking that should be using controldata.chkpnt_nxtmulti, which exists back to 8.4, but I have no access to that value from pg_dump. In fact, the patch as it exists is flawed because it uses controldata.chkpnt_nxtmulti to set values from set_frozenxids(), because the value is accessible, but uses zero in pg_dump and pg_dumpall for pre-9.3 old clusters. :-( -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Bruce Momjian wrote: > What I am not sure about is how to set values from pre-9.3 clusters, and > whether 9.3 pg_upgrade upgrades from pre-9.3 clusters are a problem. > Are users who used pg_upgrade to go to 9.4 beta in trouble? > > I also have no way to know what value to use for pre-9.3 clusters --- I > used controldata.chkpnt_nxtmulti in pg_upgrade (because the value was > accessible), but 0 in pg_dump/pg_dumpall, like we already do for frozen > xid values, but that usage is for major versions that pg_upgrade doesn't > support, so it might be the wrong default. I am thinking that should be > using controldata.chkpnt_nxtmulti, which exists back to 8.4, but I have > no access to that value from pg_dump. In fact, the patch as it exists > is flawed because it uses controldata.chkpnt_nxtmulti to set values from > set_frozenxids(), because the value is accessible, but uses zero in > pg_dump and pg_dumpall for pre-9.3 old clusters. :-( Bruce and I discussed this on IM and I think we have reached a conclusion on what needs to be done: * When upgrading from 9.2 or older, all tables need to have relminmxid set to oldestMulti. However, since pg_dump --binary-upgrade cannot extract useful values from the catalog, we will need to have the schema load create all tables with relminmxid=0. A subsequent UPDATE will fix the values. In this case, each database' datminmxid value is going to be set to pg_control's oldestMulti. If I recall correctly, oldestMulti is computed as nextMulti-1. * When upgrading from 9.3 or newer, the relminmxid values from the old cluster must be preserved. Also, datminmxid is going to be preserved. Finally, there is the question of what to do if the database has already been upgraded and thus the tables are all at relminmxid=1. As far as I can tell, if the original value of nextMulti was below 2^31, there should be no issue because vacuuming would advance the value normally. 
If the original value was beyond that point, then vacuum would have been bleating all along about the wraparound point. In this case, I think it should be enough to UPDATE the pg_class values to the current oldestMulti value from pg_control, but I haven't tested this. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
I forgot to add: Many thanks, Bruce, for working on this patch. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jul 1, 2014 at 03:01:06PM -0400, Alvaro Herrera wrote: > Bruce Momjian wrote: > > > What I am not sure about is how to set values from pre-9.3 clusters, and > > whether 9.3 pg_upgrade upgrades from pre-9.3 clusters are a problem. > > Are users who used pg_upgrade to go to 9.4 beta in trouble? > > > > I also have no way to know what value to use for pre-9.3 clusters --- I > > used controldata.chkpnt_nxtmulti in pg_upgrade (because the value was > > accessible), but 0 in pg_dump/pg_dumpall, like we already do for frozen > > xid values, but that usage is for major versions that pg_upgrade doesn't > > support, so it might be the wrong default. I am thinking that should be > > using controldata.chkpnt_nxtmulti, which exists back to 8.4, but I have > > no access to that value from pg_dump. In fact, the patch as it exists > > is flawed because it uses controldata.chkpnt_nxtmulti to set values from > > set_frozenxids(), because the value is accessible, but uses zero in > > pg_dump and pg_dumpall for pre-9.3 old clusters. :-( > > Bruce and I discussed this on IM and I think we have reached a > conclusion on what needs to be done: > > * When upgrading from 9.2 or older, all tables need to have relminmxid > set to oldestMulti. However, since pg_dump --binary-upgrade cannot > extract useful values from the catalog, we will need to have the > schema load create all tables with relminmxid=0. A subsequent UPDATE > will fix the values. OK, the updated attached patch does this. It repurposes set_frozenxids(). > In this case, each database' datminmxid value is going to be set to > pg_control's oldestMulti. > > If I recall correctly, oldestMulti is computed as nextMulti-1. We don't have oldestMulti in pg_controldata in pre-9.3, so we have to use nextMulti-1. > * When upgrading from 9.3 or newer, the relminmxid values from the old > cluster must be preserved. Also, datminmxid is going to be preserved. Yes, that was already in the patch. 
> Finally, there is the question of what to do if the database has already > been upgraded and thus the tables are all at relminmxid=1. As far as I > can tell, if the original value of nextMulti was below 2^31, there > should be no issue because vacuuming would advance the value normally. > If the original value was beyond that point, then vacuum would have been > bleating all along about the wraparound point. In this case, I think it > should be enough the UPDATE the pg_class values to the current > oldestMulti value from pg_control, but I haven't tested this. Well, we are already having users run a query for the 9.3.X minor version upgrade to optionally remove the 0000 file. Is there something else they should run to test for this? We certainly could check for files >= 8000, but I am not sure that is sufficient. We would then need them to somehow update all the database/relation minmxid fields, and I am not even sure what value we should set them to. Is that something we want to publish? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
On Wed, Jul 2, 2014 at 11:56:34AM -0400, Alvaro Herrera wrote: > > How are we not creating a gap in members files by setting the members > > value? Why will that not cause problems? > > We're not setting the offset and member value here; we're setting the > "nextMulti" and "oldestMulti" values. Both affect the offsets counter, > not the members counter. The members counter is kept at zero, so the > next multi to be allocated will create members starting from the first > members position in page zero. initdb of the new cluster created the > first members page, which corresponds to the first members value that > will be used. > > Note pg_resetxlog --help says: > > -m MXID,MXID set next and oldest multitransaction ID > > I think you're confusing the fact that we pass two values here (next and > oldest, both apply to offset counters) with offsets vs. members. > > To affect the members counter you would use this other pg_resetxlog switch: > -O OFFSET set next multitransaction offset > which, AFAICS, we only use when upgrading from a 9.3+ cluster to a newer > 9.3+ cluster (and we also copy the files from old cluster to the new > one). OK, thanks for the analysis. Attached patch applied back through 9.3. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Bruce Momjian wrote: > On Fri, Jun 27, 2014 at 05:05:21PM -0400, Alvaro Herrera wrote: > > > Do they need to remove the members/0000 file too? > > > > No, the members file shouldn't be removed; if they have used multixacts > > at all since upgrading, the members/0000 file contains the members of > > the first bunch of multis. Removing it would remove valid data. (Also > > I still think that removing the members/0000 file that initdb creates is > > a mistake.) > > We never remove offset or members files with user data; we are only > discussing the removal of empty files created by initdb. Can you > explain why we would remove the offsets file but retain the members file > created by initdb when we are _not_ preserving files from the old > cluster? We are certainly setting the offset and member value from the > old cluster: > > exec_prog(UTILITY_LOG_FILE, NULL, true, > "\"%s/pg_resetxlog\" -m %u,%u \"%s\"", > new_cluster.bindir, > old_cluster.controldata.chkpnt_nxtmulti + 1, > old_cluster.controldata.chkpnt_nxtmulti, > new_cluster.pgdata); > > How are we not creating a gap in members files by setting the members > value? Why will that not cause problems? We're not setting the offset and member value here; we're setting the "nextMulti" and "oldestMulti" values. Both affect the offsets counter, not the members counter. The members counter is kept at zero, so the next multi to be allocated will create members starting from the first members position in page zero. initdb of the new cluster created the first members page, which corresponds to the first members value that will be used. Note pg_resetxlog --help says: -m MXID,MXID set next and oldest multitransaction ID I think you're confusing the fact that we pass two values here (next and oldest, both apply to offset counters) with offsets vs. members. 
To affect the members counter you would use this other pg_resetxlog switch: -O OFFSET set next multitransaction offset which, AFAICS, we only use when upgrading from a 9.3+ cluster to a newer 9.3+ cluster (and we also copy the files from old cluster to the new one). -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jul 1, 2014 at 03:30:47PM -0400, Bruce Momjian wrote: > > * When upgrading from 9.2 or older, all tables need to have relminmxid > > set to oldestMulti. However, since pg_dump --binary-upgrade cannot > > extract useful values from the catalog, we will need to have the > > schema load create all tables with relminmxid=0. A subsequent UPDATE > > will fix the values. > > OK, the updated attached patch does this. It repurposes > set_frozenxids(). OK, patch applied back through 9.3. This leaves the two queries users are going to have to run in 9.3.X. The first will tell users if they should remove offsets/0000:

        WITH list(file) AS
        (
                SELECT * FROM pg_ls_dir('pg_multixact/offsets')
        )
        SELECT EXISTS (SELECT * FROM list WHERE file = '0000') AND
               NOT EXISTS (SELECT * FROM list WHERE file = '0001') AND
               NOT EXISTS (SELECT * FROM list WHERE file = 'FFFF') AND
               EXISTS (SELECT * FROM list WHERE file != '0000')
               AS file_0000_removal_required;

The second will tell them if they should go to a wiki page that will show them how to set database and relation minmxid values:

        WITH list(file) AS
        (
                SELECT * FROM pg_ls_dir('pg_multixact/offsets')
        )
        SELECT EXISTS (SELECT * FROM list WHERE file ~ '^[8-9A-F]') AND
               EXISTS (SELECT * FROM pg_database WHERE datminmxid = 1)
               AS update_database_and_relation_minmxid;

I will work on the wiki page now. I think this issue is now ready for 9.3.X, once we write the wiki page and release note text. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
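The second query's predicate can likewise be sketched outside SQL (the inputs below are illustrative; the real check reads pg_ls_dir() and pg_database):

```python
import re

# Sketch of the second check: offsets segments named 8000..FFFF (first hex
# digit 8 through F) mean the multixact counter is past 2^31, and a
# datminmxid of 1 in any database means pg_upgrade left the bogus
# initialized value in place. Inputs here are made up for illustration.

def minmxid_update_needed(offset_files, datminmxids):
    past_wraparound_point = any(re.match(r'^[8-9A-F]', f) for f in offset_files)
    return past_wraparound_point and 1 in datminmxids

print(minmxid_update_needed(['8000', '8001'], [1, 1, 5]))  # True
print(minmxid_update_needed(['0001', '0002'], [1]))        # False
```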
On Wed, Jul 2, 2014 at 03:32:55PM -0400, Bruce Momjian wrote: > The second will tell them if they should go to a wiki page that will > show them how to set database and relation minmxid values: > > WITH list(file) AS > ( > SELECT * FROM pg_ls_dir('pg_multixact/offsets') > ) > SELECT EXISTS (SELECT * FROM list WHERE file ~ '^[8-9A-F]') AND > EXISTS (SELECT * FROM pg_database WHERE datminmxid = 1) > AS update_database_and_relation_minmxid; > > I will work on the wiki page now. > > I think this issue is now ready for 9.3.X, once we write the wiki page > and release note text. I have completed the wiki page for the second bug fix: https://wiki.postgresql.org/wiki/20140702pg_upgrade_fix The first bug fix just requires removal of the '0000' file, and I think can be fully explained in the release notes. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Bruce Momjian <bruce@momjian.us> writes: > On Tue, Jul 1, 2014 at 03:01:06PM -0400, Alvaro Herrera wrote: >> Finally, there is the question of what to do if the database has already >> been upgraded and thus the tables are all at relminmxid=1. As far as I >> can tell, if the original value of nextMulti was below 2^31, there >> should be no issue because vacuuming would advance the value normally. >> If the original value was beyond that point, then vacuum would have been >> bleating all along about the wraparound point. In this case, I think it >> should be enough the UPDATE the pg_class values to the current >> oldestMulti value from pg_control, but I haven't tested this. > Well, we are already having users run a query for the 9.3.X minor > version upgrade to optionally remove the 0000 file. Is there something > else they should run to test for this? We certainly could check for > files >= 8000, but I am not sure that is sufficient. We would then need > them to somehow update all the database/relation minmxid fields, and I > am not even sure what value we should set it to. Is that something we > want to publish? I started transcribing Bruce's proposed fix procedure at https://wiki.postgresql.org/wiki/20140702pg_upgrade_fix into the release notes, but I'm afraid it's all wet. He's suggesting copying the last checkpoint's NextMultiXactId into datminmxid/relminmxid, which is surely the wrong thing: that's likely to be newer than all mxids in the tables, not older than them. I thought at first that this was a simple thinko and he meant to write oldestMultiXid, but here's the thing: if we're in the situation where we've got wraparound, isn't oldestMultiXid going to be 1? The value recorded in the checkpoint isn't magic, it's just going to be extracted from whatever's in pg_database; and the whole problem here is that we can't trust that data. Where can we get a useful lower bound from? 
I'm a bit inclined to not say anything about fix procedures in the release notes, because I'm not sure that this is a problem in the field. If anybody did have a wraparound they'd be getting bleats from VACUUM, and no one has reported any such thing that I've heard. regards, tom lane
I wrote: > I started transcribing Bruce's proposed fix procedure at > https://wiki.postgresql.org/wiki/20140702pg_upgrade_fix > into the release notes, but I'm afraid it's all wet. > He's suggesting copying the last checkpoint's NextMultiXactId into > datminmxid/relminmxid, which is surely the wrong thing: that's likely to > be newer than all mxids in the tables, not older than them. I thought at > first that this was a simple thinko and he meant to write oldestMultiXid, > but here's the thing: if we're in the situation where we've got > wraparound, isn't oldestMultiXid going to be 1? The value recorded in the > checkpoint isn't magic, it's just going to be extracted from whatever's in > pg_database; and the whole problem here is that we can't trust that data. > Where can we get a useful lower bound from? Ugh: it's worse than that. pg_upgrade itself is using this utterly nonsensical logic to set datminmxid/relminmxid. This is a stop-ship issue for 9.3.5. After some reflection it seems to me that we could estimate oldestmxid for a pre-9.3 source cluster as the NextMultiXactId from its pg_control less 2000000000 or so. This will nearly always be much older than the actual oldest mxid, but that's okay --- the next vacuuming cycle will advance the datminmxid/relminmxid values to match reality, so long as they aren't wrapped around already. Note that there's already an assumption baked into pg_upgrade that 2E9 xids or mxids back is safely past the oldest actual data; see where it sets autovacuum_freeze_max_age and autovacuum_multixact_freeze_max_age while starting the new cluster. (Hm ... I guess "2000000000 or so" actually needs to be a bit less than that, otherwise autovacuum might kick off while we're munging the new cluster.) We could recommend the same estimate in the instructions about cleaning up a previous pg_upgrade by hand. regards, tom lane
On 2014-07-20 13:37:01 -0400, Tom Lane wrote: > Bruce Momjian <bruce@momjian.us> writes: > > On Tue, Jul 1, 2014 at 03:01:06PM -0400, Alvaro Herrera wrote: > >> Finally, there is the question of what to do if the database has already > >> been upgraded and thus the tables are all at relminmxid=1. As far as I > >> can tell, if the original value of nextMulti was below 2^31, there > >> should be no issue because vacuuming would advance the value normally. > >> If the original value was beyond that point, then vacuum would have been > >> bleating all along about the wraparound point. In this case, I think it > >> should be enough the UPDATE the pg_class values to the current > >> oldestMulti value from pg_control, but I haven't tested this. > > > Well, we are already having users run a query for the 9.3.X minor > > version upgrade to optionally remove the 0000 file. Is there something > > else they should run to test for this? We certainly could check for > > files >= 8000, but I am not sure that is sufficient. We would then need > > them to somehow update all the database/relation minmxid fields, and I > > am not even sure what value we should set it to. Is that something we > > want to publish? > > I started transcribing Bruce's proposed fix procedure at > https://wiki.postgresql.org/wiki/20140702pg_upgrade_fix > into the release notes, but I'm afraid it's all wet. I don't understand why we should do anything but remove the 0000 file if it's all zeroes? This seems far too complicated. Beside the fact that I doubt it's actually achieving anything reliably? > I'm a bit inclined to not say anything about fix procedures in the release > notes, because I'm not sure that this is a problem in the field. If > anybody did have a wraparound they'd be getting bleats from VACUUM, and no > one has reported any such thing that I've heard. There actually have been a couple reports about the general problem I think - reacting to one was how I noticed the bug. 
Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > On 2014-07-20 13:37:01 -0400, Tom Lane wrote: >> I started transcribing Bruce's proposed fix procedure at >> https://wiki.postgresql.org/wiki/20140702pg_upgrade_fix >> into the release notes, but I'm afraid it's all wet. > I don't understand why we should do anything but remove the 0000 file if > it's all zeroes? This seems far too complicated. Beside the fact that I > doubt it's actually achieving anything reliably? This one is not about the extra 0000 file. It's about whether datminmxid and relminmxid are wrong. In the previous coding of pg_upgrade, they'd have been left at "1" even if that value has wrapped around into the future. Now, if you were lucky, you'd have no bad side-effects even if they had wrapped around ... but if you're not so lucky, you'd get failures due to accesses to nonexistent pg_multixact files, after vacuum has truncated pg_multixact in the mistaken impression that there are no remaining references to old mxids. The solution I just proposed of forcing the minmxid values down to "now - 2E9" is not that great btw; I think it will result in autovacuum deciding it has to scan everything in sight, which is the sort of intensive operation that pg_upgrade was intended to avoid. I'm not sure we have any alternatives though. regards, tom lane
On 2014-07-20 16:16:18 -0400, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > On 2014-07-20 13:37:01 -0400, Tom Lane wrote: > >> I started transcribing Bruce's proposed fix procedure at > >> https://wiki.postgresql.org/wiki/20140702pg_upgrade_fix > >> into the release notes, but I'm afraid it's all wet. > > > I don't understand why we should do anything but remove the 0000 file if > > it's all zeroes? This seems far too complicated. Beside the fact that I > > doubt it's actually achieving anything reliably? > > This one is not about the extra 0000 file. It's about whether datminmxid > and relminmxid are wrong. In the previous coding of pg_upgrade, they'd > have been left at "1" even if that value has wrapped around into the > future. Yea, I'm rereading the thread atm. I'd stopped following it after a while and didn't notice the drift into a second problem. > Now, if you were lucky, you'd have no bad side-effects even if they had > wrapped around ... Hm. If relminmxid = 1 is still in the past everything should be fine, right? The next vacuum (which will run soon) will just set it anew. But if not /* relminmxid must never go backward, either */ if (MultiXactIdIsValid(minmulti) && MultiXactIdPrecedes(pgcform->relminmxid, minmulti)) { pgcform->relminmxid = minmulti; dirty = true; } will prevent it from being changed. Similarly /* ditto */ if (MultiXactIdPrecedes(dbform->datminmxid, newMinMulti)) { dbform->datminmxid = newMinMulti; dirty = true; } will prevent datminmxid from becoming sane. vac_truncate_clog() will be called with ReadNextMultiXactId() - - autovacuum_multixact_freeze_max_age as the truncation for multis. That itself would be fine since there won't be any actual multis in that range. The problem is that we might miss having to do a full table vacuum because we check relminmxid to determine that: scan_all |= MultiXactIdPrecedesOrEquals(onerel->rd_rel->relminmxid, mxactFullScanLimit); which then could cause problems in the future. 
Imo that means we need to fix this :( for existing clusters. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On 2014-07-20 15:55:25 -0400, Tom Lane wrote: > I wrote: > > I started transcribing Bruce's proposed fix procedure at > > https://wiki.postgresql.org/wiki/20140702pg_upgrade_fix > > into the release notes, but I'm afraid it's all wet. > > > He's suggesting copying the last checkpoint's NextMultiXactId into > > datminmxid/relminmxid, which is surely the wrong thing: that's likely to > > be newer than all mxids in the tables, not older than them. I thought at > > first that this was a simple thinko and he meant to write oldestMultiXid, > > but here's the thing: if we're in the situation where we've got > > wraparound, isn't oldestMultiXid going to be 1? The value recorded in the > > checkpoint isn't magic, it's just going to be extracted from whatever's in > > pg_database; and the whole problem here is that we can't trust that data. > > Where can we get a useful lower bound from? > > Ugh: it's worse than that. pg_upgrade itself is using this utterly > nonsensical logic to set datminmxid/relminmxid. This is a stop-ship > issue for 9.3.5. > > After some reflection it seems to me that we could estimate oldestmxid for > a pre-9.3 source cluster as the NextMultiXactId from its pg_control less > 2000000000 or so. This will nearly always be much older than the actual > oldest mxid, but that's okay --- the next vacuuming cycle will advance the > datminmxid/relminmxid values to match reality, so long as they aren't > wrapped around already. I think NextMultiXactId should be fine when upgrading from pre 9.3 clusters. Pre 9.3 multis that have been written before a shutdown don't really have a meaning: To the point that we actually never look at them. MultiXactIdSetOldestVisible() would have set oldestMXact to MultiXactState->nextMXact if no backend has any active ones. GetMultiXactIdMembers et al will deal with that. So I think for pg_upgrade itself using NextMulti as a replacement for oldestMXact is fine. 
Now that obviously does *not* mean it's safe to do that to after the cluster has already been running 9.3+. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > On 2014-07-20 16:16:18 -0400, Tom Lane wrote: >> This one is not about the extra 0000 file. It's about whether datminmxid >> and relminmxid are wrong. In the previous coding of pg_upgrade, they'd >> have been left at "1" even if that value has wrapped around into the >> future. > Yea, I'm rereading the thread atm. I'd stopped following it after a > while and didn't notice the drift into a second problem. I've been doing more analysis on this. It looks to me like most people would not notice a problem, but ... 1. When updating from a pre-9.3 version, pg_upgrade sets pg_control's oldestMultiXid to equal the old cluster's NextMultiXactId (with a useless increment of the new NextMultiXactId, but that's not too relevant). This might seem silly, because there are very probably mxids out there that are before this value, but the multixact code is designed to assume that LOCKED_ONLY mxids before oldestMultiXid are not running --- without ever going to pg_multixact to check. So in fact, there will be no attempts to access on-disk data about old mxids; at least as long as the mxid counters haven't wrapped around. 2. However, pg_upgrade also sets datminmxid/relminmxid to equal the old cluster's NextMultiXactId. The trouble with this is that it might fool (auto)vacuum into never seeing and freezing the pre-upgrade mxids; they're in the table but the metadata says not, so we'd not force a full table scan to find them. If such a mxid manages to survive past wraparound, it'll be "in the future" and the multixact code will start complaining about it, like this: if (!MultiXactIdPrecedes(multi, nextMXact)) ereport(ERROR, (errcode(ERRCODE_INTERNAL_ERROR), errmsg("MultiXactId %u has not been created yet -- apparent wraparound", multi))); 3. In practice, because full-table vacuum scans get forced periodically anyway to advance relminxid, it's entirely likely that old mxids would get frozen before there was a problem. 
vacuum will freeze an old mxid on sight, whatever the reason for visiting it. 4. The patch Bruce applied to initialize datminmxid/relminmxid to the old NextMultiXactId rather than 1 does not fundamentally change anything here. It narrows the window in which wraparound can cause problems, but only by the distance that "1" is in-the-future at the time of upgrade. Indeed, you could argue that it makes things worse, because it *guarantees* that vacuum is unaware of old mxids existing in the table, whereas with the "1" you had no such problem unless you were more than halfway to the wrap point. What we've effectively got ATM, with either the patched or unpatched code, is that you're safe to the extent that relminxid-driven freezing happens often enough to get rid of mxids before they wrap. I note that that's exactly the situation we had pre-9.3. That being the case, and in view of the lack of complaints from the field about it, I wonder whether we aren't greatly overreacting. If the alternative is that pg_upgrade forces cluster-wide freezing activity to occur immediately upon system startup, I'd definitely say that the cure is worse than the disease. regards, tom lane
On 2014-07-20 17:22:48 -0400, Tom Lane wrote: > 2. However, pg_upgrade also sets datminmxid/relminmxid to equal the old > cluster's NextMultiXactId. The trouble with this is that it might fool > (auto)vacuum into never seeing and freezing the pre-upgrade mxids; they're > in the table but the metadata says not, so we'd not force a full table > scan to find them. We effectively can't actually assume those are going to be vacuumed away anyway. There might have been 2**32 mxids in the older cluster already - no value of datminmxid/relminmxid can protect us against that. > 4. The patch Bruce applied to initialize datminmxid/relminmxid to the old > NextMultiXactId rather than 1 does not fundamentally change anything here. > It narrows the window in which wraparound can cause problems, but only by > the distance that "1" is in-the-future at the time of upgrade. I think it's actually more than that. Consider what happens if pg_upgrade has used pg_resetxlog to set nextMulti to > 2^31. If rel/datminmxid are set to 1 regardless vac_update_relstats() and vac_update_datfrozenxid() won't increase them anymore because of: /* relminmxid must never go backward, either */ if (MultiXactIdIsValid(minmulti) && MultiXactIdPrecedes(pgcform->relminmxid, minmulti)) { pgcform->relminmxid = minmulti; dirty = true; } And that can actually cause significant problems once 9.3+ creates new multis because they'll never get vacuumed away but still do get truncated. If it's an updating multi xmax that can effectively make the row unreadable - not just block updates. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > On 2014-07-20 17:22:48 -0400, Tom Lane wrote: >> 4. The patch Bruce applied to initialize datminmxid/relminmxid to the old >> NextMultiXactId rather than 1 does not fundamentally change anything here. >> It narrows the window in which wraparound can cause problems, but only by >> the distance that "1" is in-the-future at the time of upgrade. > I think it's actually more than that. Consider what happens if > pg_upgrade has used pg_resetxlog to set nextMulti to > 2^31. If > rel/datminmxid are set to 1 regardless vac_update_relstats() and > vac_update_datfrozenxid() won't increase them anymore because of: > /* relminmxid must never go backward, either */ > if (MultiXactIdIsValid(minmulti) && > MultiXactIdPrecedes(pgcform->relminmxid, minmulti)) > { > pgcform->relminmxid = minmulti; > dirty = true; > } > And that can actually cause significant problems once 9.3+ creates new > multis because they'll never get vacuumed away but still do get > truncated. If it's an updating multi xmax that can effectively make the > row unreadable - not just block updates. No, I don't think so. Truncation is driven off oldestMultiXid from pg_control, not from relminmxid. The only thing in-the-future values of those will do to us is prevent autovacuum from thinking it must do a full table scan. (In particular, in-the-future values do not cause oldestMultiXid to get advanced, because we're always looking for the oldest value not the newest.) But in any case, we both agree that setting relminmxid to equal nextMulti is completely unsafe in a 9.3 cluster that's already been up. So the proposed fix instructions are certainly wrong. regards, tom lane
On 2014-07-20 17:43:04 -0400, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > On 2014-07-20 17:22:48 -0400, Tom Lane wrote: > >> 4. The patch Bruce applied to initialize datminmxid/relminmxid to the old > >> NextMultiXactId rather than 1 does not fundamentally change anything here. > >> It narrows the window in which wraparound can cause problems, but only by > >> the distance that "1" is in-the-future at the time of upgrade. > > > I think it's actually more than that. Consider what happens if > > pg_upgrade has used pg_resetxlog to set nextMulti to > 2^31. If > > rel/datminmxid are set to 1 regardless vac_update_relstats() and > > vac_update_datfrozenxid() won't increase them anymore because of: > > /* relminmxid must never go backward, either */ > > if (MultiXactIdIsValid(minmulti) && > > MultiXactIdPrecedes(pgcform->relminmxid, minmulti)) > > { > > pgcform->relminmxid = minmulti; > > dirty = true; > > } > > > And that can actually cause significant problems once 9.3+ creates new > > multis because they'll never get vacuumed away but still do get > > truncated. If it's an updating multi xmax that can effectively make the > > row unreadable - not just block updates. > > No, I don't think so. Truncation is driven off oldestMultiXid from > pg_control, not from relminmxid. The only thing in-the-future values of > those will do to us is prevent autovacuum from thinking it must do a full > table scan. (In particular, in-the-future values do not cause > oldestMultiXid to get advanced, because we're always looking for the > oldest value not the newest.) Right. But that's the problem. If oldestMulti is set to, say, 3000000000 by pg_resetxlog during pg_upgrade but *minmxid = 1 those tables won't be full tables scanned because of multixacts. But vac_truncate_clog() will SetMultiXactIdLimit(minMulti, minmulti_datoid); regardless. 
Note that it'll not notice the limit of other databases in this case because vac_truncate_clog() will effectively use the in memory GetOldestMultiXactId() and check if other databases are before that. But there won't be any because they all appear in the future. Due to that the next checkpoint will truncate the clog to the cutoff multi xid used by the last vacuum. Am I missing something? > But in any case, we both agree that setting relminmxid to equal nextMulti > is completely unsafe in a 9.3 cluster that's already been up. So the > proposed fix instructions are certainly wrong. Right. I'm pondering what to do about it instead. The best idea I have is something like: 1) Jot down pg_controldata|grep NextMultiXactId 2) kill/wait for all existing transactions to end 3) vacuum all databases with vacuum_multixact_freeze_min_age=0. That'll get rid of all old appearing multis 4) Update pg_class to set relminmxid=value from 1), same with pg_database But that sucks and doesn't deal with all the problems :( Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > On 2014-07-20 17:43:04 -0400, Tom Lane wrote: >> No, I don't think so. Truncation is driven off oldestMultiXid from >> pg_control, not from relminmxid. The only thing in-the-future values of >> those will do to us is prevent autovacuum from thinking it must do a full >> table scan. (In particular, in-the-future values do not cause >> oldestMultiXid to get advanced, because we're always looking for the >> oldest value not the newest.) > Right. But that's the problem. If oldestMulti is set to, say, 3000000000 > by pg_resetxlog during pg_upgrade but *minmxid = 1 those tables won't be > full tables scanned because of multixacts. But vac_truncate_clog() will > SetMultiXactIdLimit(minMulti, minmulti_datoid); > regardless. > Note that it'll not notice the limit of other databases in this case > because vac_truncate_clog() will effectively use the in memory > GetOldestMultiXactId() and check if other databases are before that. But > there won't be any because they all appear in the future. Due to that > the next checkpoint will truncate the clog to the cutoff multi xid used > by the last vacuum. Right. > Am I missing something? My point is that the cutoff multi xid won't be new enough to remove non-LOCKED_ONLY (ie, post-9.3) mxids. >> But in any case, we both agree that setting relminmxid to equal nextMulti >> is completely unsafe in a 9.3 cluster that's already been up. So the >> proposed fix instructions are certainly wrong. > Right. I'm pondering what to do about it instead. The best idea I have > is something like: > 1) Jot down pg_controldata|grep NextMultiXactId > 2) kill/wait for all existing transactions to end > 3) vacuum all databases with vacuum_multixact_freeze_min_age=0. That'll > get rid of all old appearing multis > 4) Update pg_class to set relminmxid=value from 1), same with > pg_database > But that sucks and doesn't deal with all the problems :( Yeah. 
At this point I'm of the opinion that we should not recommend any manual corrective actions for this issue. They're likely to do more harm than good, especially if the user misses or fat-fingers any steps. I'm also thinking that the lack of any complaints suggests there are few or no existing installations with nextMulti past 2^31, anyhow. If it were even past 400000000 (default autovacuum_multixact_freeze_max_age), we'd have been hearing howls of anguish about full-database freezing scans occurring immediately after a pg_upgrade (thanks to minmxid = 1 being old enough to trigger that). So the way I've documented this patch in the draft release notes is that it prevents the latter problem. regards, tom lane
On 2014-07-20 18:16:51 -0400, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > On 2014-07-20 17:43:04 -0400, Tom Lane wrote: > >> No, I don't think so. Truncation is driven off oldestMultiXid from > >> pg_control, not from relminmxid. The only thing in-the-future values of > >> those will do to us is prevent autovacuum from thinking it must do a full > >> table scan. (In particular, in-the-future values do not cause > >> oldestMultiXid to get advanced, because we're always looking for the > >> oldest value not the newest.) > > > Right. But that's the problem. If oldestMulti is set to, say, 3000000000 > > by pg_resetxlog during pg_upgrade but *minmxid = 1 those tables won't be > > full tables scanned because of multixacts. But vac_truncate_clog() will > > SetMultiXactIdLimit(minMulti, minmulti_datoid); > > regardless. > > > Note that it'll not notice the limit of other databases in this case > > because vac_truncate_clog() will effectively use the in memory > > GetOldestMultiXactId() and check if other databases are before that. But > > there won't be any because they all appear in the future. Due to that > > the next checkpoint will truncate the clog to the cutoff multi xid used > > by the last vacuum. > > Right. > > > Am I missing something? > > My point is that the cutoff multi xid won't be new enough to remove > non-LOCKED_ONLY (ie, post-9.3) mxids. Why not? Afaics this will continue to happen until multixacts are wrapped around once? So the cutoff multi will be new enough for that at some point after the pg_upgrade? Luckily in most cases full table vacuums triggered due to normal xids will prevent bad problems though. There have been a couple reports where people included pg_controldata output indicating crazy rates of multixid consumption but I think none of those were crazy enough to burn multis so fast that important ones get truncated before a full table vacuum occurs due to normal xids. 
> >> But in any case, we both agree that setting relminmxid to equal nextMulti > >> is completely unsafe in a 9.3 cluster that's already been up. So the > >> proposed fix instructions are certainly wrong. > > > Right. I'm pondering what to do about it instead. The best idea I have > > is something like: > > 1) Jot down pg_controldata|grep NextMultiXactId > > 2) kill/wait for all existing transactions to end > > 3) vacuum all databases with vacuum_multixact_freeze_min_age=0. That'll > > get rid of all old appearing multis > > 4) Update pg_class to set relminmxid=value from 1), same with > > pg_database > > > But that sucks and doesn't deal with all the problems :( > > Yeah. At this point I'm of the opinion that we should not recommend any > manual corrective actions for this issue. They're likely to do more harm > than good, especially if the user misses or fat-fingers any steps. I don't really see us coming up with something robust in time :/. It's a bit sad, but maybe we should recommend contacting the mailing list if pg_upgrade has been used and nextMulti is above 2^31? Btw, we really should have a txid_current() equivalent for multis... > I'm also thinking that the lack of any complaints suggests there are few > or no existing installations with nextMulti past 2^31, anyhow. If it were > even past 400000000 (default autovacuum_multixact_freeze_max_age), we'd > have been hearing howls of anguish about full-database freezing scans > occurring immediately after a pg_upgrade (thanks to minmxid = 1 being old > enough to trigger that). I think people just chalk that up to 'crazy pg vacuuming behaviour' and not investigate further. At least that's my practical experience :( Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > On 2014-07-20 18:16:51 -0400, Tom Lane wrote: >> My point is that the cutoff multi xid won't be new enough to remove >> non-LOCKED_ONLY (ie, post-9.3) mxids. > Why not? Afaics this will continue to happen until multixacts are > wrapped around once? So the cutoff multi will be new enough for that at > some point after the pg_upgrade? Before that happens, nextMultiXid will catch up with the minmxid = 1 values, and they'll be in the past, and then we're at the same point that we're at to begin with if we used 9.3.5 pg_upgrade. > Luckily in most cases full table vacuums triggered due to normal xids > will prevent bad problems though. Yeah. While it's not that comfortable to rely on that, we were reliant on that effect in every pre-9.3 branch, so I'm not terribly upset about it continuing to be the case in existing 9.3 installations. As long as they're not consuming mxids at a rate much greater than xids, they'll be all right; and things will be unconditionally safe once the freezing process has advanced far enough. regards, tom lane
On 2014-07-20 18:39:06 -0400, Tom Lane wrote: > Andres Freund <andres@2ndquadrant.com> writes: > > On 2014-07-20 18:16:51 -0400, Tom Lane wrote: > >> My point is that the cutoff multi xid won't be new enough to remove > >> non-LOCKED_ONLY (ie, post-9.3) mxids. > > > Why not? Afaics this will continue to happen until multixacts are > > wrapped around once? So the cutoff multi will be new enough for that at > > some point after the pg_upgrade? > > Before that happens, nextMultiXid will catch up with the minmxid = 1 > values, and they'll be in the past, and then we're at the same point > that we're at to begin with if we used 9.3.5 pg_upgrade. There'll be problems earlier than that afaics. Consider what happens if a pg_upgrade happened at NextMulti=3000000000 which will have set minmxid=1. If, after that, a couple 100k multis have been used, but fewer than autovacuum_freeze_max_age normal xids were consumed we're in an unfortunate situation. If any *single* full table vacuum after that calls vac_update_datfrozenxid() which just needs its datfrozenxid advance by one we're in trouble: vac_truncate_clog() will be called with minMulti = GetOldestMultiXactId(). Note that the latter only returns the oldest *running* multi. Because vac_truncate_clog() only finds 1s in datfrozenxid it'll ignore all of them because of: if (MultiXactIdPrecedes(dbform->datminmxid, minMulti)) { minMulti = dbform->datminmxid; minmulti_datoid = HeapTupleGetOid(tuple); } and calls SetMultiXactIdLimit(minMulti, minmulti_datoid); If the vacuum happened without a concurrent backend using multis active this will set the limit to the *current* NextMulti. Then TruncateMultiXact() called after the checkpoint will truncate away everything up to the current NextMulti essentially corrupting rows created in 9.3+. > > Luckily in most cases full table vacuums triggered due to normal xids > > will prevent bad problems though. > > Yeah. 
While it's not that comfortable to rely on that, we were reliant > on that effect in every pre-9.3 branch, so I'm not terribly upset about > it continuing to be the case in existing 9.3 installations. Well, there's a pretty fundamental distinction to < 9.3: The worst that could happen before is a transient error while *writing*. Now it can happen during reading. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > If any *single* full table vacuum after that calls > vac_update_datfrozenxid() which just needs its datfrozenxid advance by > one we're in trouble: vac_truncate_clog() will be called with minMulti = > GetOldestMultiXactId(). Uh, no, the cutoff is GetOldestMultiXactId minus vacuum_multixact_freeze_min_age (or the autovac equivalent). See vacuum_set_xid_limits. regards, tom lane
I wrote: > Andres Freund <andres@2ndquadrant.com> writes: >> If any *single* full table vacuum after that calls >> vac_update_datfrozenxid() which just needs its datfrozenxid advance by >> one we're in trouble: vac_truncate_clog() will be called with minMulti = >> GetOldestMultiXactId(). > Uh, no, the cutoff is GetOldestMultiXactId minus > vacuum_multixact_freeze_min_age (or the autovac equivalent). Oh, wait, I see what you're on about: that cutoff doesn't constrain the global value computed by vac_truncate_clog. Hmm. I wonder whether we should change vac_update_relstats so that it only applies the "relminxid mustn't go backwards" rule as long as relminxid is sane, ie, not in the future. If it is in the future, forcibly update it to the cutoff we actually used. Likewise for relminmxid. And I guess we'd need a similar rule for updating datmin(m)xid. regards, tom lane
On 2014-07-20 19:04:30 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > If any *single* full table vacuum after that calls
> > vac_update_datfrozenxid() which just needs its datfrozenxid advance by
> > one we're in trouble: vac_truncate_clog() will be called with minMulti =
> > GetOldestMultiXactId().
>
> Uh, no, the cutoff is GetOldestMultiXactId minus
> vacuum_multixact_freeze_min_age (or the autovac equivalent).
> See vacuum_set_xid_limits.

Unfortunately not :(. The stuff computed in vacuum_set_xid_limits isn't
used for truncation - which normally is sane because another database
might be further behind.

    void
    vac_update_datfrozenxid(void)
    {
        ...
        MultiXactId newMinMulti;
        ...
        /*
         * Similarly, initialize the MultiXact "min" with the value that would be
         * used on pg_class for new tables.  See AddNewRelationTuple().
         */
        newMinMulti = GetOldestMultiXactId();
        ...
        while ((classTup = systable_getnext(scan)) != NULL)
        {
            ...
            if (MultiXactIdPrecedes(classForm->relminmxid, newMinMulti))
                newMinMulti = classForm->relminmxid;
        }
        ...
        if (dirty || ForceTransactionIdLimitUpdate())
            vac_truncate_clog(newFrozenXid, newMinMulti);
    }

But even if it were, I don't really see how that changes anything?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2014-07-20 19:11:40 -0400, Tom Lane wrote:
> I wrote:
> > Andres Freund <andres@2ndquadrant.com> writes:
> >> If any *single* full table vacuum after that calls
> >> vac_update_datfrozenxid() which just needs its datfrozenxid advance by
> >> one we're in trouble: vac_truncate_clog() will be called with minMulti =
> >> GetOldestMultiXactId().
>
> > Uh, no, the cutoff is GetOldestMultiXactId minus
> > vacuum_multixact_freeze_min_age (or the autovac equivalent).
>
> Oh, wait, I see what you're on about: that cutoff doesn't constrain
> the global value computed by vac_truncate_clog.  Hmm.
>
> I wonder whether we should change vac_update_relstats so that it only
> applies the "relminxid mustn't go backwards" rule as long as relminxid
> is sane, ie, not in the future.  If it is in the future, forcibly update
> it to the cutoff we actually used.  Likewise for relminmxid.  And I guess
> we'd need a similar rule for updating datmin(m)xid.

I'm wondering the same. How would we do that from a concurrency POV for
the pg_database rows? I think we could just accept the race condition
that two xacts move dbform->datminmxid backwards to different values,
since both have to be 'somewhat' correct?

I think this is out of the remit for 9.3.5. At least I don't have the
mental capacity to do this properly till tomorrow afternoon.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-07-20 19:11:40 -0400, Tom Lane wrote:
>> I wonder whether we should change vac_update_relstats so that it only
>> applies the "relminxid mustn't go backwards" rule as long as relminxid
>> is sane, ie, not in the future.  If it is in the future, forcibly update
>> it to the cutoff we actually used.  Likewise for relminmxid.  And I guess
>> we'd need a similar rule for updating datmin(m)xid.

> I'm wondering the same. How would we do that from a concurrency POV for
> the pg_database rows? I think we could just accept the race condition
> that two xacts move dbform->datminmxid backwards to different values
> since both have to be 'somewhat' correct?

Seems no worse than it is today.  Is it even possible for two xacts to
be trying to update that at the same time?

> I think this is out of the remit for 9.3.5. At least I don't have the
> mental capacity to do this properly till tomorrow afternoon.

Doesn't sound that hard to me (and I'm still reasonably awake, which
I bet you're not).  Will take a look.

			regards, tom lane
On 2014-07-20 19:37:15 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2014-07-20 19:11:40 -0400, Tom Lane wrote:
> >> I wonder whether we should change vac_update_relstats so that it only
> >> applies the "relminxid mustn't go backwards" rule as long as relminxid
> >> is sane, ie, not in the future.  If it is in the future, forcibly update
> >> it to the cutoff we actually used.  Likewise for relminmxid.  And I guess
> >> we'd need a similar rule for updating datmin(m)xid.
>
> > I'm wondering the same. How would we do that from a concurrency POV for
> > the pg_database rows? I think we could just accept the race condition
> > that two xacts move dbform->datminmxid backwards to different values
> > since both have to be 'somewhat' correct?
>
> Seems no worse than it is today.

Well, right now it can only advance, never go backwards. I think. But I
don't see how it could matter - all the possible minimum values had
better be correct.

> Is it even possible for two xacts to be trying to update that at the
> same time?

At least I don't see anything prohibiting it. vac_update_datfrozenxid()
is called straight from vacuum()/do_autovacuum(). The only thing that'll
serialize them, as far as I can see, is the buffer lock for the
heap_inplace_update(). Seems odd from a concurrency perspective. Worth a
look someday not too far away.

> > I think this is out of the remit for 9.3.5. At least I don't have the
> > mental capacity to do this properly till tomorrow afternoon.
>
> Doesn't sound that hard to me (and I'm still reasonably awake, which
> I bet you're not).

I'm most definitely not awake, right ;). Especially as it's 32 °C in my
flat - at 2 in the morning...

> Will take a look.

Cool. Will take a look at the result tomorrow.

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes: > On 2014-07-20 19:37:15 -0400, Tom Lane wrote: >>> I wonder whether we should change vac_update_relstats so that it only >>> applies the "relminxid mustn't go backwards" rule as long as relminxid >>> is sane, ie, not in the future. If it is in the future, forcibly update >>> it to the cutoff we actually used. Likewise for relminmxid. And I guess >>> we'd need a similar rule for updating datmin(m)xid. >> I'm wondering the same. Here's a draft patch for this. I think this will fix all cases where the "1" minmxid inserted by previous pg_upgrade versions is actually in the future at the time we run VACUUM. We would still be at risk if it had been in the future when pg_upgrade ran but no longer is now, since that would mean there could be non-lock-only mxids on disk that are older than "1". However, for the reasons discussed upthread, it seems fairly unlikely to me that people would actually get burnt in practice, so I'm satisfied with doing this much and no more. I noticed while poking at it that there was an additional small logic error in vac_update_datfrozenxid: if we decide to update only one of datfrozenxid and datminmxid, the value passed to vac_truncate_clog for the other field was *not* the value stored in pg_database but the value we'd decided not to use. That didn't seem like a good thing; it'd at least result in possibly using MyDatabaseId as oldestxid_datoid or minmulti_datoid for a value that doesn't actually match our database. Barring objections I'll commit this tomorrow. regards, tom lane diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c index 8822a15..3f72d5a 100644 *** a/src/backend/commands/vacuum.c --- b/src/backend/commands/vacuum.c *************** vac_update_relstats(Relation relation, *** 733,751 **** } /* ! * relfrozenxid should never go backward. Caller can pass ! * InvalidTransactionId if it has no new data. */ if (TransactionIdIsNormal(frozenxid) && ! 
TransactionIdPrecedes(pgcform->relfrozenxid, frozenxid)) { pgcform->relfrozenxid = frozenxid; dirty = true; } ! /* relminmxid must never go backward, either */ if (MultiXactIdIsValid(minmulti) && ! MultiXactIdPrecedes(pgcform->relminmxid, minmulti)) { pgcform->relminmxid = minmulti; dirty = true; --- 733,765 ---- } /* ! * Update relfrozenxid, unless caller passed InvalidTransactionId ! * indicating it has no new data. ! * ! * Ordinarily, we don't let relfrozenxid go backwards: if things are ! * working correctly, the only way the new frozenxid could be older would ! * be if a previous VACUUM was done with a tighter freeze_min_age, in ! * which case we don't want to forget the work it already did. However, ! * if the stored relfrozenxid is "in the future", then it must be corrupt ! * and it seems best to overwrite it with the cutoff we used this time. ! * See vac_update_datfrozenxid() concerning what we consider to be "in the ! * future". */ if (TransactionIdIsNormal(frozenxid) && ! pgcform->relfrozenxid != frozenxid && ! (TransactionIdPrecedes(pgcform->relfrozenxid, frozenxid) || ! TransactionIdPrecedes(GetOldestXmin(NULL, true), ! pgcform->relfrozenxid))) { pgcform->relfrozenxid = frozenxid; dirty = true; } ! /* Similarly for relminmxid */ if (MultiXactIdIsValid(minmulti) && ! pgcform->relminmxid != minmulti && ! (MultiXactIdPrecedes(pgcform->relminmxid, minmulti) || ! MultiXactIdPrecedes(GetOldestMultiXactId(), pgcform->relminmxid))) { pgcform->relminmxid = minmulti; dirty = true; *************** vac_update_relstats(Relation relation, *** 772,779 **** * truncate pg_clog and pg_multixact. * * We violate transaction semantics here by overwriting the database's ! * existing pg_database tuple with the new value. This is reasonably ! * safe since the new value is correct whether or not this transaction * commits. As with vac_update_relstats, this avoids leaving dead tuples * behind after a VACUUM. */ --- 786,793 ---- * truncate pg_clog and pg_multixact. 
* * We violate transaction semantics here by overwriting the database's ! * existing pg_database tuple with the new values. This is reasonably ! * safe since the new values are correct whether or not this transaction * commits. As with vac_update_relstats, this avoids leaving dead tuples * behind after a VACUUM. */ *************** vac_update_datfrozenxid(void) *** 786,792 **** --- 800,808 ---- SysScanDesc scan; HeapTuple classTup; TransactionId newFrozenXid; + TransactionId lastSaneFrozenXid; MultiXactId newMinMulti; + MultiXactId lastSaneMinMulti; bool dirty = false; /* *************** vac_update_datfrozenxid(void) *** 795,807 **** * committed pg_class entries for new tables; see AddNewRelationTuple(). * So we cannot produce a wrong minimum by starting with this. */ ! newFrozenXid = GetOldestXmin(NULL, true); /* * Similarly, initialize the MultiXact "min" with the value that would be * used on pg_class for new tables. See AddNewRelationTuple(). */ ! newMinMulti = GetOldestMultiXactId(); /* * We must seqscan pg_class to find the minimum Xid, because there is no --- 811,823 ---- * committed pg_class entries for new tables; see AddNewRelationTuple(). * So we cannot produce a wrong minimum by starting with this. */ ! newFrozenXid = lastSaneFrozenXid = GetOldestXmin(NULL, true); /* * Similarly, initialize the MultiXact "min" with the value that would be * used on pg_class for new tables. See AddNewRelationTuple(). */ ! newMinMulti = lastSaneMinMulti = GetOldestMultiXactId(); /* * We must seqscan pg_class to find the minimum Xid, because there is no *************** vac_update_datfrozenxid(void) *** 842,847 **** --- 858,873 ---- Assert(TransactionIdIsNormal(newFrozenXid)); Assert(MultiXactIdIsValid(newMinMulti)); + /* + * It's possible that every entry in pg_class is "in the future" (bugs in + * pg_upgrade have been known to produce such situations). Don't update + * pg_database, nor truncate CLOG, unless we found at least one + * sane-looking entry. 
+ */ + if (!TransactionIdPrecedes(newFrozenXid, lastSaneFrozenXid) || + !MultiXactIdPrecedes(newMinMulti, lastSaneMinMulti)) + return; + /* Now fetch the pg_database tuple we need to update. */ relation = heap_open(DatabaseRelationId, RowExclusiveLock); *************** vac_update_datfrozenxid(void) *** 852,872 **** dbform = (Form_pg_database) GETSTRUCT(tuple); /* ! * Don't allow datfrozenxid to go backward (probably can't happen anyway); ! * and detect the common case where it doesn't go forward either. */ ! if (TransactionIdPrecedes(dbform->datfrozenxid, newFrozenXid)) { dbform->datfrozenxid = newFrozenXid; dirty = true; } ! /* ditto */ ! if (MultiXactIdPrecedes(dbform->datminmxid, newMinMulti)) { dbform->datminmxid = newMinMulti; dirty = true; } if (dirty) heap_inplace_update(relation, tuple); --- 878,907 ---- dbform = (Form_pg_database) GETSTRUCT(tuple); /* ! * As in vac_update_relstats(), we ordinarily don't want to let ! * datfrozenxid go backward; but if it's "in the future" then it must be ! * corrupt and it seems best to overwrite it. */ ! if (dbform->datfrozenxid != newFrozenXid && ! (TransactionIdPrecedes(dbform->datfrozenxid, newFrozenXid) || ! TransactionIdPrecedes(lastSaneFrozenXid, dbform->datfrozenxid))) { dbform->datfrozenxid = newFrozenXid; dirty = true; } + else + newFrozenXid = dbform->datfrozenxid; ! /* Ditto for datminmxid */ ! if (dbform->datminmxid != newMinMulti && ! (MultiXactIdPrecedes(dbform->datminmxid, newMinMulti) || ! MultiXactIdPrecedes(lastSaneMinMulti, dbform->datminmxid))) { dbform->datminmxid = newMinMulti; dirty = true; } + else + newMinMulti = dbform->datminmxid; if (dirty) heap_inplace_update(relation, tuple); *************** vac_update_datfrozenxid(void) *** 875,883 **** heap_close(relation, RowExclusiveLock); /* ! * If we were able to advance datfrozenxid, see if we can truncate ! * pg_clog. Also do it if the shared XID-wrap-limit info is stale, since ! * this action will update that too. 
*/ if (dirty || ForceTransactionIdLimitUpdate()) vac_truncate_clog(newFrozenXid, newMinMulti); --- 910,918 ---- heap_close(relation, RowExclusiveLock); /* ! * If we were able to advance datfrozenxid or datminmxid, see if we can ! * truncate pg_clog and/or pg_multixact. Also do it if the shared ! * XID-wrap-limit info is stale, since this action will update that too. */ if (dirty || ForceTransactionIdLimitUpdate()) vac_truncate_clog(newFrozenXid, newMinMulti); *************** vac_update_datfrozenxid(void) *** 890,902 **** * Scan pg_database to determine the system-wide oldest datfrozenxid, * and use it to truncate the transaction commit log (pg_clog). * Also update the XID wrap limit info maintained by varsup.c. * ! * The passed XID is simply the one I just wrote into my pg_database ! * entry. It's used to initialize the "min" calculation. * * This routine is only invoked when we've managed to change our ! * DB's datfrozenxid entry, or we found that the shared XID-wrap-limit ! * info is stale. */ static void vac_truncate_clog(TransactionId frozenXID, MultiXactId minMulti) --- 925,938 ---- * Scan pg_database to determine the system-wide oldest datfrozenxid, * and use it to truncate the transaction commit log (pg_clog). * Also update the XID wrap limit info maintained by varsup.c. + * Likewise for datminmxid. * ! * The passed XID and minMulti are the updated values for my own ! * pg_database entry. They're used to initialize the "min" calculations. * * This routine is only invoked when we've managed to change our ! * DB's datfrozenxid/datminmxid values, or we found that the shared ! * XID-wrap-limit info is stale. */ static void vac_truncate_clog(TransactionId frozenXID, MultiXactId minMulti) *************** vac_truncate_clog(TransactionId frozenXI *** 909,920 **** Oid minmulti_datoid; bool frozenAlreadyWrapped = false; ! /* init oldest datoids to sync with my frozen values */ oldestxid_datoid = MyDatabaseId; minmulti_datoid = MyDatabaseId; /* ! 
* Scan pg_database to compute the minimum datfrozenxid * * Note: we need not worry about a race condition with new entries being * inserted by CREATE DATABASE. Any such entry will have a copy of some --- 945,956 ---- Oid minmulti_datoid; bool frozenAlreadyWrapped = false; ! /* init oldest datoids to sync with my frozenXID/minMulti values */ oldestxid_datoid = MyDatabaseId; minmulti_datoid = MyDatabaseId; /* ! * Scan pg_database to compute the minimum datfrozenxid/datminmxid * * Note: we need not worry about a race condition with new entries being * inserted by CREATE DATABASE. Any such entry will have a copy of some
I wrote: > Here's a draft patch for this. I think this will fix all cases where > the "1" minmxid inserted by previous pg_upgrade versions is actually > in the future at the time we run VACUUM. We would still be at risk if > it had been in the future when pg_upgrade ran but no longer is now, > since that would mean there could be non-lock-only mxids on disk that > are older than "1". However, for the reasons discussed upthread, it > seems fairly unlikely to me that people would actually get burnt in > practice, so I'm satisfied with doing this much and no more. Ah, belay that: as coded, that would allow truncation of clog/multixact as soon as any one relation in any one database had sane frozenxid/minmxid. If we want to have any pretense of being safe, we have to postpone truncation until *all* relations have been vacuumed. So more like the attached, instead. regards, tom lane diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c index 8822a15..ec9a7b6 100644 *** a/src/backend/commands/vacuum.c --- b/src/backend/commands/vacuum.c *************** static BufferAccessStrategy vac_strategy *** 66,72 **** /* non-export function prototypes */ static List *get_rel_oids(Oid relid, const RangeVar *vacrel); ! static void vac_truncate_clog(TransactionId frozenXID, MultiXactId minMulti); static bool vacuum_rel(Oid relid, VacuumStmt *vacstmt, bool do_toast, bool for_wraparound); --- 66,75 ---- /* non-export function prototypes */ static List *get_rel_oids(Oid relid, const RangeVar *vacrel); ! static void vac_truncate_clog(TransactionId frozenXID, ! MultiXactId minMulti, ! TransactionId lastSaneFrozenXid, ! MultiXactId lastSaneMinMulti); static bool vacuum_rel(Oid relid, VacuumStmt *vacstmt, bool do_toast, bool for_wraparound); *************** vac_update_relstats(Relation relation, *** 733,751 **** } /* ! * relfrozenxid should never go backward. Caller can pass ! * InvalidTransactionId if it has no new data. */ if (TransactionIdIsNormal(frozenxid) && ! 
TransactionIdPrecedes(pgcform->relfrozenxid, frozenxid)) { pgcform->relfrozenxid = frozenxid; dirty = true; } ! /* relminmxid must never go backward, either */ if (MultiXactIdIsValid(minmulti) && ! MultiXactIdPrecedes(pgcform->relminmxid, minmulti)) { pgcform->relminmxid = minmulti; dirty = true; --- 736,768 ---- } /* ! * Update relfrozenxid, unless caller passed InvalidTransactionId ! * indicating it has no new data. ! * ! * Ordinarily, we don't let relfrozenxid go backwards: if things are ! * working correctly, the only way the new frozenxid could be older would ! * be if a previous VACUUM was done with a tighter freeze_min_age, in ! * which case we don't want to forget the work it already did. However, ! * if the stored relfrozenxid is "in the future", then it must be corrupt ! * and it seems best to overwrite it with the cutoff we used this time. ! * See vac_update_datfrozenxid() concerning what we consider to be "in the ! * future". */ if (TransactionIdIsNormal(frozenxid) && ! pgcform->relfrozenxid != frozenxid && ! (TransactionIdPrecedes(pgcform->relfrozenxid, frozenxid) || ! TransactionIdPrecedes(GetOldestXmin(NULL, true), ! pgcform->relfrozenxid))) { pgcform->relfrozenxid = frozenxid; dirty = true; } ! /* Similarly for relminmxid */ if (MultiXactIdIsValid(minmulti) && ! pgcform->relminmxid != minmulti && ! (MultiXactIdPrecedes(pgcform->relminmxid, minmulti) || ! MultiXactIdPrecedes(GetOldestMultiXactId(), pgcform->relminmxid))) { pgcform->relminmxid = minmulti; dirty = true; *************** vac_update_relstats(Relation relation, *** 772,779 **** * truncate pg_clog and pg_multixact. * * We violate transaction semantics here by overwriting the database's ! * existing pg_database tuple with the new value. This is reasonably ! * safe since the new value is correct whether or not this transaction * commits. As with vac_update_relstats, this avoids leaving dead tuples * behind after a VACUUM. */ --- 789,796 ---- * truncate pg_clog and pg_multixact. 
* * We violate transaction semantics here by overwriting the database's ! * existing pg_database tuple with the new values. This is reasonably ! * safe since the new values are correct whether or not this transaction * commits. As with vac_update_relstats, this avoids leaving dead tuples * behind after a VACUUM. */ *************** vac_update_datfrozenxid(void) *** 786,792 **** --- 803,812 ---- SysScanDesc scan; HeapTuple classTup; TransactionId newFrozenXid; + TransactionId lastSaneFrozenXid; MultiXactId newMinMulti; + MultiXactId lastSaneMinMulti; + bool bogus = false; bool dirty = false; /* *************** vac_update_datfrozenxid(void) *** 795,807 **** * committed pg_class entries for new tables; see AddNewRelationTuple(). * So we cannot produce a wrong minimum by starting with this. */ ! newFrozenXid = GetOldestXmin(NULL, true); /* * Similarly, initialize the MultiXact "min" with the value that would be * used on pg_class for new tables. See AddNewRelationTuple(). */ ! newMinMulti = GetOldestMultiXactId(); /* * We must seqscan pg_class to find the minimum Xid, because there is no --- 815,827 ---- * committed pg_class entries for new tables; see AddNewRelationTuple(). * So we cannot produce a wrong minimum by starting with this. */ ! newFrozenXid = lastSaneFrozenXid = GetOldestXmin(NULL, true); /* * Similarly, initialize the MultiXact "min" with the value that would be * used on pg_class for new tables. See AddNewRelationTuple(). */ ! newMinMulti = lastSaneMinMulti = GetOldestMultiXactId(); /* * We must seqscan pg_class to find the minimum Xid, because there is no *************** vac_update_datfrozenxid(void) *** 828,833 **** --- 848,868 ---- Assert(TransactionIdIsNormal(classForm->relfrozenxid)); Assert(MultiXactIdIsValid(classForm->relminmxid)); + /* + * If things are working properly, no relation should have a + * relfrozenxid or relminmxid that is "in the future". However, such + * cases have been known to arise due to bugs in pg_upgrade. 
If we + * see any entries that are "in the future", chicken out and don't do + * anything. This ensures we won't truncate clog before those + * relations have been scanned and cleaned up. + */ + if (TransactionIdPrecedes(lastSaneFrozenXid, classForm->relfrozenxid) || + MultiXactIdPrecedes(lastSaneMinMulti, classForm->relminmxid)) + { + bogus = true; + break; + } + if (TransactionIdPrecedes(classForm->relfrozenxid, newFrozenXid)) newFrozenXid = classForm->relfrozenxid; *************** vac_update_datfrozenxid(void) *** 839,844 **** --- 874,883 ---- systable_endscan(scan); heap_close(relation, AccessShareLock); + /* chicken out if bogus data found */ + if (bogus) + return; + Assert(TransactionIdIsNormal(newFrozenXid)); Assert(MultiXactIdIsValid(newMinMulti)); *************** vac_update_datfrozenxid(void) *** 852,872 **** dbform = (Form_pg_database) GETSTRUCT(tuple); /* ! * Don't allow datfrozenxid to go backward (probably can't happen anyway); ! * and detect the common case where it doesn't go forward either. */ ! if (TransactionIdPrecedes(dbform->datfrozenxid, newFrozenXid)) { dbform->datfrozenxid = newFrozenXid; dirty = true; } ! /* ditto */ ! if (MultiXactIdPrecedes(dbform->datminmxid, newMinMulti)) { dbform->datminmxid = newMinMulti; dirty = true; } if (dirty) heap_inplace_update(relation, tuple); --- 891,920 ---- dbform = (Form_pg_database) GETSTRUCT(tuple); /* ! * As in vac_update_relstats(), we ordinarily don't want to let ! * datfrozenxid go backward; but if it's "in the future" then it must be ! * corrupt and it seems best to overwrite it. */ ! if (dbform->datfrozenxid != newFrozenXid && ! (TransactionIdPrecedes(dbform->datfrozenxid, newFrozenXid) || ! TransactionIdPrecedes(lastSaneFrozenXid, dbform->datfrozenxid))) { dbform->datfrozenxid = newFrozenXid; dirty = true; } + else + newFrozenXid = dbform->datfrozenxid; ! /* Ditto for datminmxid */ ! if (dbform->datminmxid != newMinMulti && ! (MultiXactIdPrecedes(dbform->datminmxid, newMinMulti) || ! 
MultiXactIdPrecedes(lastSaneMinMulti, dbform->datminmxid))) { dbform->datminmxid = newMinMulti; dirty = true; } + else + newMinMulti = dbform->datminmxid; if (dirty) heap_inplace_update(relation, tuple); *************** vac_update_datfrozenxid(void) *** 875,886 **** heap_close(relation, RowExclusiveLock); /* ! * If we were able to advance datfrozenxid, see if we can truncate ! * pg_clog. Also do it if the shared XID-wrap-limit info is stale, since ! * this action will update that too. */ if (dirty || ForceTransactionIdLimitUpdate()) ! vac_truncate_clog(newFrozenXid, newMinMulti); } --- 923,935 ---- heap_close(relation, RowExclusiveLock); /* ! * If we were able to advance datfrozenxid or datminmxid, see if we can ! * truncate pg_clog and/or pg_multixact. Also do it if the shared ! * XID-wrap-limit info is stale, since this action will update that too. */ if (dirty || ForceTransactionIdLimitUpdate()) ! vac_truncate_clog(newFrozenXid, newMinMulti, ! lastSaneFrozenXid, lastSaneMinMulti); } *************** vac_update_datfrozenxid(void) *** 890,905 **** * Scan pg_database to determine the system-wide oldest datfrozenxid, * and use it to truncate the transaction commit log (pg_clog). * Also update the XID wrap limit info maintained by varsup.c. * ! * The passed XID is simply the one I just wrote into my pg_database ! * entry. It's used to initialize the "min" calculation. * * This routine is only invoked when we've managed to change our ! * DB's datfrozenxid entry, or we found that the shared XID-wrap-limit ! * info is stale. */ static void ! vac_truncate_clog(TransactionId frozenXID, MultiXactId minMulti) { TransactionId myXID = GetCurrentTransactionId(); Relation relation; --- 939,960 ---- * Scan pg_database to determine the system-wide oldest datfrozenxid, * and use it to truncate the transaction commit log (pg_clog). * Also update the XID wrap limit info maintained by varsup.c. + * Likewise for datminmxid. * ! 
* The passed frozenXID and minMulti are the updated values for my own ! * pg_database entry. They're used to initialize the "min" calculations. ! * The caller also passes the "last sane" XID and MXID, since it has ! * those at hand already. * * This routine is only invoked when we've managed to change our ! * DB's datfrozenxid/datminmxid values, or we found that the shared ! * XID-wrap-limit info is stale. */ static void ! vac_truncate_clog(TransactionId frozenXID, ! MultiXactId minMulti, ! TransactionId lastSaneFrozenXid, ! MultiXactId lastSaneMinMulti) { TransactionId myXID = GetCurrentTransactionId(); Relation relation; *************** vac_truncate_clog(TransactionId frozenXI *** 907,920 **** HeapTuple tuple; Oid oldestxid_datoid; Oid minmulti_datoid; bool frozenAlreadyWrapped = false; ! /* init oldest datoids to sync with my frozen values */ oldestxid_datoid = MyDatabaseId; minmulti_datoid = MyDatabaseId; /* ! * Scan pg_database to compute the minimum datfrozenxid * * Note: we need not worry about a race condition with new entries being * inserted by CREATE DATABASE. Any such entry will have a copy of some --- 962,976 ---- HeapTuple tuple; Oid oldestxid_datoid; Oid minmulti_datoid; + bool bogus = false; bool frozenAlreadyWrapped = false; ! /* init oldest datoids to sync with my frozenXID/minMulti values */ oldestxid_datoid = MyDatabaseId; minmulti_datoid = MyDatabaseId; /* ! * Scan pg_database to compute the minimum datfrozenxid/datminmxid * * Note: we need not worry about a race condition with new entries being * inserted by CREATE DATABASE. Any such entry will have a copy of some *************** vac_truncate_clog(TransactionId frozenXI *** 936,941 **** --- 992,1010 ---- Assert(TransactionIdIsNormal(dbform->datfrozenxid)); Assert(MultiXactIdIsValid(dbform->datminmxid)); + /* + * If things are working properly, no database should have a + * datfrozenxid or datminmxid that is "in the future". 
However, such + * cases have been known to arise due to bugs in pg_upgrade. If we + * see any entries that are "in the future", chicken out and don't do + * anything. This ensures we won't truncate clog before those + * databases have been scanned and cleaned up. (We will issue the + * "already wrapped" warning if appropriate, though.) + */ + if (TransactionIdPrecedes(lastSaneFrozenXid, dbform->datfrozenxid) || + MultiXactIdPrecedes(lastSaneMinMulti, dbform->datminmxid)) + bogus = true; + if (TransactionIdPrecedes(myXID, dbform->datfrozenxid)) frozenAlreadyWrapped = true; else if (TransactionIdPrecedes(dbform->datfrozenxid, frozenXID)) *************** vac_truncate_clog(TransactionId frozenXI *** 969,974 **** --- 1038,1047 ---- return; } + /* chicken out if data is bogus in any other way */ + if (bogus) + return; + /* * Truncate CLOG to the oldest computed value. Note we don't truncate * multixacts; that will be done by the next checkpoint.
Thanks for spending some time on this issue.

--
Álvaro Herrera                    http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2014-07-20 23:19:30 -0400, Tom Lane wrote:
> I wrote:
> > Here's a draft patch for this. I think this will fix all cases where
> > the "1" minmxid inserted by previous pg_upgrade versions is actually
> > in the future at the time we run VACUUM. We would still be at risk if
> > it had been in the future when pg_upgrade ran but no longer is now,
> > since that would mean there could be non-lock-only mxids on disk that
> > are older than "1". However, for the reasons discussed upthread, it
> > seems fairly unlikely to me that people would actually get burnt in
> > practice, so I'm satisfied with doing this much and no more.
>
> Ah, belay that: as coded, that would allow truncation of clog/multixact
> as soon as any one relation in any one database had sane
> frozenxid/minmxid. If we want to have any pretense of being safe, we have
> to postpone truncation until *all* relations have been vacuumed. So more
> like the attached, instead.

Looks basically good to me. Two minor things:

1) How about adding an elog(WARNING, "relation %u has invalid freeze
limits", ...) to vac_update_datfrozenxid()? Otherwise it's a bit hard to
understand what's going on. I wouldn't want to invest too much effort
into the precision of the message, but some log bleat seems justified.

2) From a pedantic POV, lastSane* should be initialized to
ReadNewTransactionId()/ReadNextMultiXactId(). On a busy system we could
otherwise consider concurrently updated values as being too new if some
long-running transaction finishes. Also, GetOldestXmin() can sometimes
go backwards...

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> 1) How about adding an elog(WARNING, "relation %u has invalid freeze
> limits", ...) to vac_update_datfrozenxid()? Otherwise it's a bit hard
> to understand what's going on. I wouldn't want to invest too much effort
> into the precision of the message, but some log bleat seems justified.

I considered that but wasn't really convinced it was a good idea.
You'd be getting a lot of log spam on a 9.3 installation that had been
affected by the pg_upgrade bug, until all the tables got cleaned up
(and we're intentionally not forcing that to happen quickly).

Anybody else have an opinion?

> 2) From a pedantic POV lastSane* should be initialized to
> ReadNewTransactionId()/ReadNextMultiXactId(). On a busy system we could
> otherwise consider concurrently updated values as being too new if some
> longrunning transaction finishes. Also GetOldestXmin() can sometimes go
> backwards...

If it's wrong then that's a pre-existing bug, but I don't think it's
wrong.

			regards, tom lane
On 2014-07-21 12:20:56 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > 2) From a pedantic POV lastSane* should be initialized to
> > ReadNewTransactionId()/ReadNextMultiXactId(). On a busy system we could
> > otherwise consider concurrently updated values as being too new if some
> > longrunning transaction finishes. Also GetOldestXmin() can sometimes go
> > backwards...
>
> If it's wrong then that's a pre-existing bug, but I don't think it's
> wrong.

I'm not wondering so much about vac_update_relstats(). There indeed the
calls aren't new and the worst that can happen is a slightly older
freeze limit. I'm more wondering about vac_update_datfrozenxid(). Afaics
we very well could hit

    newFrozenXid = lastSaneFrozenXid = GetOldestXmin(NULL, true);
    newMinMulti = lastSaneMinMulti = GetOldestMultiXactId();
    ...
    /*
     * If things are working properly, no relation should have a
     * relfrozenxid or relminmxid that is "in the future".  However, such
     * cases have been known to arise due to bugs in pg_upgrade.  If we
     * see any entries that are "in the future", chicken out and don't do
     * anything.  This ensures we won't truncate clog before those
     * relations have been scanned and cleaned up.
     */
    if (TransactionIdPrecedes(lastSaneFrozenXid, classForm->relfrozenxid) ||
        MultiXactIdPrecedes(lastSaneMinMulti, classForm->relminmxid))
    {
        bogus = true;
        break;
    }

for a relation that has just been vacuumed by another backend. In a
database with a high number of pg_class entries/much bloat it's not that
unrealistic that relfrozenxid/relminmxid have been updated since the
GetOldest* calls above. Since vac_update_relstats() writes them using
heap_inplace_update(), the window for that isn't particularly small.

That could potentially lead to never updating the database freeze
limits, right?

Greetings,

Andres Freund

--
Andres Freund                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> I'm not wondering so much about vac_update_relstats(). There indeed the
> calls aren't new and the worst that can happen is a slightly older
> freeze limit. I'm more wondering about vac_update_datfrozenxid(). Afaics
> we very well could hit
> newFrozenXid = lastSaneFrozenXid = GetOldestXmin(NULL, true);
> newMinMulti = lastSaneMinMulti = GetOldestMultiXactId();
> for a relation that has just been vacuumed by another backend.

Hmm ... I see. The issue is not what the computed minimum datfrozenxid
etc should be; it's right to err in the backwards direction there.
It's whether we want to declare that the calculation is bogus and abandon
truncation if another session manages to sneak in a very-new relfrozenxid.

Yeah, you're right, we need to be conservative about doing that. I'd
wanted to avoid extra calls here but I guess we have to do them after all.
Will fix.

			regards, tom lane
On 2014-07-21 12:43:24 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > I'm not wondering so much about vac_update_relstats(). There indeed the
> > calls aren't new and the worst that can happen is a slightly older
> > freeze limit. I'm more wondering about vac_update_datfrozenxid(). Afaics
> > we very well could hit
> > newFrozenXid = lastSaneFrozenXid = GetOldestXmin(NULL, true);
> > newMinMulti = lastSaneMinMulti = GetOldestMultiXactId();
> > for a relation that has just been vacuumed by another backend.
>
> Hmm ... I see. The issue is not what the computed minimum datfrozenxid
> etc should be; it's right to err in the backwards direction there.
> It's whether we want to declare that the calculation is bogus and abandon
> truncation if another session manages to sneak in a very-new relfrozenxid.
> Yeah, you're right, we need to be conservative about doing that. I'd
> wanted to avoid extra calls here but I guess we have to do them after all.
> Will fix.

I wonder if GetTopTransactionId()/MultiXactIdSetOldestMember() and using
lastSane* = ReadNew* isn't sufficient. After the xid assignment
concurrent GetOldest* can't go below the ReadNew* values anymore, right?

Greetings,

Andres Freund

-- 
 Andres Freund	                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-07-21 12:43:24 -0400, Tom Lane wrote:
>> Will fix.

> I wonder if GetTopTransactionId()/MultiXactIdSetOldestMember() and using
> lastSane* = ReadNew* isn't sufficient. After the xid assignment
> concurrent GetOldest* can't go below the ReadNew* values anymore, right?

I don't see any point in being picky about it. What we want is to reject
values that are conclusively bogus; things that are just close to bogus
are not so interesting.

			regards, tom lane
On 2014-07-21 13:03:21 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2014-07-21 12:43:24 -0400, Tom Lane wrote:
> >> Will fix.
>
> > I wonder if GetTopTransactionId()/MultiXactIdSetOldestMember() and using
> > lastSane* = ReadNew* isn't sufficient. After the xid assignment
> > concurrent GetOldest* can't go below the ReadNew* values anymore, right?
>
> I don't see any point in being picky about it. What we want is to reject
> values that are conclusively bogus; things that are just close to bogus
> are not so interesting.

It'd just have been about optimizing away repeated calls to ReadNew*...

Thanks.

Greetings,

Andres Freund

-- 
 Andres Freund	                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-07-21 13:03:21 -0400, Tom Lane wrote:
>> I don't see any point in being picky about it. What we want is to reject
>> values that are conclusively bogus; things that are just close to bogus
>> are not so interesting.

> It'd just have been about optimizing away repeated calls to ReadNew*...

Oh. I think the existing code is probably efficient enough about that.
vac_update_datfrozenxid() only does it once anyway, and the calls in
vac_update_relstats() are arranged so that we don't expect them to be
hit at all in normal cases.

			regards, tom lane
On 21 July 2014 00:37, Tom Lane <tgl@sss.pgh.pa.us> wrote:

> I'm a bit inclined to not say anything about fix procedures in the release
> notes, because I'm not sure that this is a problem in the field. If
> anybody did have a wraparound they'd be getting bleats from VACUUM, and no
> one has reported any such thing that I've heard.

I hit the issue, but Andres helped on IRC (and so started this
thread). Yes, VACUUM failing on a production server after the 9.1->9.3
upgrade was the problem. I'd assumed there was database corruption and
I had some unreadable blocks, and probably would have just rebuilt the
table if #postgresql hadn't been so helpful.

-- 
Stuart Bishop <stuart@stuartbishop.net>
http://www.stuartbishop.net/
On 2014-07-22 16:08:37 +0700, Stuart Bishop wrote:
> On 21 July 2014 00:37, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > I'm a bit inclined to not say anything about fix procedures in the release
> > notes, because I'm not sure that this is a problem in the field. If
> > anybody did have a wraparound they'd be getting bleats from VACUUM, and no
> > one has reported any such thing that I've heard.
>
> I hit the issue, but Andres helped on IRC (and so started this
> thread). Yes, VACUUM failing on a production server after the 9.1->9.3
> upgrade was the problem. I'd assumed there was database corruption and
> I had some unreadable blocks, and probably would have just rebuilt the
> table if #postgresql hadn't been so helpful.

This is actually a second problem that came up during the investigation
of the problem you had - with different symptoms and causes... IIRC it's
not what you've experienced. If you look at
http://www.postgresql.org/docs/devel/static/release-9-3-5.html your
problem should be described.

Greetings,

Andres Freund

-- 
 Andres Freund	                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Sun, Jul 20, 2014 at 11:19:30PM -0400, Tom Lane wrote:
> I wrote:
> > Here's a draft patch for this. I think this will fix all cases where
> > the "1" minmxid inserted by previous pg_upgrade versions is actually
> > in the future at the time we run VACUUM. We would still be at risk if
> > it had been in the future when pg_upgrade ran but no longer is now,
> > since that would mean there could be non-lock-only mxids on disk that
> > are older than "1". However, for the reasons discussed upthread, it
> > seems fairly unlikely to me that people would actually get burnt in
> > practice, so I'm satisfied with doing this much and no more.
>
> Ah, belay that: as coded, that would allow truncation of clog/multixact
> as soon as any one relation in any one database had sane
> frozenxid/minmxid. If we want to have any pretense of being safe, we have
> to postpone truncation until *all* relations have been vacuumed. So more
> like the attached, instead.

Should we conclude that the multi-xact code is hopelessly complex and
needs a redesign?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +
On 2014-07-22 09:09:31 -0400, Bruce Momjian wrote:
> On Sun, Jul 20, 2014 at 11:19:30PM -0400, Tom Lane wrote:
> > I wrote:
> > > Here's a draft patch for this. I think this will fix all cases where
> > > the "1" minmxid inserted by previous pg_upgrade versions is actually
> > > in the future at the time we run VACUUM. We would still be at risk if
> > > it had been in the future when pg_upgrade ran but no longer is now,
> > > since that would mean there could be non-lock-only mxids on disk that
> > > are older than "1". However, for the reasons discussed upthread, it
> > > seems fairly unlikely to me that people would actually get burnt in
> > > practice, so I'm satisfied with doing this much and no more.
> >
> > Ah, belay that: as coded, that would allow truncation of clog/multixact
> > as soon as any one relation in any one database had sane
> > frozenxid/minmxid. If we want to have any pretense of being safe, we have
> > to postpone truncation until *all* relations have been vacuumed. So more
> > like the attached, instead.
>
> Should we conclude that the multi-xact code is hopelessly complex and
> needs a redesign?

That might be true, but I don't see the above as evidence of it. It's a
nontrivial workaround for a bug; it's not surprising that it needs a
couple iterations to make sense. Without the pg_upgrade bug there'd be
no need to make those adjustments.

Greetings,

Andres Freund

-- 
 Andres Freund	                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jul 22, 2014 at 04:14:01PM +0200, Andres Freund wrote:
> On 2014-07-22 09:09:31 -0400, Bruce Momjian wrote:
> > On Sun, Jul 20, 2014 at 11:19:30PM -0400, Tom Lane wrote:
> > > I wrote:
> > > > Here's a draft patch for this. I think this will fix all cases where
> > > > the "1" minmxid inserted by previous pg_upgrade versions is actually
> > > > in the future at the time we run VACUUM. We would still be at risk if
> > > > it had been in the future when pg_upgrade ran but no longer is now,
> > > > since that would mean there could be non-lock-only mxids on disk that
> > > > are older than "1". However, for the reasons discussed upthread, it
> > > > seems fairly unlikely to me that people would actually get burnt in
> > > > practice, so I'm satisfied with doing this much and no more.
> > >
> > > Ah, belay that: as coded, that would allow truncation of clog/multixact
> > > as soon as any one relation in any one database had sane
> > > frozenxid/minmxid. If we want to have any pretense of being safe, we have
> > > to postpone truncation until *all* relations have been vacuumed. So more
> > > like the attached, instead.
> >
> > Should we conclude that the multi-xact code is hopelessly complex and
> > needs a redesign?
>
> That might be true, but I don't see the above as evidence of it. It's a
> nontrivial workaround for a bug; it's not surprising that it needs a
> couple iterations to make sense. Without the pg_upgrade bug there'd be
> no need to make those adjustments.

Well, my point is that even Tom was confused about how things should be
handled. Anyway, seems no one thinks a redesign is needed, but I had to
ask.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +