BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
tgarnett@panjiva.com
Date:
The following bug has been logged on the website:

Bug reference: 12990
Logged by: Timothy Garnett
Email address: tgarnett@panjiva.com
PostgreSQL version: 9.3.5
Operating system: Ubuntu Linux x86_64 12.04.5 LTS

Description:

At 4/1 5:10pm our production database started throwing these kinds of errors (all of our hot spares were also similarly corrupted):

ERROR: could not access status of transaction 303450738
DETAIL: Could not open file "pg_multixact/members/7B49": No such file or directory.

This is possibly related to bug #8673 (though from the mailing list it sounds like most issues there were resolved by 9.3.4, and we are on 9.3.5). Also of note, this db cluster was pg_upgraded from 9.2 at one point (to 9.3.x, then further upgraded to 9.3.5) (see http://www.postgresql.org/message-id/CAAS3ty+2ynCyf_YmRn6WuqSF8EmJMDypAkc7uD_EXTZJ7usOSg@mail.gmail.com , though it doesn't seem like that's involved here).

We have a file-system level snapshot of a hot spare of the db from about 28 hours before the errors started and sufficient WAL files to do point-in-time recovery. We recovered a copy of the database to about 1 hour before the errors started and promoted that to our production db (losing a couple of hours of commits).

Walking through the point-in-time recovery we see the pg_multixact/members folder fill up and eventually suffer a massive truncation when the newest file overtakes the oldest file. The range from oldest to newest (through the wrap point 14078) appears to be continuous.

Time   Newest-File  Oldest-File
1 am   72BA         98A4 (last Oct.)
10 am  72BC         98A4
11 am  72BD         98A4
12 pm  7C4E         98A4
1 pm   7E88         98A4
2 pm   7FFF         98A4
3 pm   884E         98A4
4 pm   905F         98A4
* here we forked our new production
4:30   94AF         98D2
5 pm   984D         98D2
5:15   98D2         9900
* now errors, all files outside of 98D2-9900 are gone

During the rapid growth we were generating 10-12 pg_multixact/members files per minute. After we forked off a new production we identified the workload that was causing this rapid growth and stopped it. We then used vacuumdb -a -F to vacuum freeze the whole database cluster. Our presumption was that this would free up the members files, but it only freed up a few. The current oldest file is 9E30 (~Jan. 1st), so it moved up from 98A4, and the newest file is 906A. We're concerned this might be a time bomb waiting for us in the future (though at the slower growth rate of 16 files a day or so, potentially a while in the future), as the members file namespace is still 95+% consumed post vacuum freeze. We do plan to upgrade to 9.4 sometime in the next couple of months and are curious if we can use pg_upgrade or if we will need to dump / restore the full (multi-TiB) cluster.

As for the workload causing the rapid growth, it involved something like:

while [many millions of things to update / insert]
    BEGIN;
    SELECT state FROM table_A WHERE id = 1 FOR SHARE;
    if state != 'blocked'
        update / insert 2000-10000 rows in table_B, other stuff
        COMMIT;
    else
        COMMIT;
        sleep wait

being run in a bunch of connections (26) to the db. Row 1 of table_A was being used effectively as a share / exclusive lock, as a different process would update the state to 'blocked' to block the first process in order to manage some state shared outside the db.

We've retained the pre-failure snapshot and WAL files for now (though we will need to free them up at some point), so we can point-in-time recover to any point from 28 hours before to several hours after the problem surfaced, if that's helpful at all.
Bugs / Questions:
- the members files wrapped over themselves, leading to corruption
- why didn't vacuum_db -a -F free up more members files?
Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Kevin Grittner
Date:
"tgarnett@panjiva.com" <tgarnett@panjiva.com> wrote:

> - why didn't vacuum_db -a -F free up more members files?

Are you sure that while the problem was developing the primary cluster didn't have any long-running transactions (including those left "idle in transaction" or prepared transactions)? Was there a checkpoint following the vacuumdb -a -F run?

-- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Timothy Garnett
Date:
>> - why didn't vacuum_db -a -F free up more members files?
>
> Are you sure that while the problem was developing the primary
> cluster didn't have any long-running transactions (including those
> left "idle in transaction" or prepared transactions)? Was there a
> checkpoint following the vacuumdb -a -F run?

The vacuumdb was run after we recovered from backup (which involved restarting the server), so at the time the vacuumdb started there were no open transactions. There have been checkpoints since the vacuumdb finished as well. In the lead-up to the problem there wouldn't have been any transactions open more than a couple of hours (the oldest members file was over 6 months old).

Tim
Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Bruce Momjian
Date:
On Mon, Apr 6, 2015 at 07:21:30PM +0000, tgarnett@panjiva.com wrote:
> We've retained the pre-failure snapshot and wal files for now (though we
> will need to free them up at some point) so we can point in time recover to
> any point from 28 hours before to several hours after the problem surfaced
> if that's helpful at all.
>
> Bugs / Questions:
> - the members files wrapped over themselves leading to corruption
> - why didn't vacuum_db -a -F free up more members files?

Wow, you did a lot of research on this and I am glad you seem to have found a solution. Frankly, there were so many multi-xact bugs in 9.3.X that I am unable to track exactly what bugs caused what failures, when they were fixed, and whether fixes fixed the problem for good, or whether they just fixed them from happening in the future. Yes, very discouraging.

-- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + Everyone has their own god. +
Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Alvaro Herrera
Date:
tgarnett@panjiva.com wrote:

> ERROR: could not access status of transaction 303450738
> DETAIL: Could not open file "pg_multixact/members/7B49": No such file or
> directory.

Bruce and Kevin both pinged me about this issue recently. Turns out that I have an incomplete patch to close the problem. Just to clarify, this is a completely new problem, not related to #8673.

The fix is to raise an ERROR when generating a new multixact, if we detect that doing so would get close to the oldest multixact that the system knows about. If that happens, the solution is to vacuum so that the "oldest" point is advanced a bit more and you have room to generate more multixacts. In production, you would typically adjust the multixact freeze parameters so that "oldest multixact" is advanced more aggressively and you don't hit the ERROR.

A fix I pushed to master (for a performance regression reported as bug #8470) would make the problem less common, by having typical multixact sizes be smaller. I didn't backpatch that fix due to lack of feedback, but since it is connected to this data-eating bug, maybe we should look into doing so. This problem only arises if your multixacts are larger than 24 members on average (or something like that; I don't recall the exact number). That should be atypical, except that prior to the #8470 fix the multixact size is related to the number of subtransactions doing certain operations in a loop.

-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
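A minimal sketch of the kind of guard being described, with hypothetical names; the real patch lives in multixact.c, keeps a safety margin of whole segments, and has to handle the case where no old offset is known yet. It assumes the usual backend environment (postgres.h, ereport):

typedef uint32 MultiXactOffset;		/* as in multixact.h */

static void
refuse_if_members_would_wrap(MultiXactOffset nextOffset, uint32 nmembers,
							 MultiXactOffset oldestNeededOffset)
{
	/*
	 * Forward distance, with wraparound, from the next free member slot to
	 * the oldest slot that vacuum has not yet released.  Note that a freshly
	 * initialized cluster has nextOffset == oldestNeededOffset, so real code
	 * needs a separate "no limit established yet" state -- exactly the
	 * corner case that comes up later in this thread.
	 */
	MultiXactOffset headroom = oldestNeededOffset - nextOffset;

	if (nmembers >= headroom)
		ereport(ERROR,
				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
				 errmsg("multixact \"members\" limit exceeded"),
				 errhint("Execute a database-wide VACUUM with reduced "
						 "vacuum_multixact_freeze_min_age and "
						 "vacuum_multixact_freeze_table_age settings.")));
}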
Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Robert Haas
Date:
On Mon, Apr 13, 2015 at 5:08 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> tgarnett@panjiva.com wrote:
>
>> ERROR: could not access status of transaction 303450738
>> DETAIL: Could not open file "pg_multixact/members/7B49": No such file or
>> directory.
>
> Bruce and Kevin both pinged me about this issue recently. Turns out
> that I have an incomplete patch to close the problem. Just to clarify,
> this is a completely new problem, not related to #8673.
>
> The fix is to raise an ERROR when generating a new multixact, if we
> detect that doing so would get close to the oldest multixact that the
> system knows about.

I think we definitely need to do that ASAP. And possibly then force an immediate minor release. Bugs that eat your data are bad, and we have a customer hitting this completely independently of this report, which makes this look like more than a theoretical problem.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Timothy Garnett
Date:
On Wed, Apr 15, 2015 at 12:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think we definitely need to do that ASAP. And possibly then force
> an immediate minor release. Bugs that eat your data are bad, and we
> have a customer hitting this completely independently of this report,
> which makes this look like more than a theoretical problem.

I defer to others on what a good timeline would be, but failing harder here would be good. We were somewhat lucky in discovering the issue as fast as we did. Post-wrap, 99.9+% of the queries to our database were returning fine without error. We only observed the failure when accessing two specific rows in two different tables (there were possibly more; it's a multi-TiB db cluster). Had we not hit those rows for a few days instead of a few hours, recovery would have been extremely difficult, as rolling back to a known good state wouldn't really have been an option.

Tim
Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Alvaro Herrera
Date:
Alvaro Herrera wrote:

> The fix is to raise an ERROR when generating a new multixact, if we
> detect that doing so would get close to the oldest multixact that the
> system knows about. If that happens, the solution is to vacuum so that
> the "oldest" point is advanced a bit more and you have room to generate
> more multixacts. In production, you would typically adjust the
> multixact freeze parameters so that "oldest multixact" is advanced more
> aggressively and you don't hit the ERROR.

Here's a patch. I have tested locally and it closes the issue for me. If those affected can confirm that it stops the file removal from happening, I'd appreciate it.

It would be much better to avoid that additional file reading, but it seems difficult (read: I don't see how) without changing pg_control.

Note: in bootstrap.c, I had to move the setting of bootstrap processing mode a bit earlier. Otherwise, the is-in-bootstrap-mode test returned false while bootstrapping multixact, so initdb would fail in the initial phase because of trying to read pg_multixact/offset/0000, which hasn't been created at that point. The amount of code that runs in bootstrap mode after this change that wasn't running in bootstrap mode previously is pretty tiny and shouldn't cause any problem -- it's essentially the whole of BootStrapXLOG().

-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Thomas Munro
Date:
Hi Alvaro

On Tue, Apr 21, 2015 at 7:04 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Here's a patch. I have tested locally and it closes the issue for me.
> If those affected can confirm that it stops the file removal from
> happening, I'd appreciate it.

I was also starting to look at this problem. For what it's worth, here's a client program that I used to generate a lot of multixact members. The patch seems to work correctly so far: as the offset approached wraparound, I saw the warnings first, with appropriate OID and members remaining, and then I was blocked from creating new multixacts.

Best regards,

-- Thomas Munro http://www.enterprisedb.com
Attachment
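Munro's actual generator is in the attachment above. As an illustration of the same idea, here is a standalone sketch (the connection string, table name and counts are invented) that burns pg_multixact/members space by taking FOR SHARE locks on one row from many concurrently open transactions:

/*
 * Each pass takes FOR SHARE locks on the same row from NCONNS open
 * transactions; every additional locker creates a new, larger multixact, so
 * one pass consumes roughly NCONNS * NCONNS / 2 member slots.
 *
 * Build with something like:
 *   cc mxgen.c -o mxgen -I$(pg_config --includedir) -L$(pg_config --libdir) -lpq
 * NCONNS must fit within the server's max_connections.
 */
#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

#define NCONNS 50
#define NLOOPS 100000

static void
check(PGresult *res, PGconn *conn)
{
	if (PQresultStatus(res) != PGRES_COMMAND_OK &&
		PQresultStatus(res) != PGRES_TUPLES_OK)
	{
		fprintf(stderr, "%s", PQerrorMessage(conn));
		exit(1);
	}
	PQclear(res);
}

int
main(void)
{
	PGconn	   *conns[NCONNS];
	int			i,
				loop;

	for (i = 0; i < NCONNS; i++)
	{
		conns[i] = PQconnectdb("dbname=postgres");
		if (PQstatus(conns[i]) != CONNECTION_OK)
		{
			fprintf(stderr, "%s", PQerrorMessage(conns[i]));
			exit(1);
		}
	}

	check(PQexec(conns[0],
				 "DROP TABLE IF EXISTS mx_test;"
				 "CREATE TABLE mx_test (id int PRIMARY KEY);"
				 "INSERT INTO mx_test VALUES (1);"), conns[0]);

	for (loop = 0; loop < NLOOPS; loop++)
	{
		/* lock the same row from every connection, holding all locks open */
		for (i = 0; i < NCONNS; i++)
		{
			check(PQexec(conns[i], "BEGIN"), conns[i]);
			check(PQexec(conns[i],
						 "SELECT id FROM mx_test WHERE id = 1 FOR SHARE"),
				  conns[i]);
		}
		/* release all the locks and go around again */
		for (i = 0; i < NCONNS; i++)
			check(PQexec(conns[i], "COMMIT"), conns[i]);
	}

	for (i = 0; i < NCONNS; i++)
		PQfinish(conns[i]);
	return 0;
}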
Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Amit Kapila
Date:
On Tue, Apr 21, 2015 at 12:34 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> Alvaro Herrera wrote:
>
> > The fix is to raise an ERROR when generating a new multixact, if we
> > detect that doing so would get close to the oldest multixact that the
> > system knows about. If that happens, the solution is to vacuum so that
> > the "oldest" point is advanced a bit more and you have room to generate
> > more multixacts. In production, you would typically adjust the
> > multixact freeze parameters so that "oldest multixact" is advanced more
> > aggressively and you don't hit the ERROR.
>
> Here's a patch. I have tested locally and it closes the issue for me.
> If those affected can confirm that it stops the file removal from
> happening, I'd appreciate it.
>

1. Do you think it makes sense to give a warning in SetMultiXactIdLimit() if we have already reached offsetWarnLimit, as we give for multiWarnLimit?

2.

void
MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB)
{
	if (MultiXactIdPrecedes(MultiXactState->oldestMultiXactId, oldestMulti))
		SetMultiXactIdLimit(oldestMulti, oldestMultiDB);
+	else
+		DetermineSafeOldestOffset(oldestMulti);
}

Why do we need to update the offset stop/warn limits for the above case? Won't it make the warning/error (pertaining to wraparound data loss) appear before it is required?

3.

@@ -1921,6 +1957,12 @@ StartupMultiXact(void)
 	 */
 	pageno = MXOffsetToMemberPage(offset);
 	MultiXactMemberCtl->shared->latest_page_number = pageno;
+
+	/*
+	 * compute the oldest member we need to keep around to avoid old member
+	 * data overrun.
+	 */
+	DetermineSafeOldestOffset(MultiXactState->oldestMultiXactId);
 }

Do we need to try determining the safe oldest offset during startup, considering that we don't allow it in recovery (StartupMultiXact() seems to be called only during recovery)?

4.

AuxiliaryProcessMain()
{
..
	/*
	 * XLOG operations
	 */
	SetProcessingMode(NormalProcessing);

	switch (MyAuxProcType)
	{
		case CheckerProcess:
			/* don't set signals, they're useless here */
			CheckerModeMain();
			proc_exit(1);		/* should never return */

		case BootstrapProcess:
			SetProcessingMode(BootstrapProcessing);
			bootstrap_signals();
			BootStrapXLOG();
			BootstrapModeMain();
			proc_exit(1);		/* should never return */
..
}

Looking at the above code, the new setting of processing mode to BootstrapProcessing looks slightly unclear: basically, first we set the processing mode to Normal and then set it to BootstrapProcessing. Maybe we can add a comment there.

This solution seems like a good workaround for the problem reported; however, ideally there should be some way (via vacuum / checkpoint) by which this increase of files can be prevented. I think after your patch gets committed, we can try to devise a better design to overcome this problem.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Kevin Grittner
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> Here's a patch. I have tested locally and it closes the issue
> for me. If those affected can confirm that it stops the file
> removal from happening, I'd appreciate it.

Based on initial testing, it seems to stop file removal from happening rather too well. Before applying the patch I ran a test that generated files 0000 to 1185D in the members directory. Even setting vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age very low, none of the members files would go away with VACUUM (with and without FREEZE) followed by CHECKPOINT. After applying the patch and starting with a fresh initdb, with very low settings of the vacuum_multixact_* GUCs I get the new error almost immediately, while the only file in the members subdirectory is 0000 and it is 8kB in size.

I think the calculations might be wrong, but I'm not sure what does make sense.

-- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Thomas Munro
Date:
On Tue, Apr 21, 2015 at 12:25 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> Hi Alvaro
>
> On Tue, Apr 21, 2015 at 7:04 AM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> Here's a patch. I have tested locally and it closes the issue for me.
>> If those affected can confirm that it stops the file removal from
>> happening, I'd appreciate it.
>
> I was also starting to look at this problem. For what it's worth,
> here's a client program that I used to generate a lot of multixact
> members. The patch seems to work correctly so far: as the offset
> approached wraparound, I saw the warnings first with appropriate OID
> and members remaining, and then I was blocked from creating new
> multixacts.

One thing I noticed about your patch is that it effectively halves the amount of multixact members you can have on disk. Sure, I'd rather hit an error at 2^31 members than a corrupt database at 2^32 members, but I wondered if we should try to allow the full range to be used. I'm not sure whether there is a valid use case for such massive amounts of pg_multixact/members data (or at least one that won't go away if autovacuum heuristics are changed in a later patch; also, I understand that there are other recent patches that reduce member traffic), but if the plan is to backpatch this patch then I suppose it should ideally not halve the amount of an important resource you can use in an existing system when people do a point upgrade.

Here's a small patch (that applies after your patch) to show how this could be done, using three-way comparisons with an explicit boundary to detect wraparound. There may be other technical problems (for example, MultiXactAdvanceNextMXact still uses MultiXactOffsetPrecedes), or this may be a bad idea just because it breaks with the well-established convention for wraparound detection used by xids.

Also, I wanted to make sure I could reproduce the original bug/corruption in unpatched master with the client program I posted. Here are my notes on doing that (sorry if they belabour the obvious; partly this is just me learning how SLRUs and multixacts work...):

========

Member wraparound happens after segment file "14078" (assuming default page size, you get 32 pages per segment, and 1636 members per page (409 groups of 4 + some extra data), and our max member offset wraps after 0xffffffff, and 0xffffffff / 1636 / 32 = 82040 = 0x14078; incidentally, that final segment is a shorter one).

Using my test client with 500 sessions and 35k loops I observed this: it wrapped back around to writing to member file "0000" after creating "14078", which is obviously broken, because the start of member segment "0000" holds members for multixact ID 1, which was still in play (it was datminmxid for template0). Looking at the members of multixact ID 1, I see recent xids:

postgres=# select pg_get_multixact_members('1'::xid);
 pg_get_multixact_members
--------------------------
 (34238661,sh)
 (34238662,sh)
(2 rows)

Note that pg_get_multixact_members knows the correct *number* of members for multixact ID 1, it's just that it's looking at members from some much later multixact. By a tedious binary search I found it:

postgres=# select pg_get_multixact_members('17094780'::xid);
 pg_get_multixact_members
--------------------------
 ... snip ...
 (34238660,sh)
 (34238661,sh)  <-- here they are!
 (34238662,sh)  <--
 (34238663,sh)
 ... snip ...

After a checkpoint, I saw that all the files got deleted except for a few consecutively named files starting at "0000", which would be correct behavior in general, if we hadn't allowed the member offset to wrap. It had correctly kept the segments starting with the one holding the members of multixact ID 1 (the cluster-wide oldest) up until the one that corresponds to MultiXactState->nextOffset. My test program had blown right past member offset 0xffffffff and back to 0 and then kept going. The truncation code isn't the problem per se.

To produce the specific error message seen by the bug reporter via normal interactions from a test program, I think we need some magic that I can't figure out how to do yet: we need to run a query that accesses a multixact that has a member offset from before offset wraparound, e.g. 0xffffffff or similar, but whose members are not on a page that is still in memory, after a checkpoint that has unlinked the segment file, so it can try to load it and discover that the segment file is missing! So a pretty complex interaction of concurrent processes, timing and caches. We can more artificially stimulate the error by explicitly asking for multixact members like this, though:

postgres=# select pg_get_multixact_members('10000000'::xid);
ERROR: could not access status of transaction 10000000
DETAIL: Could not open file "pg_multixact/members/BB55": No such file or directory.

That's a totally valid multixact ID, obviously, since it's been able to figure out which segment to look in for its members. Here's one that tries to open the segment that comes immediately before "0000" in modulo numbering:

postgres=# select pg_get_multixact_members('17094770'::xid);
ERROR: could not access status of transaction 17094770
DETAIL: Could not open file "pg_multixact/members/14078": No such file or directory.

If I tried it with 17094779, the multixact ID immediately before the one that has overwritten "0000", it does actually work, presumably because its pages happen to be buffered for me so it doesn't try to open the file (guessing here). I don't currently believe it's necessary to reproduce that step via a test program anyway; the root problem is clear enough just from watching the thing wrap.

-- Thomas Munro http://www.enterprisedb.com
Attachment
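The segment arithmetic from the notes above, as a small standalone check; it assumes the constants quoted in the thread for the default 8kB page size (1636 members per page, 32 pages per segment, offsets wrapping after 0xffffffff):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	const uint32_t members_per_page = 1636;	/* 409 groups of 4 plus flag bytes */
	const uint32_t pages_per_segment = 32;
	uint32_t	last_offset = 0xffffffff;
	uint32_t	last_page = last_offset / members_per_page;
	uint32_t	last_segment = last_page / pages_per_segment;

	printf("last members segment: %X\n", last_segment);	/* prints 14078 */
	return 0;
}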
Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Robert Haas
Date:
On Tue, Apr 21, 2015 at 5:16 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> Here's a patch. I have tested locally and it closes the issue
>> for me. If those affected can confirm that it stops the file
>> removal from happening, I'd appreciate it.
>
> Based on initial testing, it seems to stop file removal from
> happening rather too well. Before applying the patch I ran a test
> that generated files 0000 to 1185D in the members directory.
> Even setting vacuum_multixact_freeze_min_age and
> vacuum_multixact_freeze_table_age very low, none of the members
> files would go away with VACUUM (with and without FREEZE) followed
> by CHECKPOINT. After applying the patch and starting with a fresh
> initdb, with very low settings of the vacuum_multixact_* GUCs I get
> the new error almost immediately, while the only file in the
> members subdirectory is 0000 and it is 8kB in size.
>
> I think the calculations might be wrong, but I'm not sure what does
> make sense.

Can anyone else reproduce this?

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Thomas Munro
Date:
On Fri, Apr 24, 2015 at 5:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Apr 21, 2015 at 5:16 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
>> Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>> Here's a patch. I have tested locally and it closes the issue
>>> for me. If those affected can confirm that it stops the file
>>> removal from happening, I'd appreciate it.
>>
>> Based on initial testing, it seems to stop file removal from
>> happening rather too well. Before applying the patch I ran a test
>> that generated files 0000 to 1185D in the members directory.
>> Even setting vacuum_multixact_freeze_min_age and
>> vacuum_multixact_freeze_table_age very low, none of the members
>> files would go away with VACUUM (with and without FREEZE) followed
>> by CHECKPOINT. After applying the patch and starting with a fresh
>> initdb, with very low settings of the vacuum_multixact_* GUCs I get
>> the new error almost immediately, while the only file in the
>> members subdirectory is 0000 and it is 8kB in size.
>>
>> I think the calculations might be wrong, but I'm not sure what does
>> make sense.
>
> Can anyone else reproduce this?

Yes. This happens in a fresh initdb-ed database. We start out with oldestOffset = 0, oldestOffsetStopLimit = 4294914944 (that's 0 - safety margin), nextOffset = 0. Then the first attempt to generate a new multixact ID fails, because MultiXactOffsetPrecedes(oldestOffsetStopLimit, nextOffset + nmembers) is true.

I guess the patch works correctly when you start out with an unpatched database server and generate more than 2.whatever billion members, and then you restart with the patch, and then start adding more members until MultiXactOffsetPrecedes(...) returns true and you get the error.

That's why I proposed not using xid-like logic, and instead using a type of three-way comparison that allows you to see when nextOffset would 'cross' oldestOffsetStopLimit, instead of the two-way comparison that considers half the number-space to be in the past and half in the future, in my earlier message.

-- Thomas Munro http://www.enterprisedb.com
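A sketch of the three-way test being described (the function name here is hypothetical; the patch's version, quoted later during review, has the same shape): starting at "start" and moving forward with wraparound, does the step to "finish" reach or pass "boundary"? Unlike an xid-style two-way comparison, this works across the full 2^32 range of member offsets.

#include <stdbool.h>
#include <stdint.h>

bool
offset_would_cross(uint32_t start, uint32_t finish, uint32_t boundary)
{
	if (start < boundary)
		return finish >= boundary || finish < start;
	else
		return finish >= boundary && finish < start;
}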
Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Alvaro Herrera
Date:
Thomas Munro wrote:

> That's why I proposed not using xid-like logic, and instead using a
> type of three-way comparison that allows you to see when nextOffset
> would 'cross' oldestOffsetStopLimit, instead of the two-way comparison
> that considers half the number-space to be in the past and half in the
> future, in my earlier message.

Yeah, that bit made sense to me.

-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Robert Haas
Date:
On Thu, Apr 23, 2015 at 9:59 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Thomas Munro wrote:
>> That's why I proposed not using xid-like logic, and instead using a
>> type of three-way comparison that allows you to see when nextOffset
>> would 'cross' oldestOffsetStopLimit, instead of the two-way comparison
>> that considers half the number-space to be in the past and half in the
>> future, in my earlier message.
>
> Yeah, that bit made sense to me.

In addition to preventing the corruption, I think we also need a back-patchable fix for AV to try to keep this situation from happening in the first place. We do not want the system to get stuck in a situation where multixact ID (offset) usage is low, so autovacuum does nothing, but member usage is high, so the user just keeps getting an error whenever they do anything that could cause a new MXID to be created. Perhaps if we'd thought of it we would have added yet another column to pg_class to track the oldest offset referenced by any MXID in a particular relation, but that sounds kind of icky on general principle and certainly wouldn't be back-patchable.

What I think we should do is notice when members utilization exceeds offsets utilization and progressively ramp back the effective value of autovacuum_multixact_freeze_max_age (and maybe also vacuum_multixact_freeze_table_age and vacuum_multixact_freeze_min_age) so that autovacuum (and maybe also manual vacuums) gets progressively more aggressive about trying to advance relminmxid.

Suppose we decide that when the "members" space is 75% used, we've got a big problem and want to treat autovacuum_multixact_freeze_max_age as effectively zero. Conversely, suppose that when the members space is at a lower usage percentage than the offsets space, or when it's anyway less than autovacuum_multixact_freeze_max_age, we define that as completely acceptable. If we're somewhere above the "no problem" threshold but below the "big problem" threshold, then we calculate what percentage of the way we are from the "no problem" threshold to the "big problem" threshold and reduce autovacuum_multixact_freeze_max_age by that percentage.

Example: autovacuum_multixact_freeze_max_age is 25% of 2^32 and we're using 20% of the offsets space but 40% of the members space. We're (40 - 20) / (75 - 20) = 36% of the way to the 75% threshold we never want to exceed, so we reduce the effective value of autovacuum_multixact_freeze_max_age by 36%. In this case, that means treating the configured value of 1073741824 as if it were 1073741824 * (35/55) = 683290251. Hopefully that's enough to trigger enough vacuuming to cause some relminmxid advancement, but if not, and the situation continues to worsen, we'll get more and more aggressive.

This may not be the right proposal in detail, but I think we should do something. I don't like the idea of telling users that they can no longer count on autovacuum to prevent wraparound, and that if they don't manually tune their vacuum settings correctly, their system may just stop working at some point.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
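A small standalone sketch of the ramp-down just described (the helper name and exact thresholds are hypothetical, not committed code); it follows only the worked example and uses the offsets fraction as the "no problem" point, whereas Haas would also treat member usage below the configured max age as acceptable:

#include <stdio.h>
#include <stdint.h>

static uint32_t
effective_freeze_max_age(uint32_t configured, double offsets_used, double members_used)
{
	const double big_problem = 0.75;		/* member usage where the max age reaches zero */
	double		no_problem = offsets_used;	/* below this, leave the setting alone */
	double		ramp;

	if (members_used <= no_problem)
		return configured;

	/* how far along we are from "no problem" toward "big problem" */
	ramp = (members_used - no_problem) / (big_problem - no_problem);
	if (ramp > 1.0)
		ramp = 1.0;
	return (uint32_t) (configured * (1.0 - ramp));
}

int
main(void)
{
	/* the worked example: max age = 25% of 2^32, offsets 20% used, members 40% used */
	printf("%u\n", effective_freeze_max_age(1073741824u, 0.20, 0.40));	/* prints 683290251 */
	return 0;
}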
Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Kevin Grittner
Date:
Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Apr 21, 2015 at 5:16 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
>> Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>> Here's a patch. I have tested locally and it closes the issue
>>> for me. If those affected can confirm that it stops the file
>>> removal from happening, I'd appreciate it.
>>
>> Based on initial testing, it seems to stop file removal from
>> happening rather too well. Before applying the patch I ran a test
>> that generated files 0000 to 1185D in the members directory.
>> Even setting vacuum_multixact_freeze_min_age and
>> vacuum_multixact_freeze_table_age very low, none of the members
>> files would go away with VACUUM (with and without FREEZE) followed
>> by CHECKPOINT. After applying the patch and starting with a fresh
>> initdb, with very low settings of the vacuum_multixact_* GUCs I get
>> the new error almost immediately, while the only file in the
>> members subdirectory is 0000 and it is 8kB in size.
>>
>> I think the calculations might be wrong, but I'm not sure what does
>> make sense.
>
> Can anyone else reproduce this?

I think I see why I was seeing this and nobody else was -- I was testing the cleanup on an otherwise quiescent cluster. It seems that no amount of VACUUM and CHECKPOINT will clean up the members subdirectory in the absence of processes adding more members. But after performing the VACUUMs and CHECKPOINT, if I start the test program to add more multi-transactions with lots of members, *then* the members subdirectory gets cleaned up.

That seems confusing and a bit annoying, but is otherwise benign. I would put "allow VACUUM followed by CHECKPOINT to delete unneeded files from the members subdirectory" on the "nice to have" list, but would not want to delay a fix for the corruption issues for it.

-- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Robert Haas
Date:
On Fri, Apr 24, 2015 at 5:34 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> I think I see why I was seeing this and nobody else was

Thomas said he reproduced it. No?

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Kevin Grittner
Date:
Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Apr 24, 2015 at 5:34 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
>> I think I see why I was seeing this and nobody else was
>
> Thomas said he reproduced it. No?

I should have been more clear about what I meant by "this". Thomas said he reproduced the immediate errors with Álvaro's patch, but if he said anything about files in the members subdirectory not going away with VACUUM followed by CHECKPOINT, regardless of configuration, I missed it. It turns out that these steps only "prime the pump" for the files to be deleted on subsequent access to the members SLRU. That doesn't contribute to database corruption, but it sure can be confusing for someone trying to clean things up.

-- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Alvaro Herrera
Date:
Robert Haas wrote:
> On Thu, Apr 23, 2015 at 9:59 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > Thomas Munro wrote:
> >> That's why I proposed not using xid-like logic, and instead using a
> >> type of three-way comparison that allows you to see when nextOffset
> >> would 'cross' oldestOffsetStopLimit, instead of the two-way comparison
> >> that considers half the number-space to be in the past and half in the
> >> future, in my earlier message.
> >
> > Yeah, that bit made sense to me.
>
> In addition to preventing the corruption, I think we also need a
> back-patchable fix for AV to try to keep this situation from happening
> in the first place.

Let me push a patch to fix the corruption, and then we can think of ways to teach autovacuum about the problem. I'm not optimistic about that, honestly: as with all GUC settings, these are individual for each process, and there's no way for one process to affect the values that are seen by other processes (autovac workers). The only idea that comes to mind is to publish values in shared memory, and autovac workers would read them from there instead of using the normal GUC values.

> What I think we should do is notice when members utilization exceeds
> offset utilization and progressively ramp back the effective value of
> autovacuum_multixact_freeze_max_age (and maybe also
> vacuum_multixact_freeze_table_age and vacuum_multixact_freeze_min_age)
> so that autovacuum (and maybe also manual vacuums) get progressively
> more aggressive about trying to advance relminmxid. Suppose we decide
> that when the "members" space is 75% used, we've got a big problem and
> want to treat autovacuum_multixact_freeze_max_age to effectively be
> zero.

I think we can easily determine the rate of multixact member space consumption and compare it to the rate of multixact ID consumption; considering the historical multixact size (number of members per multixact), it would be possible to change the freeze ages by the same fraction, so that autovac effectively behaves as if the members consumption rate is what is driving the freezing instead of the ID consumption rate. That way, we don't have to jump suddenly from "normal" to "emergency" behavior at some fixed threshold.

> This may not be the right proposal in detail, but I think we should do
> something.

No disagreement on that.

-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Alvaro Herrera
Date:
Kevin Grittner wrote:
> Robert Haas <robertmhaas@gmail.com> wrote:
> > On Fri, Apr 24, 2015 at 5:34 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> >> I think I see why I was seeing this and nobody else was
> >
> > Thomas said he reproduced it. No?
>
> I should have been more clear about what I meant by "this". Thomas
> said he reproduced the immediate errors with Álvaro's patch, but if
> he said anything about files in the members subdirectory not going
> away with VACUUM followed by CHECKPOINT, regardless of
> configuration, I missed it. It turns out that these steps only
> "prime the pump" for the files to be deleted on subsequent access
> to the members SLRU. That doesn't contribute to database
> corruption, but it sure can be confusing for someone trying to
> clean things up.

The whole matter of truncating multixact is a longish trip. It starts when autovacuum completes a round or VACUUM finishes processing a table; these things call vac_update_datfrozenxid. That routine scans pg_class and sees if datfrozenxid or datminmxid can be advanced from their current points; only if any of these can, vac_truncate_clog is called. That routine calls SetMultiXactIdLimit(), which determines a new MultiXactState->oldestMultiXactId (saved in shared memory). The involvement of vacuum stops here; the following steps happen at checkpoint.

At checkpoint, oldestMultiXactId is saved to pg_control as part of a checkpoint (MultiXactGetCheckptMulti); the checkpointed value is passed back to multixact by MultiXactSetSafeTruncate, which saves it in shmem as lastCheckpointedOldest. The same checkpoint later calls TruncateMultiXact, which can remove files.

Note that if vac_update_datfrozenxid finds that the pg_database values cannot be changed (during the vacuum phase), the multixact truncation point is not advanced and checkpoint has nothing to do. But note that the clog counter advancing enough will also trigger multixact truncation.

-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Robert Haas
Date:
On Mon, Apr 27, 2015 at 10:59 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Let me push a patch to fix the corruption, and then we can think of ways
> to teach autovacuum about the problem.

Sounds good to me. Are you going to do that today?

> I'm not optimistic about that,
> honestly: as with all GUC settings, these are individual for each process,
> and there's no way for one process to affect the values that are seen by
> other processes (autovac workers). The only idea that comes to mind is
> to publish values in shared memory, and autovac workers would read them
> from there instead of using normal GUC values.

I don't think we could store values for the parameters directly in shared memory, because I think that at least some of those GUCs are per-session changeable. But we might be able to store weighting factors in shared memory that get applied to whatever the values in the current session are. Or else maybe each backend can just recompute the information for itself when it needs it.

>> What I think we should do is notice when members utilization exceeds
>> offset utilization and progressively ramp back the effective value of
>> autovacuum_multixact_freeze_max_age (and maybe also
>> vacuum_multixact_freeze_table_age and vacuum_multixact_freeze_min_age)
>> so that autovacuum (and maybe also manual vacuums) get progressively
>> more aggressive about trying to advance relminmxid. Suppose we decide
>> that when the "members" space is 75% used, we've got a big problem and
>> want to treat autovacuum_multixact_freeze_max_age to effectively be
>> zero.
>
> I think we can easily determine the rate of multixact member space
> consumption and compare to the rate of multixact ID consumption;
> considering the historical multixact size (number of members per
> multixact) it would be possible to change the freeze ages by the same
> fraction, so that autovac effectively behaves as if the members
> consumption rate is what is driving the freezing instead of ID
> consumption rate. That way, we don't have to jump suddenly from
> "normal" to "emergency" behavior at some fixed threshold.

Right. I think that not jumping from normal mode to emergency mode is quite important, and I was trying to describe a system that would gradually ramp up the pressure rather than a system that would do nothing for a while and then suddenly go ballistic.

With regard to what you've outlined here, we need to make sure that if the multixact rate varies widely, we still clean up before we hit autovac wraparound. That's why I think it should be driven off of the fraction of the available address space which is currently consumed, not some kind of short-term measure of mxact size or generation rate. I'm not sure exactly what you have in mind here.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
David Gould
Date:
On Mon, 27 Apr 2015 11:59:10 -0300 Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> I think we can easily determine the rate of multixact member space
> consumption and compare to the rate of multixact ID consumption;
> considering the historical multixact size (number of members per
> multixact) it would be possible to change the freeze ages by the same
> fraction, so that autovac effectively behaves as if the members
> consumption rate is what is driving the freezing instead of ID
> consumption rate. That way, we don't have to jump suddenly from
> "normal" to "emergency" behavior at some fixed threshold.

I would like to add a data point: one of my clients has a plpgsql function that manages to use ten to 30 thousand multixact IDs per invocation. It interacts with a remote resource and sets an exception handler on a per-item basis to catch errors on the remote call.

-dg

-- David Gould 510 282 0869 daveg@sonic.net If simplicity worked, the world would be overrun with insects.
Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Thomas Munro
Date:
On Tue, Apr 28, 2015 at 4:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Apr 27, 2015 at 10:59 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> I think we can easily determine the rate of multixact member space
>> consumption and compare to the rate of multixact ID consumption;
>> considering the historical multixact size (number of members per
>> multixact) it would be possible to change the freeze ages by the same
>> fraction, so that autovac effectively behaves as if the members
>> consumption rate is what is driving the freezing instead of ID
>> consumption rate. That way, we don't have to jump suddenly from
>> "normal" to "emergency" behavior at some fixed threshold.
>
> Right. I think that not jumping from normal mode to emergency mode is
> quite important, and was trying to describe a system that would
> gradually ramp up the pressure rather than a system that would do
> nothing for a while and then suddenly go ballistic.
>
> With regard to what you've outlined here, we need to make sure that if
> the multixact rate varies widely, we still clean up before we hit
> autovac wraparound. That's why I think it should be driven off of the
> fraction of the available address space which is currently consumed,
> not some kind of short-term measure of mxact size or generation rate.
> I'm not sure exactly what you have in mind here.

Here is a work-in-progress patch that:

(1) Fixes the boundary bug already discussed (as before, except that I fixed up the comments in MultiXactOffsetWouldWrap() to survive pgindent, based on feedback from Kevin).

(2) Makes autovacuum adjust the effective value of autovacuum_multixact_freeze_max_age down to zero as the fraction of the used addressable member offset range approaches 75%, and also makes vacuum_multixact_freeze_min_age use the same value if it is lower than the configured min age.

When I run this with my explode_mxact_members.c program, autovacuum moves datminmxid forwards, avoiding the error. The algorithm in autovacuum_multixact_freeze_max_age_adjusted is possibly too simple, but it should at least spread out wraparound autovacuums rather than suddenly going ballistic.

Do you think it's OK that MultiXactAdvanceNextMXact still uses MultiXactOffsetPrecedes to compare two member offsets? Can the value from an XLog record being replayed be more than 2.2 billion away from MultiXactState->nextOffset?

(The attachment memberswrap-2.patch is a diff against master, and memberswrap-2-incremental.patch is a diff against master with Alvaro's patch applied.)

-- Thomas Munro http://www.enterprisedb.com
Attachment
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Alvaro Herrera
Date:
Thomas Munro wrote:

> One thing I noticed about your patch is that it effectively halves the
> amount of multixact members you can have on disk. Sure, I'd rather
> hit an error at 2^31 members than a corrupt database at 2^32 members,
> but I wondered if we should try to allow the full range to be used.

Ah, yeah, we do want the full range; that's already built into the code elsewhere.

In this version, I used your WouldWrap function, but there was a bug in your formulation of the call site: after the WARNING has been issued once, it is never issued again for that wraparound cycle, because the second time around the nextOffset has already crossed the boundary and your routine returns false. IMO this is wrong, and the warning should be issued every time. To fix that problem I removed the offsetWarnLimit altogether, and instead do WouldWrap() of the value against offsetStopLimit minus the 20 segments. That way, the warning is issued continuously until the offsetStopLimit is reached (once there, obviously, only the error is thrown, not the warning, which is correct).

I also added a call to DetermineSafeOldestOffset() in TrimMultiXact: as far as I can tell, this is necessary for the time when a standby exits recovery, because when InRecovery we return early from DetermineSafeOldestOffset(), so the safe point would never get set.

-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Thomas Munro
Date:
On Tue, Apr 28, 2015 at 6:23 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Thomas Munro wrote:
>
>> One thing I noticed about your patch is that it effectively halves the
>> amount of multixact members you can have on disk. Sure, I'd rather
>> hit an error at 2^31 members than a corrupt database at 2^32 members,
>> but I wondered if we should try to allow the full range to be used.
>
> Ah, yeah, we do want the full range; that's already built into the code
> elsewhere.
>
> In this version, I used your WouldWrap function, but there was a bug in
> your formulation of the call site: after the WARNING has been issued
> once, it is never issued again for that wraparound cycle, because the
> second time around the nextOffset has already crossed the boundary and
> your routine returns false. IMO this is wrong and the warning should be
> issued every time. To fix that problem I removed the offsetWarnLimit
> altogether, and instead do WouldWrap() of the value against
> offsetStopLimit minus the 20 segments. That way, the warning is issued
> continuously until the offsetStopLimit is reached (once there,
> obviously, only the error is thrown, not the warning, which is correct.)

+1

Tomorrow I will send a separate patch for the autovacuum changes that I sent earlier. Let's discuss and hopefully eventually commit that separately.

-- Thomas Munro http://www.enterprisedb.com
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Robert Haas
Date:
On Tue, Apr 28, 2015 at 2:23 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Ah, yeah, we do want the full range; that's already built into the code
> elsewhere.
>
> In this version, I used your WouldWrap function, but there was a bug in
> your formulation of the call site: after the WARNING has been issued
> once, it is never issued again for that wraparound cycle, because the
> second time around the nextOffset has already crossed the boundary and
> your routine returns false. IMO this is wrong and the warning should be
> issued every time. To fix that problem I removed the offsetWarnLimit
> altogether, and instead do WouldWrap() of the value against
> offsetStopLimit minus the 20 segments. That way, the warning is issued
> continuously until the offsetStopLimit is reached (once there,
> obviously, only the error is thrown, not the warning, which is correct.)
>
> I also added a call to DetermineSafeOldestOffset() in TrimMultiXact:
> as far as I can tell, this is necessary for the time when a standby
> exits recovery, because when InRecovery we return early from
> DetermineSafeOldestOffset() so the safe point would never get set.

Putting the period inside the parentheses here looks weird?

+                   "This command would create a multixact with %u members, which exceeds remaining space (%u members.)",

Maybe rephrase as: "This command would create a multixact with %u members, but the remaining space is only enough for %u members."

I don't think this should have a comma:

+           errhint("Execute a database-wide VACUUM in that database, with reduced vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age settings.")));

This looks like excess brace-ification:

+       if (start < boundary)
+       {
+               return finish >= boundary || finish < start;
+       }
+       else
+       {
+               return finish >= boundary && finish < start;
+       }

I think this is confusing:

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Robert Haas
Date:
On Tue, Apr 28, 2015 at 10:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think this is confusing:

Oops, hit send too soon.

+/*
+ * Read the offset of the first member of the given multixact.
+ */

This is confusing to me because the two subdirectories of pg_multixact are called "members" and "offsets". Here you are talking about the offset of the first member. Maybe I'm just slow, but that seems like conflating terminology. You end up with a function called read_offset_for_multi() that is actually looking up information about members. Ick.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Alvaro Herrera
Date:
Thomas Munro wrote:

> > In this version, I used your WouldWrap function, [...]
>
> +1

Pushed.

-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Alvaro Herrera
Date:
I sure wish this had arrived two minutes earlier ...

Robert Haas wrote:

> Putting the period inside the parentheses here looks weird?
>
> +                   "This command would create a multixact with %u
> members, which exceeds remaining space (%u members.)",
>
> Maybe rephrase as: "This command would create a multixact with %u
> members, but the remaining space is only enough for %u members."

WFM, will change.

> I don't think this should have a comma:
>
> +           errhint("Execute a database-wide VACUUM in that
> database, with reduced vacuum_multixact_freeze_min_age and
> vacuum_multixact_freeze_table_age settings.")));

Ditto.

> This looks like excess brace-ification:
>
> +       if (start < boundary)
> +       {
> +               return finish >= boundary || finish < start;
> +       }
> +       else
> +       {
> +               return finish >= boundary && finish < start;
> +       }

Yeah, agreed. Will undo that change. (I disliked the comment above the indented single statement, so added braces, but then moved the comment. I should have removed the braces at that point.)

> I think this is confusing:
>
> +/*
> + * Read the offset of the first member of the given multixact.
> + */
>
> This is confusing to me because the two subdirectories of pg_multixact
> are called "members" and "offsets". Here you are talking about the
> offset of the first member. Maybe I'm just slow, but that seems like
> conflating terminology. You end up with a function called
> read_offset_for_multi() that is actually looking up information about
> members. Ick.

Yeah, I introduced the confusing terminology while inventing multixacts initially and have regretted it many times. I will think about a better name for this. (Meanwhile, on IM Robert suggested find_start_of_first_multi_member.)

-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Robert Haas
Date:
On Tue, Apr 28, 2015 at 10:56 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> I sure wish this had arrived two minutes earlier ...

Sorry about that. :-)

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Alvaro Herrera
Date:
Alvaro Herrera wrote:

> > I think this is confusing:
> >
> > +/*
> > + * Read the offset of the first member of the given multixact.
> > + */
> >
> > This is confusing to me because the two subdirectories of pg_multixact
> > are called "members" and "offsets". Here you are talking about the
> > offset of the first member. Maybe I'm just slow, but that seems like
> > conflating terminology. You end up with a function called
> > read_offset_for_multi() that is actually looking up information about
> > members. Ick.
>
> Yeah, I introduced the confusing terminology while inventing multixacts
> initially and have regretted it many times. I will think about a better
> name for this. (Meanwhile, on IM Robert suggested
> find_start_of_first_multi_member)

Pushed. I chose find_multixact_start() as a name for this function.

-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Jeff Janes
Date:
On Tue, Apr 28, 2015 at 10:54 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Alvaro Herrera wrote:
> > > I think this is confusing:
> > >
> > > +/*
> > > + * Read the offset of the first member of the given multixact.
> > > + */
> > >
> > > This is confusing to me because the two subdirectories of pg_multixact
> > > are called "members" and "offsets". Here you are talking about the
> > > offset of the first member. Maybe I'm just slow, but that seems like
> > > conflating terminology. You end up with a function called
> > > read_offset_for_multi() that is actually looking up information about
> > > members. Ick.
> >
> > Yeah, I introduced the confusing terminology while inventing multixacts
> > initially and have regretted it many times. I will think about a better
> > name for this. (Meanwhile, on IM Robert suggested
> > find_start_of_first_multi_member)
>
> Pushed. I chose find_multixact_start() as a name for this function.

Starting with commit b69bf30b9bfacafc733a9ba7 and continuing to this just-described commit, I can no longer upgrade from a 9.2.10 database using pg_upgrade. I can reproduce it from a clean 9.2 install which has never even been started up.

Deleting files from new pg_multixact/offsets       ok
Setting oldest multixact ID on new cluster         ok
Resetting WAL archives                             ok
*failure*

Consult the last few lines of "pg_upgrade_server.log" for the probable cause of the failure.

The last few lines are:

command: "../bisect/bin/pg_ctl" -w -l "pg_upgrade_server.log" -D "../data2/" -o "-p 50432 -b -c synchronous_commit=off -c fsync=off -c full_page_writes=off -c listen_addresses='' -c unix_socket_permissions=0700 -c unix_socket_directories='/home/jjanes/pgsql/git'" start >> "pg_upgrade_server.log" 2>&1
waiting for server to start....LOG: database system was shut down at 2015-04-28 11:08:18 PDT
FATAL: could not access status of transaction 1
DETAIL: Could not open file "pg_multixact/offsets/0000": No such file or directory.
LOG: startup process (PID 3977) exited with exit code 1
LOG: aborting startup due to startup process failure

Cheers,

Jeff
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Alvaro Herrera
Date:
Jeff Janes wrote:

> Starting with commit b69bf30b9bfacafc733a9ba7 and continuing to this
> just-described commit, I can no longer upgrade from a 9.2.10 database using
> pg_upgrade.

How annoying, thanks for the report. I reproduced it here. The problem is that the upgrade process removes the files from pg_multixact/offsets, which is what we now want to read on startup. Not yet sure how to fix it.

-- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Alvaro Herrera
Date:
Jeff Janes wrote: > Starting with commit b69bf30b9bfacafc733a9ba7 and continuing to this > just-described commit, I can no longer upgrade from a 9.2.10 database using > pg_upgrade. Here's a patch, but I don't like it too much. Will think more about it, probably going to push something tomorrow. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Thomas Munro
Date:
On Tue, Apr 28, 2015 at 6:30 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > Tomorrow I will send a separate patch for the autovacuum changes that > I sent earlier. Let's discuss and hopefully eventually commit that > separately. Here is a separate patch which makes autovacuum start a wrap-around vacuum sooner if the member space is running out, by adjusting autovacuum_multixact_freeze_max_age using a progressive scaling factor. This version includes a clearer implementation of autovacuum_multixact_freeze_max_age_adjusted provided by Kevin Grittner off-list. -- Thomas Munro http://www.enterprisedb.com
Attachment
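As a rough standalone illustration of the progressive scaling idea described in the message above: the function name echoes autovacuum_multixact_freeze_max_age_adjusted() from the patch, but the 25% threshold, the linear ramp and the helper shape are assumptions made for illustration, not the patch's actual logic (which, per the later discussion, derives usage from the oldest multixact's member offset rather than taking it as a parameter).

#include <stdint.h>
#include <stdio.h>

#define MEMBER_SPACE        4294967295u  /* 2^32 - 1 addressable members */
#define SAFE_USED_FRACTION  0.25         /* assumed "no adjustment" threshold */

/* Scale the effective freeze max age down as member space fills up. */
static int
adjusted_freeze_max_age(int freeze_max_age, uint32_t members_in_use)
{
    double used = (double) members_in_use / (double) MEMBER_SPACE;

    if (used <= SAFE_USED_FRACTION)
        return freeze_max_age;      /* plenty of member space: keep the GUC value */

    /* Ramp linearly towards zero as member usage approaches 100%. */
    return (int) (freeze_max_age * (1.0 - used) / (1.0 - SAFE_USED_FRACTION));
}

int
main(void)
{
    int pct;

    /* 400000000 is the default autovacuum_multixact_freeze_max_age. */
    for (pct = 0; pct <= 90; pct += 10)
        printf("%3d%% of member space used -> effective freeze max age %d\n",
               pct,
               adjusted_freeze_max_age(400000000,
                                       (uint32_t) (pct / 100.0 * MEMBER_SPACE)));
    return 0;
}

With these assumed numbers, 60% member usage would drop the default 400 million down to roughly 213 million, and 90% usage to about 53 million, so wraparound vacuums would be launched progressively sooner.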
Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Thomas Munro
Date:
On Tue, Apr 21, 2015 at 5:12 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > On Tue, Apr 21, 2015 at 12:34 AM, Alvaro Herrera <alvherre@2ndquadrant.com> > wrote: >> >> Alvaro Herrera wrote: >> >> > The fix is to raise an ERROR when generating a new multixact, if we >> > detect that doing so would get close to the oldest multixact that the >> > system knows about. If that happens, the solution is to vacuum so that >> > the "oldest" point is advanced a bit more and you have room to generate >> > more multixacts. In production, you would typically adjust the >> > multixact freeze parameters so that "oldest multixact" is advanced more >> > aggressively and you don't hit the ERROR. >> >> Here's a patch. I have tested locally and it closes the issue for me. >> If those affected can confirm that it stops the file removal from >> happening, I'd appreciate it. >> > > 1. Do you think it makes sense to give warning in SetMultiXactIdLimit() > if we have already reached offsetWarnLimit as we give for multiWarnLimit? Amit and I discussed this offline. Yes, we could include a warning message here, for consistency with the warnings you get about xid wraparound. Concretely I think it means that you would also get warnings about being near the member space limit from vacuums, rather than just from attempts to allocate new multixact IDs. The test to detect an impending members-would-wrap ERROR would be similar to what we do in GetNewMultiXactId, so something like: MultiXactOffsetWouldWrap(offsetStopLimit, nextOffset, MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT * OFFSET_WARN_SEGMENTS) I'm not sure whether it's worth writing an extra patch for this though, because if you're in this situation, your logs are already overflowing with warnings from the regular backends that are generating multixacts. Thoughts anyone? -- Thomas Munro http://www.enterprisedb.com
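For readers who have not looked at multixact.c, the "would wrap" test mentioned above is essentially modular arithmetic over the circular 32-bit member space. The following is only a toy model with made-up values, not the real MultiXactOffsetWouldWrap() implementation:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* true if advancing 'from' by 'distance' reaches or passes 'boundary' */
static bool
offset_would_wrap(uint32_t boundary, uint32_t from, uint32_t distance)
{
    /* unsigned subtraction gives the forward distance around the circle */
    return (uint32_t) (boundary - from) <= distance;
}

int
main(void)
{
    uint32_t stop_limit  = 1000;         /* pretend the oldest needed member lives here */
    uint32_t next_offset = 4294966000u;  /* near the top of the 32-bit space */

    printf("%d\n", offset_would_wrap(stop_limit, next_offset, 5000)); /* 1: would wrap */
    printf("%d\n", offset_would_wrap(stop_limit, next_offset, 100));  /* 0: still room */
    return 0;
}

A warning threshold like the one being discussed would simply test against a boundary placed some number of segments short of the stop limit.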
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Amit Kapila
Date:
On Wed, Apr 29, 2015 at 5:44 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > > On Tue, Apr 28, 2015 at 6:30 PM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: > > Tomorrow I will send a separate patch for the autovacuum changes that > > I sent earlier. Let's discuss and hopefully eventually commit that > > separately. > > Here is a separate patch which makes autovacuum start a wrap-around > vacuum sooner if the member space is running out, by adjusting > autovacuum_multixact_freeze_max_age using a progressive scaling > factor. This version includes a clearer implementation of > autovacuum_multixact_freeze_max_age_adjusted provided by Kevin > Grittner off-list. > Some comments: 1. It seems that you are using autovacuum_multixact_freeze_max_age_adjusted() only at couple of places, like it is not used in below calc: vacuum_set_xid_limits() { .. mxid_freezemin = Min(mxid_freezemin, autovacuum_multixact_freeze_max_age / 2); .. } What is the reason of using this calculation at some places and not at other places? 2. @@ -2684,8 +2719,8 @@ relation_needs_vacanalyze(Oid relid, : autovacuum_freeze_max_age; multixact_freeze_max_age = (relopts && relopts->multixact_freeze_max_age >= 0) - ? Min(relopts->multixact_freeze_max_age, autovacuum_multixact_freeze_max_age) - : autovacuum_multixact_freeze_max_age; + ? Min(relopts->multixact_freeze_max_age, autovacuum_multixact_freeze_max_age_adjusted()) + : autovacuum_multixact_freeze_max_age_adjusted(); It seems that it will try to read from offset file for each relation which might or might not be good, shall we try to cache the oldestMultiXactMemberOffset? 3. currently there is some minimum limit of autovacuum_multixact_freeze_age (10000000) which might not be honored by this calculation, so not sure if that can impact the system performance in some cases where it is currently working sane. 4. Can you please share results that can show improvement with current patch versus un-patched master? 5. + /* + * TODO: In future, could oldestMultiXactMemberOffset be stored in shmem, + * pg_controdata, alongside oldestMultiXactId? + */ You might want to write the comment as: XXX: We can store oldestMultiXactMemberOffset in shmem, pg_controldata alongside oldestMultiXactId? 6. + * Returns vacuum_multixact_freeze_max_age, adjusted down to prevent excessive use + * of addressable multixact member space if required. I think here you mean autovacuum_multixact_freeze_max_age? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Jeff Janes
Date:
On Tue, Apr 28, 2015 at 4:13 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Jeff Janes wrote: > > Starting with commit b69bf30b9bfacafc733a9ba7 and continuing to this > > just-described commit, I can no longer upgrade from a 9.2.10 database > using > > pg_upgrade. > > Here's a patch, but I don't like it too much. Will think more about it, > probably going to push something tomorrow. > It looks like that patch is targeted to 9.4 branch. I couldn't readily get it to apply on HEAD. I tested it on 9.4, and it solved the problem there. Thanks, Jeff
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Amit Kapila
Date:
On Tue, Apr 28, 2015 at 11:24 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > Alvaro Herrera wrote: > > > Pushed. I chose find_multixact_start() as a name for this function. > I have done a test to ensure that the latest change has fixed the reported problem, and below are the results; to me it looks like the reported problem is fixed. I have used the test (explode_mxact_members) developed by Thomas to reproduce the problem, with one transaction started in a session. After running the test for 3~4 hours with parameters as explode_mxact_members 500 35000, I could see warning messages like the ones below (before the fix there were no such messages and the test completed, but it had corrupted the database): WARNING: database with OID 1 must be vacuumed before 358 more multixact members are used HINT: Execute a database-wide VACUUM in that database, with reduced vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age settings. WARNING: database with OID 1 must be vacuumed before 310 more multixact members are used HINT: Execute a database-wide VACUUM in that database, with reduced vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age settings. WARNING: database with OID 1 must be vacuumed before 261 more multixact members are used HINT: Execute a database-wide VACUUM in that database, with reduced vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age settings. WARNING: database with OID 1 must be vacuumed before 211 more multixact members are used HINT: Execute a database-wide VACUUM in that database, with reduced vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age settings. WARNING: database with OID 1 must be vacuumed before 160 more multixact members are used HINT: Execute a database-wide VACUUM in that database, with reduced vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age settings. explode_mxact_members: explode_mxact_members.c:38: main: Assertion `PQresultStatus(res) == PGRES_TUPLES_OK' failed. After this I set vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age to zero and then performed VACUUM FREEZE on template1 and postgres, followed by a manual CHECKPOINT. I could see the values below in pg_database. postgres=# select oid,datname,datminmxid from pg_database; oid | datname | datminmxid -------+-----------+------------ 1 | template1 | 17111262 13369 | template0 | 17111262 13374 | postgres | 17111262 (3 rows) Again I started the test as ./explode_mxact_members 500 35000, but it immediately failed: 500 sessions connected... Loop 0... WARNING: database with OID 13369 must be vacuumed before 12 more multixact members are used HINT: Execute a database-wide VACUUM in that database, with reduced vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age settings. WARNING: database with OID 13369 must be vacuumed before 11 more multixact members are used HINT: Execute a database-wide VACUUM in that database, with reduced vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age settings. WARNING: database with OID 13369 must be vacuumed before 9 more multixact members are used HINT: Execute a database-wide VACUUM in that database, with reduced vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age settings. explode_mxact_members: explode_mxact_members.c:38: main: Assertion `PQresultStatus(res) == PGRES_TUPLES_OK' failed. 
Now it was confusing to me why it failed the next time even though I had run VACUUM FREEZE and CHECKPOINT, but then I waited for a minute or two and ran VACUUM FREEZE with the command below: ./vacuumdb -a -F vacuumdb: vacuuming database "postgres" vacuumdb: vacuuming database "template1" Here I verified that all files except one were deleted. After that, when I restarted the test, it went perfectly fine and never led to any warning messages, probably because the values for vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age were zero. I am still not sure why it took some time to clean the members directory and resume the test after running VACUUM FREEZE and CHECKPOINT. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Thomas Munro
Date:
On Wed, Apr 29, 2015 at 11:41 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > 1. It seems that you are using > autovacuum_multixact_freeze_max_age_adjusted() > only at couple of places, like it is not used in below calc: > > vacuum_set_xid_limits() > { > .. > mxid_freezemin = Min(mxid_freezemin, > autovacuum_multixact_freeze_max_age / 2); > .. > } > > What is the reason of using this calculation at some places and > not at other places? You're right, I will review the other places where that variable is used and come back on this point. > 2. > @@ -2684,8 +2719,8 @@ relation_needs_vacanalyze(Oid relid, > : autovacuum_freeze_max_age; > > multixact_freeze_max_age = (relopts && relopts->multixact_freeze_max_age >>= 0) > - ? Min(relopts->multixact_freeze_max_age, > autovacuum_multixact_freeze_max_age) > - : autovacuum_multixact_freeze_max_age; > + ? Min(relopts->multixact_freeze_max_age, > autovacuum_multixact_freeze_max_age_adjusted()) > + : autovacuum_multixact_freeze_max_age_adjusted(); > > > It seems that it will try to read from offset file for each > relation which might or might not be good, shall we try to > cache the oldestMultiXactMemberOffset? True. I will look into this. > 3. currently there is some minimum limit of autovacuum_multixact_freeze_age > (10000000) > which might not be honored by this calculation, so not sure if that can > impact the > system performance in some cases where it is currently working sane. The reason why we need to be able to set the effective freeze age below that minimum in cases of high member data consumption rates is that you could hit the new member space wraparound prevention error before you consume anywhere near that many multixact IDs. That minimum may well be entirely reasonable if the only thing you're worried about is multixact ID wraparound prevention. For example, my test program eats an average of 250 members per multixact ID when run with 500 sessions (each loop creates 500 multixact IDs having 1, 2, 3, ..., 500 members). At that rate, you'll run out of addressable member space after 2^32 / 250 = 17,179,869 multixact IDs. To prevent an error condition using only the existing multixact ID wraparound prevention machinery, we need to have an effective max table age (so that autovacuum wakes up and scans all tables) and min freeze age (so that it actually freezes the tuples) below that number. So we have to ignore the GUC minimum in this situation. > 4. Can you please share results that can show improvement > with current patch versus un-patched master? Here is my test procedure: initdb, default settings except max_connections = 600 and shared_buffers = 1280MB, "CREATE TABLE foo AS SELECT 42 AS id;", start monitor.sh (attached) and then run explode_mxact_members.c (attached to my first message in this thread) with arguments 500 100000. Here are the edited highlights of the result for the patched server (the complete results are attached): time segments usage_fraction usage_kb oldest_mxid next_mxid next_offset 10:37:01 1 0 16 1 1 0 10:38:01 656 .0079 168368 1 1 0 10:39:01 1426 .0173 366112 1 190619 47845119 10:40:01 2201 .0268 565080 1 384730 96566980 10:41:01 2979 .0363 764760 1 578342 145163592 ... 
11:58:57 60455 .7368 15529256 1 12601248 3162912998 11:59:59 61184 .7457 15717760 1 12601248 3162912998 12:01:01 61918 .7547 15906288 1 12794860 3211509610 12:02:03 2178 .0265 559360 12610230 12988472 3260106222 <-- oldest_mxid moved 12:03:03 2936 .0357 753936 12610230 13182583 3308828083 12:04:03 3702 .0451 950704 12610230 13376694 3357549944 ... 13:18:23 58236 .7098 14956680 12610230 24623156 1885444610 13:19:23 58965 .7187 15143520 12610230 24817267 1934166471 13:20:24 59690 .7275 15329864 12610230 25011378 1982888332 13:21:24 59709 .7278 15334664 12758932 25204990 2031484944 <-- oldest_mxid moved 13:22:25 60447 .7367 15524464 12758932 25204990 2031484944 13:23:25 60836 .7415 15624192 12833782 25399101 2080206805 <-- oldest_mxid moved 13:24:26 25068 .3055 6438312 20447005 25592713 2128803417 13:25:26 25826 .3147 6632888 20447005 25786824 2177525278 13:26:26 26583 .3240 6827456 20447005 25980935 2226247139 ... 14:10:42 59115 .7205 15182368 20447005 32767834 3929758788 14:11:42 59836 .7293 15367912 20447005 32767834 3929758788 14:12:43 60571 .7383 15556680 20447005 32961446 3978355400 14:13:43 38557 .4699 9902472 25189521 33155557 4027077261 <-- oldest_mxid moved 14:14:44 39267 .4786 10084408 25189521 33349169 4075673873 14:15:44 39975 .4872 10266816 25189521 33349169 4075673873 ... 14:43:56 59805 .7289 15359072 25189521 37615619 851585527 14:44:56 60492 .7373 15536024 25189521 37615619 851585527 14:45:57 60468 .7370 15529536 25342215 37809231 900182139 14:46:58 48309 .5888 12406904 28023821 38003342 948904000 <-- oldest_mxid moved 14:47:58 49010 .5973 12586784 28023821 38196954 997500612 14:48:58 49716 .6059 12768152 28023821 38391065 1046222473 ... 15:02:04 58882 .7177 15122600 28023821 40136068 1484218226 15:03:04 59588 .7263 15303856 28023821 40330179 1532940087 15:04:05 60293 .7349 15484592 28023821 40523791 1581536699 15:05:06 24657 .3005 6332800 35601635 40717902 1630258560 <-- oldest_mxid moved 15:06:06 25397 .3095 6522664 35601635 40717902 1630258560 15:07:06 26140 .3186 6713296 35601635 40912013 1678980421 ... 15:52:22 58685 .7153 15071912 35601635 47698912 3382492070 15:53:23 59403 .7240 15255936 35601635 47892524 3431088682 15:54:23 60118 .7327 15439648 35601635 48086635 3479810543 <-- oldest_mxid moved 15:55:24 50415 .6145 12947712 37776297 48280247 3528407155 15:56:24 51144 .6234 13135568 37776297 48280247 3528407155 15:57:25 51869 .6322 13321872 37776297 48474358 3577129016 ... Observations: 1. Sometimes the values don't change from minute to minute, presumably because there hasn't been a checkpoint to update pg_controldata on disk, but hopefully we can still see what's going on here despite the slight lag in the data. 2. We get to somewhere in the 73-75% SLRU used range before wraparound vacuums are triggered. We probably need to spread things out more that that. 3. 
When the autovacuum runs, it advances oldest_mxid by different amounts each time; that's because I'm using the adjusted freeze max age (the max age of a table before it gets a wraparound vacuum) as our freeze min age (the max age for individual tuples before they're frozen) here: @@ -1931,7 +1964,9 @@ do_autovacuum(void) { default_freeze_min_age = vacuum_freeze_min_age; default_freeze_table_age = vacuum_freeze_table_age; - default_multixact_freeze_min_age = vacuum_multixact_freeze_min_age; + default_multixact_freeze_min_age = + Min(vacuum_multixact_freeze_min_age, + autovacuum_multixact_freeze_max_age_adjusted()); default_multixact_freeze_table_age = vacuum_multixact_freeze_table_age; } Without that change, autovacuum would trigger repeatedly as we got near 75% SLRU usage but not freeze anything, because default_multixact_freeze_min_age was higher than the age of any tuples (which had only made it to an age of around ~12 million; actually it's not exactly the tuple age per se... I don't fully understand the treatment of locker and updater multixact IDs in the vacuum code, HeapTupleSatisfiesVacuum and heap_freeze_tuple etc. yet, so I'm not sure exactly how that value translates into vacuum work, but I can see experimentally that a low multixact freeze min age is needed to get relminmxid moved forward). It's good that freeze table age ramps down so that the autovacuum launcher trigger point jumps around a bit and we spread the autovacuum launches over time, but it's not great that we finish up truncating different amounts of multixacts and associated SLRU each time. We could instead use a freeze min age of 0 to force freezing of *all* tuples if this is a member-space-wraparound-prevention vacuum (that is, if autovacuum_multixact_freeze_max_age != autovacuum_multixact_freeze_max_age_adjusted()). IIUC that'd be like running VACUUM FREEZE and would space out our wraparound vacuums as far apart as possible since it would trim SLRU space as much as possible when a wraparound vacuum runs, but it would presumably generate more writes. Maybe it doesn't matter that we drop different amounts of multixact history each time, as long as the long-term behaviour works. Thoughts? There is less to say about the results with an unpatched server: it drives in a straight line for a while, and then crashes into a wall (i.e. the new error preventing member wraparound), which I see you have also reproduced. It's used up all of the circular member space, but only has around 17 million multixacts so autovacuum can't help you (it's not even possible to set autovacuum_multixact_freeze_max_age below 100 million), so to get things moving again you need to manually VACUUM FREEZE all databases including template databases. 
pg_multixact/offsets would be 50% or 100% bigger (an extra byte or two per multixact), but it's not very big. pg_multixact/members would be no bigger for any workload that currently works without hitting the wraparound error, but could grow bigger if needed. -- Thomas Munro http://www.enterprisedb.com
Attachment
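To make the 2^32 / 250 budget arithmetic above easy to redo for other workloads, here is a trivial standalone sketch (the 250-members-per-multixact figure is specific to the test program described above):

#include <stdio.h>

int
main(void)
{
    const double member_space = 4294967296.0;   /* 2^32 member addresses */
    double members_per_mxid;

    /* member space is exhausted after 2^32 / N multixacts at N members each */
    for (members_per_mxid = 2; members_per_mxid <= 512; members_per_mxid *= 2)
        printf("avg %3.0f members/multixact -> space gone after %10.0f multixacts\n",
               members_per_mxid, member_space / members_per_mxid);
    return 0;
}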
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Robert Haas
Date:
On Tue, Apr 28, 2015 at 7:13 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Jeff Janes wrote: >> Starting with commit b69bf30b9bfacafc733a9ba7 and continuing to this >> just-described commit, I can no longer upgrade from a 9.2.10 database using >> pg_upgrade. > > Here's a patch, but I don't like it too much. Will think more about it, > probably going to push something tomorrow. What don't you like about it? We should get something committed here; it's not good for the back-branches to be in a state where pg_upgrade will break. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Alvaro Herrera
Date:
Robert Haas wrote: > On Tue, Apr 28, 2015 at 7:13 PM, Alvaro Herrera > <alvherre@2ndquadrant.com> wrote: > > Jeff Janes wrote: > >> Starting with commit b69bf30b9bfacafc733a9ba7 and continuing to this > >> just-described commit, I can no longer upgrade from a 9.2.10 database using > >> pg_upgrade. > > > > Here's a patch, but I don't like it too much. Will think more about it, > > probably going to push something tomorrow. > > What don't you like about it? We should get something committed here; > it's not good for the back-branches to be in a state where pg_upgrade > will break. Yeah, I managed to find a real fix which I will push shortly. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Alvaro Herrera
Date:
Jeff Janes wrote: > On Tue, Apr 28, 2015 at 4:13 PM, Alvaro Herrera <alvherre@2ndquadrant.com> > wrote: > > > Jeff Janes wrote: > > > Starting with commit b69bf30b9bfacafc733a9ba7 and continuing to > > > this just-described commit, I can no longer upgrade from a 9.2.10 > > > database using pg_upgrade. > > > > Here's a patch, but I don't like it too much. Will think more about it, > > probably going to push something tomorrow. > > It looks like that patch is targeted to 9.4 branch. I couldn't readily get > it to apply on HEAD. I tested it on 9.4, and it solved the problem there. Yeah, I wrote it in 9.3. However, it was wrong; or at least there's a better way to formulate it, and the new formulation applies without conflict from 9.3 to master. So I pushed that instead. Thanks! -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Thomas Munro
Date:
On Thu, Apr 30, 2015 at 5:17 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Wed, Apr 29, 2015 at 11:41 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> 1. It seems that you are using >> autovacuum_multixact_freeze_max_age_adjusted() >> only at couple of places, like it is not used in below calc: >> >> vacuum_set_xid_limits() >> { >> .. >> mxid_freezemin = Min(mxid_freezemin, >> autovacuum_multixact_freeze_max_age / 2); >> .. >> } >> >> What is the reason of using this calculation at some places and >> not at other places? > > You're right, I will review the other places where that variable is > used and come back on this point. Those other places are for capping the effective table and tuple multixact freeze ages for manual vacuums, so that manual vacuums (say in nightly cronjobs) get a chance to run wraparound scans before autovacuum kicks in at a less convenient time. So, yeah, I think we want to incorporate member wraparound prevention into that logic, and I will add that in the next version of the patch. >> 2. >> @@ -2684,8 +2719,8 @@ relation_needs_vacanalyze(Oid relid, >> : autovacuum_freeze_max_age; >> >> multixact_freeze_max_age = (relopts && relopts->multixact_freeze_max_age >>>= 0) >> - ? Min(relopts->multixact_freeze_max_age, >> autovacuum_multixact_freeze_max_age) >> - : autovacuum_multixact_freeze_max_age; >> + ? Min(relopts->multixact_freeze_max_age, >> autovacuum_multixact_freeze_max_age_adjusted()) >> + : autovacuum_multixact_freeze_max_age_adjusted(); >> >> >> It seems that it will try to read from offset file for each >> relation which might or might not be good, shall we try to >> cache the oldestMultiXactMemberOffset? > > True. I will look into this. The version attached reuses the value by passing it as a new parameter to the functions relation_needs_vacanalyze and table_recheck_autovac. [...] >> 4. Can you please share results that can show improvement >> with current patch versus un-patched master? > [...] > 2. We get to somewhere in the 73-75% SLRU used range before > wraparound vacuums are triggered. We probably need to spread things > out more that that. The version attached should spread the work out a lot better. Instead of using a proportion of your autovacuum_multixact_freeze_max_age, it uses a proportion of the number of active multixacts (see compute_max_multixact_age_to_avoid_member_wrap). > 3. When the autovacuum runs, it advances oldest_mxid by different > amounts each time; that's because I'm using the adjusted freeze max > age (the max age of a table before it gets a wraparound vacuum) as our > freeze min age (the max age for individual tuples before they're > frozen) here: > > @@ -1931,7 +1964,9 @@ do_autovacuum(void) > { > default_freeze_min_age = vacuum_freeze_min_age; > default_freeze_table_age = vacuum_freeze_table_age; > - default_multixact_freeze_min_age = vacuum_multixact_freeze_min_age; > + default_multixact_freeze_min_age = > + Min(vacuum_multixact_freeze_min_age, > + autovacuum_multixact_freeze_max_age_adjusted()); > default_multixact_freeze_table_age = vacuum_multixact_freeze_table_age; > } That was just completely wrong... removed, and replaced with code that sets multixact_freeze_min_age to zero in table_recheck_autovac in this scenario. I will post another version with some more test results soon, but wanted to share this WIP patch and respond to the above questions. -- Thomas Munro http://www.enterprisedb.com
Attachment
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Robert Haas
Date:
On Fri, May 1, 2015 at 6:51 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > Those other places are for capping the effective table and tuple > multixact freeze ages for manual vacuums, so that manual vacuums (say > in nightly cronjobs) get a chance to run a wraparound scans before > autovacuum kicks in at a less convenient time. So, yeah, I think we > want to incorporate member wraparound prevention into that logic, and > I will add that in the next version of the patch. +1. On a quick read-through of the patch, the biggest thing that jumped out at me was that it only touches the autovacuum logic. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Thomas Munro
Date:
On Sat, May 2, 2015 at 7:08 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, May 1, 2015 at 6:51 AM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> Those other places are for capping the effective table and tuple >> multixact freeze ages for manual vacuums, so that manual vacuums (say >> in nightly cronjobs) get a chance to run a wraparound scans before >> autovacuum kicks in at a less convenient time. So, yeah, I think we >> want to incorporate member wraparound prevention into that logic, and >> I will add that in the next version of the patch. > > +1. On a quick read-through of the patch, the biggest thing that > jumped out at me was that it only touches the autovacuum logic. Here's a new version which sets up the multixact parameters in ExecVacuum for regular VACUUM commands just like it does for autovacuum if needed. When computing max_multixact_age_to_avoid_member_wrap for a manual vacuum, it uses lower constants, so that any manually scheduled vacuums get a chance to deal with some of this problem before autovacuum has to. Here are the arbitrary constants currently used: at 50% member address space usage, autovacuum starts wraparound scan of tables with the oldest active multixacts, and then younger ones as the usage increases, until at 75% usage it vacuums with multixact_freeze_table_age = 0; for manual VACUUM those numbers are halved so that it has a good head start. Halving the thresholds so much lower for manual vacuums may be too much of a head start, but if we give it only a small head start, it seems to me that you'd finish up with slim odds of actually getting the wraparound scans done by your scheduled vacuum job. A head start of 25% of the usage before autovacuum starts on tables with the oldest relminmxids creates a target big enough that you might hit it with your vacuum cronjob. Also, as far as I can tell, manual vacuums will only help you get the wraparound scans on your *connectable* databases done at a time that suits you better, you'll still be dependent on autovacuum to deal with non-connectable databases, in other words template0. In practice I guess if you run vacuumdb -a at midnight you'll see all the pg_database.datminmxid values advance except for template0's, and then some time later, most likely during busy load time producing many multixact members, member space usage will finally hit 50%, autovacuum will very quickly process template0, the cluster-wide oldest mxid will finally advance, and then segment files will be deleted at the next checkpoint. Or am I missing something? Also attached is the output of the monitor.sh script posted upthread, while running explode_mxact_members.c. It looks better than the last results to me: whenever usage reaches 50%, autovacuum advances things such that usage drops right back to 0% (because it now uses multixact_freeze_min_age = 0) , and the system will happily chug on forever. What this test doesn't really show adequately is that if you had a lot of different tables and databases with different relminmxid values, they'd be vacuumed at different times. I should probably come up with a way to demonstrate that... -- Thomas Munro http://www.enterprisedb.com
Attachment
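The threshold scheme described in the message above can be sketched as follows. The 50%/75% autovacuum thresholds and their halved manual-VACUUM counterparts come from the description, but the linear interpolation and every name here are illustrative assumptions, not the patch's actual compute_max_multixact_age_to_avoid_member_wrap():

#include <stdint.h>
#include <stdio.h>

/*
 * Returns -1 when no member-space pressure exists; otherwise a multixact age
 * cutoff that shrinks from "all currently active multixacts" down to 0 as
 * usage climbs, so tables with the oldest relminmxid are selected first.
 */
static long
max_age_to_avoid_member_wrap(double member_usage,        /* 0.0 .. 1.0 */
                             uint32_t active_multixacts, /* next mxid - oldest mxid */
                             int is_manual_vacuum)
{
    /* thresholds per the description above; halved for manual VACUUM */
    double trigger = is_manual_vacuum ? 0.25 : 0.50;
    double maximum = is_manual_vacuum ? 0.375 : 0.75;
    double fraction;

    if (member_usage <= trigger)
        return -1;                /* no special action needed */
    if (member_usage >= maximum)
        return 0;                 /* vacuum every table with any old multixacts */

    fraction = (maximum - member_usage) / (maximum - trigger);
    return (long) (active_multixacts * fraction);
}

int
main(void)
{
    int pct;

    for (pct = 45; pct <= 80; pct += 5)
        printf("member usage %.2f -> multixact age cutoff %ld\n",
               pct / 100.0, max_age_to_avoid_member_wrap(pct / 100.0, 10000000, 0));
    return 0;
}

Per the rest of the thread, a caller that receives a value >= 0 would cap its multixact freeze table age with it and also force multixact_freeze_min_age to 0 so the triggered scan freezes as much as it can.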
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Amit Kapila
Date:
On Thu, Apr 30, 2015 at 10:47 AM, Thomas Munro < thomas.munro@enterprisedb.com> wrote: > > On Wed, Apr 29, 2015 at 11:41 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > 3. currently there is some minimum limit of autovacuum_multixact_freeze_age > > (10000000) > > which might not be honored by this calculation, so not sure if that can > > impact the > > system performance in some cases where it is currently working sane. > > The reason why we need to be able to set the effective freeze age > below that minimum in cases of high member data consumption rates is > that you could hit the new member space wraparound prevention error > before you consume anywhere near that many multixact IDs. That > minimum may well be entirely reasonable if the only thing you're > worried about is multixact ID wraparound prevention. > > For example, my test program eats an average of 250 members per > multixact ID when run with 500 sessions (each loop creates 500 > multixact IDs having 1, 2, 3, ..., 500 members). At that rate, you'll > run out of addressable member space after 2^32 / 250 = 17,179,869 > multixact IDs. To prevent an error condition using only the existing > multixact ID wraparound prevention machinery, we need to have an > effective max table age (so that autovacuum wakes up and scans all > tables) and min freeze age (so that it actually freezes the tuples) > below that number. So we have to ignore the GUC minimum in this > situation. > I understand that point, but I mentioned so that if there is some specific reason for keeping the current minimum value, then we should evaluate that we have not broken the same by not honouring the minimum value of GUC. As far as I can see from code, there seems to be one place (refer below code) where that value is used to calculate Warning limit for multixacts and the current patch doesn't seem to have any impact on the same. SetMultiXactIdLimit() { .. multiWarnLimit = multiStopLimit - 10000000; } > ... > > Observations: > > 1. Sometimes the values don't change from minute to minute, > presumably because there hasn't been a checkpoint to update > pg_controldata on disk, but hopefully we can still see what's going on > here despite the slight lag in the data. > Yeah and I think this means that there will no advancement for oldest multixactid and deletion of files if the checkpoints are configured for a timeout value. I think there is no harm in specifying this in document if it is currently not specified. > 2. We get to somewhere in the 73-75% SLRU used range before > wraparound vacuums are triggered. We probably need to spread things > out more that that. > > 3. 
When the autovacuum runs, it advances oldest_mxid by different > amounts each time; that's because I'm using the adjusted freeze max > age (the max age of a table before it gets a wraparound vacuum) as our > freeze min age (the max age for individual tuples before they're > frozen) here: > > @@ -1931,7 +1964,9 @@ do_autovacuum(void) > { > default_freeze_min_age = vacuum_freeze_min_age; > default_freeze_table_age = vacuum_freeze_table_age; > - default_multixact_freeze_min_age = vacuum_multixact_freeze_min_age; > + default_multixact_freeze_min_age = > + Min(vacuum_multixact_freeze_min_age, > + autovacuum_multixact_freeze_max_age_adjusted()); > default_multixact_freeze_table_age = vacuum_multixact_freeze_table_age; > } > > Without that change, autovacuum would trigger repeatedly as we got > near 75% SLRU usage but not freeze anything, because > default_multixact_freeze_min_age was higher than the age of any tuples > (which had only made it to an age of around ~12 million; actually it's > not exactly the tuple age per se... I don't fully understand the > treatment of locker and updater multixact IDs in the vacuum code, > HeapTupleSatisfiesVacuum and heap_freeze_tuple etc yet so I'm not sure > exactly how that value translates into vacuum work, but I can see > experimentally that a low multixact freeze min age is needed to get > relminxmid moved forward). > > It's good that freeze table age ramps down so that the autovacuum > launcher trigger point jumps around a bit and we spread the autovacuum > launches over time, but it's not great that we finish up truncating > different amounts of multixacts and associated SLRU each time. We > could instead use a freeze min age of 0 to force freezing of *all* > tuples if this is a member-space-wraparound-prevention vacuum (that > is, if autovacuum_multixact_freeze_max_age != > autovacuum_multixact_freeze_max_age_adjusted()). We already set vacuum_multixact_freeze_min_age to half of autovacuum_multixact_freeze_max_age so that autovacuums to prevent MultiXact wraparound won't occur too frequently, as per the code below: vacuum_set_xid_limits() { .. mxid_freezemin = Min(mxid_freezemin, autovacuum_multixact_freeze_max_age / 2); Assert(mxid_freezemin >= 0); .. } Now if we set it to zero, then I think it might lead to excessive freezing and in turn more I/O without the actual need (more space for multixact members). > > There is less to say about the results with an unpatched server: it > drives in a straight line for a while, and then crashes into a wall > (ie the new error preventing member wraparound), which I see you have > also reproduced. It's used up all of the circular member space, but > only has around 17 million multixacts so autovacuum can't help you > (it's not even possible to set autovacuum_multixact_freeze_max_age > below 100 million), so to get things moving again you need to manually > VACUUM FREEZE all databases including template databases. > In my tests, on setting the vacuum multixact parameters (vacuum_multixact_freeze_table_age and vacuum_multixact_freeze_min_age) to zero, it has successfully finished the tests (no warnings, and I could see truncation of files in the members directory), so I think one might argue that in many cases one could get the available space for members by just setting appropriate values for the vacuum_multixact_* params, but I feel it is better to have some auto-adjustment algorithm like this patch is trying to do, so that even if those values are not set appropriately, it can avoid the wraparound error. 
I think the only thing we might need to be cautious about is that the new calculation should not make things worse (less aggressive) in the case of lower values for the vacuum_multixact_* parameters. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Amit Kapila
Date:
On Sat, May 2, 2015 at 11:46 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > > On Sat, May 2, 2015 at 7:08 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, May 1, 2015 at 6:51 AM, Thomas Munro > > <thomas.munro@enterprisedb.com> wrote: > >> Those other places are for capping the effective table and tuple > >> multixact freeze ages for manual vacuums, so that manual vacuums (say > >> in nightly cronjobs) get a chance to run a wraparound scans before > >> autovacuum kicks in at a less convenient time. So, yeah, I think we > >> want to incorporate member wraparound prevention into that logic, and > >> I will add that in the next version of the patch. > > > > +1. On a quick read-through of the patch, the biggest thing that > > jumped out at me was that it only touches the autovacuum logic. > > > Also attached is the output of the monitor.sh script posted upthread, > while running explode_mxact_members.c. It looks better than the last > results to me: whenever usage reaches 50%, autovacuum advances things > such that usage drops right back to 0% (because it now uses > multixact_freeze_min_age = 0) , and the system will happily chug on > forever. What this test doesn't really show adequately is that if you > had a lot of different tables and databases with different relminmxid > values, they'd be vacuumed at different times. I should probably come > up with a way to demonstrate that... > About the data, I have extracted the parts where there is a change in oldest_mxid and segments: time segments usage_fraction usage_kb oldest_mxid next_mxid next_offset 13:48:36 1 0 16 1 1 0 13:49:36 369 .0044 94752 1 1 0 .. 14:44:04 41703 .5083 10713400 1 8528909 2140755909 14:45:05 1374 .0167 352960 8573819 8722521 2189352521 .. 15:37:16 41001 .4997 10529528 8573819 17060811 4282263311 .. 15:38:16 709 .0086 182056 17132168 17254423 35892627 .. 16:57:15 41440 .5051 10644712 17132168 25592713 2128803417 .. 16:58:16 1120 .0136 287416 25695507 25786824 2177525278 Based on this data, it seems that truncation of member space as well as advancement of the oldest multixact ID happens once usage reaches 50%, and at that time the segment count drops down to almost zero. This happens repeatedly after about an hour, and in between there is no progress, which indicates that all the work happens in one go rather than in a spread-out way. Won't this choke the system with I/O when it happens? Isn't it better if we design it in such a way that the work is spread over a period of time rather than doing everything in one go? -- +int +compute_max_multixact_age_to_avoid_member_wrap(bool manual) { .. + if (members <= safe_member_count) + { + /* + * There is no danger of member wrap, so return a number that is not + * lower than autovacuum_multixact_freeze_max_age. + */ + return -1; + } .. The above code doesn't seem to match its comments. The comment says "..not lower than autovacuum_multixact_freeze_max_age", but the code then returns -1. It seems to me that here we should return autovacuum_multixact_freeze_max_age unchanged, as it was coded in the initial version of the patch. Do you have any specific reason to change it? With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Thomas Munro
Date:
On Sun, May 3, 2015 at 4:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > About data, I have extracted parts where there is a change in > oldest_mxid and segments > > time segments usage_fraction usage_kb oldest_mxid next_mxid next_offset > > 13:48:36 1 0 16 1 1 0 > 13:49:36 369 .0044 94752 1 1 0 > .. > 14:44:04 41703 .5083 10713400 1 8528909 2140755909 > > 14:45:05 1374 .0167 352960 8573819 8722521 2189352521 > .. > 15:37:16 41001 .4997 10529528 8573819 17060811 4282263311 > .. > 15:38:16 709 .0086 182056 17132168 17254423 35892627 > .. > 16:57:15 41440 .5051 10644712 17132168 25592713 2128803417 > .. > 16:58:16 1120 .0136 287416 25695507 25786824 2177525278 > > Based on this data, it seems that truncation of member space > as well as advancement of oldest multixact id happens once > it reaches 50% usage and at that time segments drops down to almost > zero. This happens repeatedly after 1 hour and in-between there > is no progress which indicates that all the work happens at > one go rather than in spreaded way. Won't this choke the system > when it happens due to I/O, isn't it better if we design it in a way such > that it is spreaded over period of time rather than doing everything at > one go? At 50% usage it starts vacuuming tables that have the very oldest mxid in the system. That is, at 50% usage, we try to select the smallest non-zero fraction of tables that can be selected based on relminmxid alone, with any luck exactly one or close to it. As usage increases, we decrease the cutoff age slowly so we start selecting tables with slightly newer relminmxid values for vacuuming too. Only if it reaches 75% usage will it vacuum everything it can and eat all your IO. In a real system, I suppose there would be lots of big tables that need to be vacuumed, the vacuuming would take some time, and they would tend to have different relminmxids so that vacuuming would be effectively spread out by this selection algorithm. I think. Perhaps we should devise a test procedure to try to see if that happens. We'd need to create a bunch of big tables and modify monitor.sh to show the relminmxid for each of them so that you could see when they are being processed -- I will look into that. Restricting ourselves to selecting tables to vacuum using their relminmxid alone makes this patch small since autovacuum already works that way. We *could* introduce code that would be able to spread out the work of vacuuming tables that happen to have identical or very close relminmxid (say by introducing some non-determinism or doing something weird based on hashing table oids and the time to explicitly spread the start of processing over time, or <your idea here>), but I didn't want to propose anything too big/complicated/clever/stupid and I suspect that the relminmxid values will tend to diverge over time (but I could be wrong about that, if they all start at 1 and then move forward in lockstep over long periods of time then what I propose is not good enough... let's see if we can find out). > -- > +int > +compute_max_multixact_age_to_avoid_member_wrap(bool manual) > { > .. > + if (members <= safe_member_count) > + { > + /* > + * There is no danger of > member wrap, so return a number that is not > + * lower than autovacuum_multixact_freeze_max_age. > + > */ > + return -1; > + } > .. > > The above code doesn't seem to match its comments. > Comment says "..not lower than autovacuum_multixact_freeze_max_age", > but then return -1. 
It seems to me here we should return unchanged > autovacuum_multixact_freeze_max_age as it was coded in the initial > version of patch. Do you have any specific reason to change it? Oops, the comment is fixed in the attached patch. In an earlier version, I was only dealing with the autovacuum case. Now that the VACUUM command also calls it, I didn't want this compute_max_multixact_age_to_avoid_member_wrap function to assume that it was being called by autovacuum code and return the autovacuum-specific GUC in the case that no special action is needed. Also, the function no longer computes a value by scaling autovacuum_multixact_freeze_max_age; it now scales the current number of active multixacts, so that we can begin selecting a small non-zero number of tables to vacuum as soon as we exceed safe_member_count as described above (whereas when we used a scaled-down autovacuum_multixact_freeze_max_age, we usually didn't select any tables at all until we scaled it down a lot, i.e. until we got close to dangerous_member_count). Finally, I wanted a special value like -1 for 'none' so that table_recheck_autovac and ExecVacuum could use a simple test >= 0 to know that they also need to set multixact_freeze_min_age to zero in the case of a member-space-triggered vacuum, so that we get maximum benefit from our table scans by freezing all relevant tuples, not just some older ones (that's why you see usage drop to almost zero each time, whereas the monitor.sh results I showed from the earlier patch trimmed usage by varying amounts, which meant that autovacuum wraparounds would need to be done again sooner). Does that make sense? -- Thomas Munro http://www.enterprisedb.com
Attachment
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Amit Kapila
Date:
On Mon, May 4, 2015 at 5:19 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > > On Sun, May 3, 2015 at 4:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > > -- > > +int > > +compute_max_multixact_age_to_avoid_member_wrap(bool manual) > > { > > .. > > + if (members <= safe_member_count) > > + { > > + /* > > + * There is no danger of > > member wrap, so return a number that is not > > + * lower than autovacuum_multixact_freeze_max_age. > > + > > */ > > + return -1; > > + } > > .. > > > > The above code doesn't seem to match its comments. > > Comment says "..not lower than autovacuum_multixact_freeze_max_age", > > but then return -1. It seems to me here we should return unchanged > > autovacuum_multixact_freeze_max_age as it was coded in the initial > > version of patch. Do you have any specific reason to change it? > > Oops, the comment is fixed in the attached patch. > > In an earlier version, I was only dealing with the autovacuum case. > Now that the VACUUM command also calls it, I didn't want this > compute_max_multixact_age_to_avoid_member_wrap function to assume that > it was being called by autovacuum code and return the > autovacuum-specific GUC in the case that no special action is needed. > Also, the function no longer computes a value by scaling > autovacuum_multixact_freeze_max_age, it now scales the current number > of active multixacts, so that we can begin selecting a small non-zero > number of tables to vacuum as soon as we exceed safe_member_count as > described above I am slightly worried that if, for scaling, we don't consider the multixact_*_age values configured by the user, vacuum/autovacuum might behave totally differently from what the user is expecting. Basically it will be dominated by member space usage and will ignore the values the user has set for the multixact_*_age parameters. One way could be to use the minimum of the value calculated based on member space and the value specified by the user for the multixact-related parameters, as suggested in points 1 and 2 (below in this mail). One more thing: I think the current calculation considers member usage; shouldn't we try to consider offset usage as well? > (whereas when we used a scaled down > autovaccum_multixact_freeze_max_age, we usually didn't select any > tables at all until we scaled it down a lot, ie until we got close to > dangerous_member_count). Finally, I wanted a special value like -1 > for 'none' so that table_recheck_autovac and ExecVacuum could use a > simple test >= 0 to know that they also need to set > multixact_freeze_min_age to zero in the case of a > member-space-triggered vacuum, so that we get maximum benefit from our > table scans by freezing all relevant tuples, not just some older ones > I think setting multixact_freeze_min_age to zero could be too aggressive for I/O. Yes, with this you can get the maximum benefit, but at the cost of increased I/O. How would you justify setting it to zero as appropriate w.r.t. the increased I/O? A few more observations: 1. @@ -2687,6 +2796,10 @@ relation_needs_vacanalyze(Oid relid, ? Min(relopts- >>multixact_freeze_max_age, autovacuum_multixact_freeze_max_age) : autovacuum_multixact_freeze_max_age; + /* Special settings if we are running out of member address space. */ + if (max_multixact_age_to_avoid_member_wrap >= 0) + multixact_freeze_max_age = max_multixact_age_to_avoid_member_wrap; + Isn't it better to use the minimum of the already computed value of multixact_freeze_max_age and max_multixact_age_to_avoid_member_wrap? 
multixact_freeze_max_age = Min(multixact_freeze_max_age, max_multixact_age_to_avoid_member_wrap); A similar change needs to be made in table_recheck_autovac(). 2. @@ -1118,7 +1197,12 @@ do_start_worker(void) /* Also determine the oldest datminmxid we will consider. */ recentMulti = ReadNextMultiXactId(); - multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age; + max_multixact_age_to_avoid_member_wrap = + compute_max_multixact_age_to_avoid_member_wrap(false); + if (max_multixact_age_to_avoid_member_wrap >= 0) + multiForceLimit = recentMulti - max_multixact_age_to_avoid_member_wrap; + else + multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age; Here also, isn't it better to use the minimum of autovacuum_multixact_freeze_max_age and max_multixact_age_to_avoid_member_wrap? 3. +int +compute_max_multixact_age_to_avoid_member_wrap(bool manual) +{ + MultiXactOffset members; + uint32 multixacts; + double fraction; + MultiXactOffset safe_member_count = MaxMultiXactOffset / 2; It is not completely clear what the more appropriate value for safe_member_count is (25% or 50%). Does anybody else have an opinion on this value? 4. Once we conclude on the final algorithm, we should update the docs as well, probably in the description at the link below: http://www.postgresql.org/docs/devel/static/routine-vacuuming.html#VACUUM-FOR-MULTIXACT-WRAPAROUND With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Thomas Munro
Date:
On Mon, May 4, 2015 at 11:49 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Sun, May 3, 2015 at 4:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> Based on this data, it seems that truncation of member space >> as well as advancement of oldest multixact id happens once >> it reaches 50% usage and at that time segments drops down to almost >> zero. This happens repeatedly after 1 hour and in-between there >> is no progress which indicates that all the work happens at >> one go rather than in spreaded way. Won't this choke the system >> when it happens due to I/O, isn't it better if we design it in a way such >> that it is spreaded over period of time rather than doing everything at >> one go? > > At 50% usage it starts vacuuming tables that have the very oldest mxid > in the system. That is, at 50% usage, we try to select the smallest > non-zero fraction of tables that can be selected based on relminmxid > alone, with any luck exactly one or close to it. As usage increases, > we decrease the cutoff age slowly so we start selecting tables with > slightly newer relminmxid values for vacuuming too. Only if it > reaches 75% usage will it vacuum everything it can and eat all your > IO. In a real system, I suppose there would be lots of big tables > that need to be vacuumed, the vacuuming would take some time, and they > would tend to have different relminmxids so that vacuuming would be > effectively spread out by this selection algorithm. I think. Perhaps > we should devise a test procedure to try to see if that happens. We'd > need to create a bunch of big tables and modify monitor.sh to show the > relminmxid for each of them so that you could see when they are being > processed -- I will look into that. After initdb, I did: postgres=# create table foo as select 42 as id; SELECT 1 postgres=# create table cat_a as select generate_series(1, 10000000); SELECT 10000000 postgres=# create table cat_b as select generate_series(1, 10000000); SELECT 10000000 postgres=# create table cat_c as select generate_series(1, 10000000); SELECT 10000000 postgres=# create table cat_d as select generate_series(1, 10000000); SELECT 10000000 Then I started monitor.sh (new version attached) and started explode_mxact_members.c and recorded the attached file monitor-output.txt. The result is not great: since relminmxid starts at 1 for all tables created in a brand new database, 3 (autovacuum_max_workers) vacuums started around the same time, and later they didn't really seem to be diverging, a bit under an hour later the same three were triggered again, and so on. I have also attached monitor-output-fail.txt, a case where the database size is too large to be vacuumed fast enough for that rate of member space consumption. It starts vacuuming with 3 workers at 50%, and hasn't finished by 100%, so the new error is raised and no more multixacts can be created. Eventually the vacuuming completes and progress can be made. The database was set up as above, but the tables have 10 times more rows. > Restricting ourselves to selecting tables to vacuum using their > relminmxid alone makes this patch small since autovacuum already works > that way. 
We *could* introduce code that would be able to spread out > the work of vacuuming tables that happen to have identical or very > close relminmxid (say by introducing some non-determinism or doing > something weird based on hashing table oids and the time to explicitly > spread the start of processing over time, or <your idea here>), but I > didn't want to propose anything too big/complicated/clever/stupid and > I suspect that the relminmxid values will tend to diverge over time > (but I could be wrong about that, if they all start at 1 and then move > forward in lockstep over long periods of time then what I propose is > not good enough... let's see if we can find out). Maybe we do need to consider something more radical, since tables that have the same relminmxid are fairly easy to produce (for example by restoring a database dump into a new database). I will try to do something that adds some noise to the signal to deal with this case (off the top of my head, something along the lines of time_in_minutes % 16 == table_oid % 16..., or adjust the multixact cutoff age with a recipe including table_oid % something, or ... not sure yet). If anyone has any better ideas, I am all ears. -- Thomas Munro http://www.enterprisedb.com
Attachment
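The OID-based "noise" floated in the message above could look something like the toy below; it is purely illustrative of the time_in_minutes % 16 == table_oid % 16 idea and nothing like it exists in the patch:

#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* a table only becomes eligible during the minute slot matching its OID */
static int
eligible_this_minute(uint32_t table_oid)
{
    uint32_t minutes = (uint32_t) (time(NULL) / 60);

    return (minutes % 16) == (table_oid % 16);
}

int
main(void)
{
    /* hypothetical table OIDs that all share the same relminmxid */
    uint32_t oids[] = {16384, 16390, 16397, 16401};
    size_t i;

    for (i = 0; i < sizeof(oids) / sizeof(oids[0]); i++)
        printf("table %u eligible for wraparound vacuum this minute: %d\n",
               (unsigned) oids[i], eligible_this_minute(oids[i]));
    return 0;
}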
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Thomas Munro
Date:
On Mon, May 4, 2015 at 6:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > [...] > One more thing, I think the current calculation considers members > usage, shouldn't we try to consider offset usage as well? Offsets are indexed by multixact ID: #define MultiXactIdToOffsetPage(xid) \ ((xid) / (MultiXactOffset) MULTIXACT_OFFSETS_PER_PAGE) #define MultiXactIdToOffsetEntry(xid) \ ((xid) % (MultiXactOffset) MULTIXACT_OFFSETS_PER_PAGE) The existing multixact wraparound prevention code is already managing the 32 bit multixact ID space. The problem with members comes about because each one of those multixact IDs can have arbitrary numbers of members, and yet the members are also addressed with a 32 bit index. So we are trying to hijack the multixact ID wraparound prevention and make it more aggressive if member space appears to be running out. (Perhaps in future there should be a 64 bit index for member indexes so that this problem disappears?) >> (whereas when we used a scaled down >> autovaccum_multixact_freeze_max_age, we usually didn't select any >> tables at all until we scaled it down a lot, ie until we got close to >> dangerous_member_count). Finally, I wanted a special value like -1 >> for 'none' so that table_recheck_autovac and ExecVacuum could use a >> simple test >= 0 to know that they also need to set >> multixact_freeze_min_age to zero in the case of a >> member-space-triggered vacuum, so that we get maximum benefit from our >> table scans by freezing all relevant tuples, not just some older ones >> > > I think setting multixact_freeze_min_age to zero could be too aggresive > for I/O. Yes with this you can get maximum benefit, but at cost of > increased I/O. How would you justify setting it to zero as appropriate > w.r.t increased I/O? I assumed that if you were already vacuuming all your tables to avoid running out of member space, you would want to freeze any tuples you possibly could to defer the next wraparound scan for as long as possible, since wraparound scans are enormously expensive. > Few more observations: > > 1. > @@ -2687,6 +2796,10 @@ relation_needs_vacanalyze(Oid relid, > ? Min(relopts- >>multixact_freeze_max_age, autovacuum_multixact_freeze_max_age) > : > autovacuum_multixact_freeze_max_age; > > + /* Special settings if we are running out of member address space. > */ > + if (max_multixact_age_to_avoid_member_wrap >= 0) > + multixact_freeze_max_age = > max_multixact_age_to_avoid_member_wrap; > + > > Isn't it better to use minimum to already computed value of > multixact_freeze_max_age and max_multixact_age_to_avoid_member_wrap? > > multixact_freeze_max_age = Min(multixact_freeze_max_age, > max_multixact_age_to_avoid_member_wrap); Except that I am using -1 as a special value. But you're right, I guess it should be like this: if (max_multixact_age_to_avoid_member_wrap >= 0) multixact_freeze_max_age = Min(multixact_freeze_max_age, max_multixact_age_to_avoid_member_wrap); > Similar change needs to be done in table_recheck_autovac() > > 2. > @@ -1118,7 +1197,12 @@ do_start_worker(void) > > /* Also determine the oldest datminmxid we will consider. 
*/ > recentMulti = ReadNextMultiXactId(); > - multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age; > + max_multixact_age_to_avoid_member_wrap = > + compute_max_multixact_age_to_avoid_member_wrap(false); > + if (max_multixact_age_to_avoid_member_wrap >= 0) > + multiForceLimit = recentMulti - max_multixact_age_to_avoid_member_wrap; > + else > + multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age; > > Here also, isn't it better to use minimum of > autovacuum_multixact_freeze_max_age > and max_multixact_age_to_avoid_member_wrap. Yeah, with the same proviso about -1. > 3. > +int > +compute_max_multixact_age_to_avoid_member_wrap(bool manual) > +{ > + MultiXactOffset members; > + uint32 multixacts; > + double fraction; > + MultiXactOffset safe_member_count = MaxMultiXactOffset / 2; > > It is not completely clear what is more appropriate value > for safe_member_count (25% or 50%). Anybody else have any > opinion on this value? > > 4. Once we conclude on final algorithm, we should update the > same in docs as well, probably in description at below link: > http://www.postgresql.org/docs/devel/static/routine-vacuuming.html#VACUUM-FOR-MULTIXACT-WRAPAROUND Agreed. -- Thomas Munro http://www.enterprisedb.com
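To connect the macros quoted above with the on-disk layout, here is a minimal standalone C sketch (not PostgreSQL source) that maps a multixact ID to its pg_multixact/offsets page and segment file name. The constants assume the default 8 kB page geometry (2048 offsets per page, 32 pages per SLRU segment), as quoted later in this thread; the member space in the error from the original report is addressed by member offset, not by multixact ID, which is exactly the distinction being made here.

#include <stdint.h>
#include <stdio.h>

#define MULTIXACT_OFFSETS_PER_PAGE  2048    /* default 8 kB pages */
#define SLRU_PAGES_PER_SEGMENT      32

int
main(void)
{
    uint32_t mxid = 400000000;      /* an example multixact ID */
    uint32_t page = mxid / MULTIXACT_OFFSETS_PER_PAGE;
    uint32_t entry = mxid % MULTIXACT_OFFSETS_PER_PAGE;
    uint32_t segment = page / SLRU_PAGES_PER_SEGMENT;

    /* SLRU segment files are named with hex digits, e.g. "0000". */
    printf("mxid %u -> offsets page %u, entry %u, offsets segment file %04X\n",
           (unsigned) mxid, (unsigned) page, (unsigned) entry,
           (unsigned) segment);
    return 0;
}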
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Amit Kapila
Date:
On Mon, May 4, 2015 at 12:42 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > > On Mon, May 4, 2015 at 6:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > > [...] > > One more thing, I think the current calculation considers members > > usage, shouldn't we try to consider offset usage as well? > > Offsets are indexed by multixact ID: > > #define MultiXactIdToOffsetPage(xid) \ > ((xid) / (MultiXactOffset) MULTIXACT_OFFSETS_PER_PAGE) > #define MultiXactIdToOffsetEntry(xid) \ > ((xid) % (MultiXactOffset) MULTIXACT_OFFSETS_PER_PAGE) > > The existing multixact wraparound prevention code is already managing > the 32 bit multixact ID space. The problem with members comes about > because each one of those multixact IDs can have arbitrary numbers of > members, and yet the members are also addressed with a 32 bit index. > So we are trying to hijack the multixact ID wraparound prevention and > make it more aggressive if member space appears to be running out. > (Perhaps in future there should be a 64 bit index for member indexes > so that this problem disappears?) > Okay, that makes sense. > >> (whereas when we used a scaled down > >> autovacuum_multixact_freeze_max_age, we usually didn't select any > >> tables at all until we scaled it down a lot, ie until we got close to > >> dangerous_member_count). Finally, I wanted a special value like -1 > >> for 'none' so that table_recheck_autovac and ExecVacuum could use a > >> simple test >= 0 to know that they also need to set > >> multixact_freeze_min_age to zero in the case of a > >> member-space-triggered vacuum, so that we get maximum benefit from our > >> table scans by freezing all relevant tuples, not just some older ones > >> > > > > I think setting multixact_freeze_min_age to zero could be too aggressive > > for I/O. Yes with this you can get maximum benefit, but at cost of > > increased I/O. How would you justify setting it to zero as appropriate > > w.r.t increased I/O? > > I assumed that if you were already vacuuming all your tables to avoid > running out of member space, I think here you mean all tables that have relminmxid less than the newly computed age (compute_max_multixact_age_to_avoid_member_wrap) > you would want to freeze any tuples you > possibly could to defer the next wraparound scan for as long as > possible, since wraparound scans are enormously expensive. > The point is valid to an extent, but if we go by this logic, then we should currently also set multixact_freeze_min_age to zero for wraparound vacuum. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Robert Haas
Date:
On Sat, May 2, 2015 at 2:16 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > Here's a new version which sets up the multixact parameters in > ExecVacuum for regular VACUUM commands just like it does for > autovacuum if needed. When computing > max_multixact_age_to_avoid_member_wrap for a manual vacuum, it uses > lower constants, so that any manually scheduled vacuums get a chance > to deal with some of this problem before autovacuum has to. Here are > the arbitrary constants currently used: at 50% member address space > usage, autovacuum starts wraparound scan of tables with the oldest > active multixacts, and then younger ones as the usage increases, until > at 75% usage it vacuums with multixact_freeze_table_age = 0; for > manual VACUUM those numbers are halved so that it has a good head > start. I think the 75% threshold for reducing multixact_freeze_table_age to zero is fine, but I don't agree with the 50% cutoff. The purpose of autovacuum_multixact_freeze_max_age is to control the fraction of the 2^32-entry offset space that can be consumed before we begin viewing the problem as urgent. We have a setting for that because it needs to be tunable, and the default value for that setting is 400 million, which is roughly 10% of the members space. That is a whole lot lower than the 50% threshold you are proposing here. Moreover, it leaves the user with no meaningful choice: if the 50% threshold consumes too much disk space, or doesn't leave enough room before we hit the wall, then the user is simply hosed. This is why I initially proposed that the member-space-consumption-percentage at which we start derating multixact_freeze_table_age should be based on autovacuum_multixact_freeze_max_age/2^32. That way, autovacuum_multixact_freeze_max_age controls not only how aggressively we try to reclaim offset space but also how aggressively we try to reclaim member space. The user can then tune the value, and the default is the same in both cases. I also think that halving the numbers for manual vacuums is arbitrary and unprecedented. The thought process isn't bad, but an autovacuum currently behaves in most respects like a manual vacuum, and I'm reluctant to make those more different. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
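As a rough back-of-the-envelope check of the proposal above (a sketch using only the default GUC value, not the patch itself):

/* Back-of-the-envelope arithmetic for the proposal above: with the default
 * autovacuum_multixact_freeze_max_age of 400 million, derating would start
 * at roughly the same fraction of the 2^32 member space.  Illustrative only. */
#include <stdio.h>

int
main(void)
{
    double  address_space = 4294967296.0;   /* 2^32 offsets or members */
    double  freeze_max_age = 400000000.0;   /* default GUC value */

    printf("derating would start at %.1f%% of member space\n",
           100.0 * freeze_max_age / address_space);    /* prints ~9.3% */
    return 0;
}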
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Robert Haas
Date:
On Sat, May 2, 2015 at 7:22 AM, Amit Kapila <amit.kapila16@gmail.com> wrote: >> 3. When the autovacuum runs, it advances oldest_mxid by different >> amounts each time; that's because I'm using the adjusted freeze max >> age (the max age of a table before it gets a wraparound vacuum) as our >> freeze min age (the max age for individual tuples before they're >> frozen) here: >> >> @@ -1931,7 +1964,9 @@ do_autovacuum(void) >> { >> default_freeze_min_age = vacuum_freeze_min_age; >> default_freeze_table_age = vacuum_freeze_table_age; >> - default_multixact_freeze_min_age = vacuum_multixact_freeze_min_age; >> + default_multixact_freeze_min_age = >> + Min(vacuum_multixact_freeze_min_age, >> + autovacuum_multixact_freeze_max_age_adjusted()); >> default_multixact_freeze_table_age = vacuum_multixact_freeze_table_age; >> } >> >> Without that change, autovacuum would trigger repeatedly as we got >> near 75% SLRU usage but not freeze anything, because >> default_multixact_freeze_min_age was higher than the age of any tuples >> (which had only made it to an age of around ~12 million; actually it's >> not exactly the tuple age per se... I don't fully understand the >> treatment of locker and updater multixact IDs in the vacuum code, >> HeapTupleSatisfiesVacuum and heap_freeze_tuple etc yet so I'm not sure >> exactly how that value translates into vacuum work, but I can see >> experimentally that a low multixact freeze min age is needed to get >> relminmxid moved forward). >> >> It's good that freeze table age ramps down so that the autovacuum >> launcher trigger point jumps around a bit and we spread the autovacuum >> launches over time, but it's not great that we finish up truncating >> different amounts of multixacts and associated SLRU each time. We >> could instead use a freeze min age of 0 to force freezing of *all* >> tuples if this is a member-space-wraparound-prevention vacuum (that >> is, if autovacuum_multixact_freeze_max_age != >> autovacuum_multixact_freeze_max_age_adjusted()). > > We already set vacuum_multixact_freeze_min_age to half of > autovacuum_multixact_freeze_max_age so that autovacuums to > prevent MultiXact wraparound won't occur too frequently as per below > code: > > vacuum_set_xid_limits() > { > .. > mxid_freezemin = Min(mxid_freezemin, > > autovacuum_multixact_freeze_max_age / 2); > Assert(mxid_freezemin >= 0); > .. > } > > Now if we set it to zero, then I think it might lead to excessive > freezing and in turn more I/O without the actual need (more space > for multixact members) That point is certainly worthy of some consideration. Letting the freeze min age get set to half of the (effective) autovacuum_multixact_freeze_max_age would certainly be more consistent with what we do elsewhere. The policy trade-off is not as straightforward as you are making it out to be, though: 1. Using a min freeze age of zero will result in half as many full-table scans, because we'll advance relminmxid twice as far each time. 2. But each one will freeze more stuff, some of which might have been updated again before the next freeze pass, so we might do more freezing in total. So either policy might win, depending on whether you care more about reducing reads (in which case you want a very low min freeze age) or about reducing writes (in which case you want a higher min freeze age). All things being equal, I'd rather stick with the existing 50% policy in the back-branches, rather than going to zero, but I'm not sure all things are equal. It matters what difference the higher value makes. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Robert Haas
Date:
On Sun, May 3, 2015 at 7:49 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > Restricting ourselves to selecting tables to vacuum using their > relminmxid alone makes this patch small since autovacuum already works > that way. We *could* introduce code that would be able to spread out > the work of vacuuming tables that happen to have identical or very > close relminmxid (say by introducing some non-determinism or doing > something weird based on hashing table oids and the time to explicitly > spread the start of processing over time, or <your idea here>), but I > didn't want to propose anything too big/complicated/clever/stupid and > I suspect that the relminmxid values will tend to diverge over time > (but I could be wrong about that, if they all start at 1 and then move > forward in lockstep over long periods of time then what I propose is > not good enough... let's see if we can find out). So, the problem of everything moving in lockstep is one we already have. It's actually a serious operational problem for relfrozenxid, because you might restore your database from pg_dump or similar and every table will have a very similar relfrozenxid and so then the anti-wraparound logic fires for all of them at the same time. There might be cases where MXIDs behave the same way, although I would think it would be less common. Anyway, solving that problem would be nice (particularly for xmin!), but we shouldn't get into that with relation to this bug fix. It's a problem, but one that will probably take a good deal of work to solve, and certainly not something we would back-patch. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Alvaro Herrera
Date:
Thomas Munro wrote: > FWIW, in some future release, I think we should consider getting a > bigger multixact member address space that wraps around at 2^48 or > 2^64 instead of 2^32, so that we can sidestep the whole business and > go back to having just xid and mxid wraparounds to worry about. > pg_multixact/offsets would be 50% or 100% bigger (an extra byte or two > per multixact), but it's not very big. pg_multixact/members would be > no bigger for any workload that currently works without hitting the > wraparound error, but could grow bigger if needed. Not sure that enlarging the addressable area to 48/64 bits is feasible, TBH. We already have many complaints that multixacts take too much disk space; we don't want to make that 2^32 times worse, not even 2^16 times worse. I don't understand why you say it'd become 1 byte bigger per multixact; it would have to be 4 more bytes (2^64) or 2 more bytes (2^48), no? If you have 150 million multixacts (the default freeze table age) that would mean about 300 or 600 MB of additional disk space, which is not insignificant: with the current system, in a database with normal multixact usage of 4 members per multixact, members/ would use about 2.8 GB, so 600 additional MB in offsets/ is large enough growth to raise some more complaints. (The 2^48 suggestion might be a tad more difficult to implement, note, because a lot of stuff relies on unsigned integer wraparound addition, and I'm not sure we can have that with a 2^48 counter. Maybe we could figure out how to make it work, but is it worth the bother?) -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
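For reference, the disk-space figures in this message can be reproduced with a quick calculation. This is an illustrative sketch only, assuming the default 8 kB page geometry (1636 members per page, as quoted later in this thread); it lands in the same ballpark as the numbers above.

/* Quick reproduction of the space estimates above.  Illustrative only;
 * assumes 1636 members per 8 kB page in pg_multixact/members. */
#include <stdio.h>

int
main(void)
{
    double  multixacts  = 150000000.0;   /* default freeze table age      */
    double  avg_members = 4.0;           /* "normal" usage in the message */
    double  page_bytes  = 8192.0;

    double  members_bytes    = multixacts * avg_members / 1636.0 * page_bytes;
    double  extra_offsets_2b = multixacts * 2.0;   /* widening to 2^48 */
    double  extra_offsets_4b = multixacts * 4.0;   /* widening to 2^64 */

    printf("members/ would use roughly %.1f GB\n", members_bytes / 1e9);
    printf("extra offsets/ for 2^48: %.0f MB\n", extra_offsets_2b / 1e6);
    printf("extra offsets/ for 2^64: %.0f MB\n", extra_offsets_4b / 1e6);
    return 0;
}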
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Alvaro Herrera
Date:
Robert Haas wrote: > Anyway, solving that problem would be nice (particularly for xmin!), > but we shouldn't get into that with relation to this bug fix. It's a > problem, but one that will probably take a good deal of work to solve, > and certainly not something we would back-patch. +1 -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Kevin Grittner
Date:
Robert Haas <robertmhaas@gmail.com> wrote: > 1. Using a min freeze age of zero will result in half as many > full-table scans, because we'll advance relminmxid twice as far > each time. > > 2. But each one will freeze more stuff, some of which might have > been updated again before the next freeze pass, so we might do > more freezing in total. > > So either policy might win, depending on whether you care more > about reducing reads (in which case you want a very low min > freeze age) or about reducing writes (in which case you want a > higher min freeze age). > > All things being equal, I'd rather stick with the existing 50% > policy in the back-branches, rather than going to zero, but I'm > not sure all things are equal. It matters what difference the > higher value makes. I really don't like the "honor the configured value of vacuum_multixact_freeze_min_age until the members SLRU gets to 50% of wraparound and then use zero" approach. It made a lot more sense to me to honor the configured value to 25% and decrease it in a linear fashion until it hit zero at 75%. It seems like maybe we weren't aggressive enough in the dynamic adjustment of autovacuum_multixact_freeze_max_age, but I'm not clear why fixing that required the less gradual adjustment of the *_min_age setting. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Alvaro Herrera
Date:
So I might have understood an earlier description of the proposed solution all wrong, or this patch was designed without consideration of that description. What I thought would happen is that all freeze ages would get multiplied by some factor <= 1, depending on the space used up by members. If members space usage is low enough, factor would remain at 1 so things would behave as today. If members space usage is larger than X, the factor decreases smoothly and this makes freeze_min_age and freeze_max_age decrease smoothly as well, for all vacuums equally. For instance, we could choose a method to compute X based on considering that a full 2^32 storage area for members is enough to store one vacuum_multixact_freeze_table_age cycle of multixacts. The default value of this param is 150 million, and 2^32/150000000 = 28; so if your average multixact size = 38, you would set the multiplier at 0.736 and your effective freeze_table_age would become 110 million and effective freeze_min_age would become 3.68 million. As a secondary point, I find variable-names-as-documentation bad practice. Please don't use a long name such as max_multixact_age_to_avoid_member_wrap; code becomes unwieldy. A short name such as safe_mxact_age preceded by a comment /* this variable is the max that avoids member wrap */ seems more palatable; side-by-side merges and all that! I don't think long function names are as problematic (though the name of your new function is still a bit too long). Please note that 9.4 and earlier do not have ExecVacuum; the determination of freeze ages is done partly in gram.y (yuck). Not sure what the patch will look like in those branches. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
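Written out as a small standalone sketch, the multiplier recipe described above produces the same numbers; the 28 and 38 are the values from the message, everything else is illustrative.

/* Sketch of the multiplier recipe described above.  Illustrative only;
 * reproduces the worked example in the message. */
#include <stdio.h>

int
main(void)
{
    double  freeze_table_age  = 150000000.0;  /* default, 150 million */
    double  freeze_min_age    = 5000000.0;    /* default, 5 million   */
    double  sustainable_size  = 28.0;         /* 2^32 / 150 million, rounded down */
    double  avg_multixact_size = 38.0;        /* observed size in the example */
    double  multiplier = 1.0;

    if (avg_multixact_size > sustainable_size)
        multiplier = sustainable_size / avg_multixact_size;   /* 0.736 */

    printf("effective freeze_table_age = %.0f million\n",
           freeze_table_age * multiplier / 1e6);      /* ~110 million  */
    printf("effective freeze_min_age   = %.2f million\n",
           freeze_min_age * multiplier / 1e6);        /* ~3.68 million */
    return 0;
}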
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Alvaro Herrera
Date:
Alvaro Herrera wrote: > For instance, we could choose a method to compute X based on considering > that a full 2^32 storage area for members is enough to store one > vacuum_multixact_freeze_table_age cycle of multixacts. The default > value of this param is 150 million, and 2^32/150000000 = 28; so if your > average multixact size = 38, you would set the multiplier at 0.736 and > your effective freeze_table_age would become 110 million and effective > freeze_min_age would become 3.68 million. Actually, apologies --- this is not what I was thinking at all. I got distracted while I was writing the previous email. My thinking was that the values would be at their normal defaults when the wraparound is distant, and the multiplier would start to become slightly less than 1 as the counter moves towards wraparound; by the time we're at an emergency, i.e. we reach max_freeze_age, the values naturally become zero (or perhaps just before we reach max_freeze_age, the values are 50% of their normal values, so the drop to zero is not as dramatic). Since this is gradual, the behavior is not as jumpy as in the proposed patch. Anyway this is in line with what Kevin is saying elsewhere: we shouldn't just use the normal values all the time just up to the freeze_max_age point; there should be some gradual ramp-up. Perhaps we can combine this with the other idea of using a multiplier connected to the average multixact size, if it doesn't become too complicated, surprising, or both. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Thomas Munro
Date:
On Tue, May 5, 2015 at 8:36 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Thomas Munro wrote: > >> FWIW, in some future release, I think we should consider getting a >> bigger multixact member address space that wraps around at 2^48 or >> 2^64 instead of 2^32, so that we can sidestep the whole business and >> go back to having just xid and mxid wraparounds to worry about. >> pg_multixact/offsets would be 50% or 100% bigger (an extra byte or two >> per multixact), but it's not very big. pg_multiact/members would be >> no bigger for any workload that currently works without hitting the >> wraparound error, but could grow bigger if needed. > > Not sure that enlarging the addressable area to 48/64 bits is feasible, > TBH. We already have many complaints that multixacts take too much disk > space; we don't want to make that 2^32 times worse, not even 2^16 times > worse. I don't understand why you say it'd become 1 byte bigger per > multixact; it would have to be 4 more bytes (2^64) or 2 more bytes > (2^48), no? If you have 150 million multixacts (the default freeze > table age) that would mean about 300 or 600 MB of additional disk space, > which is not insignificant: with the current system, in an database with > normal multixact usage of 4 members per multixact, members/ would use > about 2.8 GB, so 600 additional MB in offsets/ is large enough growth to > raise some more complaints. Right, sorry, I must have been thinking of 40 bit or 48 bit indexes when I said 1 or 2 bytes. I can't help thinking there must be a different way to do this that takes advantage of the fact that multixacts are often created by copying all the members of an existing multixact and adding one new one, so that there is a lot of duplication and churn (at least when you have a workload that generates bigger multixacts, due to the O(n^2) process of building them up xid by xid). Maybe there is a way to store a pointer to some other multixact + a new xid in a chain structure, but I don't see how to do the cleanup when you have active multixacts with backwards references to older multixacts. Maybe you could find some way to leave gaps in member space (perhaps by making member index point to member groups with space for 4 or 8 member xids), and MultiXactIdExpand could create new multixacts that point to the same member offset but a different size so they see extra members, but that would also waste disk space, be hard to synchronize and you'd need to fall back to copying the members into new member space when the spare space is filled anyway.
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Alvaro Herrera
Date:
Thomas Munro wrote: > I can't help thinking there must be a different way to do this that > takes advantage of the fact that multixacts are often created by > copying all the members of an existing multixact and adding one new > one, so that there is a lot of duplication and churn (at least when > you have a workload that generates bigger multixacts, due to the > O(n^2) process of building them up xid by xid). Yeah, Simon expressed the same thought to me some months ago, and I gave it some think-time (but not a lot of it TBH). I didn't see any way to make it workable. Normally, lockers go away reasonably quickly, so some of the original members of the multixact are disappearing all the time. Maybe one way would be to re-use a multixact you have in your local cache, as long as the only difference with the multixact you want is some locker transaction(s) that have already ended. Not sure how you would manage the cache, though. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Amit Kapila
Date:
On Tue, May 5, 2015 at 2:29 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > > Please note that 9.4 and earlier do not have ExecVacuum; the > determination of freeze ages is done partly in gram.y (yuck). Not sure > what the patch will look like in those branches. > One way to make the fix back-patchable is to consider doing the changes for Vacuum and AutoVacuum in one common path (vacuum_set_xid_limits())? However, I think we might need to distinguish whether the call is from the Vacuum or the AutoVacuum path. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Thomas Munro
Date:
Here's a new patch, with responses to several reviews. On Tue, May 5, 2015 at 8:41 AM, Kevin Grittner <kgrittn@ymail.com> wrote: > I really don't like the "honor the configured value of > vacuum_multixact_freeze_min_age until the members SLRU gets to 50% > of wraparound and then use zero" approach. It made a lot more > sense to me to honor the configured value to 25% and decrease it in > a linear fashion until it hit zero at 75%. It seems like maybe we > weren't aggressive enough in the dynamic adjustment of > autovacuum_multixact_freeze_max_age, but I'm not clear why fixing > that required the less gradual adjustment of the *_min_age setting. Ok, the new patch uses 25% as the safe threshold, and then scales multixact_freeze_table_age down from the current number of active multixacts (ie to select the minimum number of tables) progressively to 0 (to select all tables) when you reach 75% usage. It also sets multixact_freeze_min_age to half of the value it used for multixact_freeze_table_age, so that reduces progressively too. But see my response to Robert below for discussion (question) about how to make the safe threshold depend on autovacuum_multixact_freeze_max_age. On Tue, May 5, 2015 at 8:37 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Robert Haas wrote: >> Anyway, solving that problem would be nice (particularly for xmin!), >> but we shouldn't get into that with relation to this bug fix. It's a >> problem, but one that will probably take a good deal of work to solve, >> and certainly not something we would back-patch. > > +1 OK, not attempting to do anything about that. If your database has a bunch of tables with the same relminmxid (for example after a recent restore from pg_dump), you're going to get vacuums starting at the same time. If they're more scrambled up, you'll get your vacuum work spread out better, and that is just as it is for normal mxid wraparound prevention so that is OK for the purposes of this patch. On Tue, May 5, 2015 at 6:46 AM, Robert Haas <robertmhaas@gmail.com> wrote: > I think the 75% threshold for reducing multixact_freeze_table_age to > zero is fine, but I don't agree with the 50% cutoff. The purpose of > autovacuum_multixact_freeze_max_age is to control the fraction of the > 2^32-entry offset space that can be consumed before we begin viewing > the problem as urgent. We have a setting for that because it needs to > be tunable, and the default value for that setting is 400 million, > which is roughly 10% of the members space. That is a whole lot lower > than the 50% threshold you are proposing here. Moreover, it leaves > the user with no meaningful choice: if the 50% threshold consumes too > much disk space, or doesn't leave enough room before we hit the wall, > then the user is simply hosed. This is why I initially proposed that > the member-space-consumption-percentage at which we start derating > multixact_freeze_table_age should be based on > autovacuum_multixact_freeze_max_age/2^32. That way, > autovacuum_multixact_freeze_max_age controls not only how aggressively > we try to reclaim offset space but also how aggressively we try to > reclaim member space. The user can then tune the value, and the > default is the same in both cases. Ok, so if you have autovacuum_freeze_max_age = 400 million multixacts before wraparound vacuum, which is ~10% of 2^32, we would interpret that to mean 400 million multixacts OR ~10% * some_constant of member space, in other words autovacuum_freeze_max_age * some_constant members, whichever comes first. 
But what should some_constant be? If it's 1, then we are only allowed 400 million members for our 400 million multixacts, which is not enough, if I am right in assuming that multixacts (nearly?) always have 2 or more members: a setting of 400 million would trigger wraparound at 200 million multixacts (or less) for *everyone* and trigger wraparound scans for some users as soon as they upgrade. If it's 2, then anyone exceeding an average of 2 members would get their effective autovacuum_freeze_max_age lowered. If it's 8, then people using the default max age of 400 million will hit the 75% usage threshold that we have decided should trigger freeze-everything-mode just as they hit the cutoff at which we start doing anything, and for people who have raised the default value it would be after (ahh, I might have the tests the wrong way around for this scenario in MultiXactCheckMemberUsage() in the current patch...). What do we know about expected multixact sizes in the wild, and what sort of values people tend to set their autovacuum_multixact_freeze_max_age to? Does anyone have a principled suggestion for some_constant? > I also think that halving the numbers for manual vacuums is arbitrary > and unprecedented. The thought process isn't bad, but an autovacuum > currently behaves in most respects like a manual vacuum, and I'm > reluctant to make those more different. Ok, removed. On Tue, May 5, 2015 at 9:30 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > the values would be at their normal defaults when the wraparound is > distant, and the multiplier would start to become slightly less than 1 > as the counter moves towards wraparound; by the time we're at at an > emergency i.e. we reach max_freeze_age, the values naturally become zero > (or perhaps just before we reach max_freeze_age, the values were 50% of > their normal values, so the drop to zero is not as dramatic). Since > this is gradual, the behavior is not as jumpy as in the proposed patch. > > Anyway this is in line with what Kevin is saying elsewhere: we shouldn't > just use the normal values all the time just up to the freeze_max_age > point; there should be some gradual ramp-up. Ok, in the new patch multixact_freeze_min_age is ramped. > Perhaps we can combine this with the other idea of using a multiplier > connected to average size of multixact, if it doesn't become too > complicated, surprising, or both. What bothers me about this is that the average multixact size may suddenly change, for example when new application code rolls out or a nightly job runs, leading to a sudden change in member space consumption, you might get in trouble by trying to be too clever with predictions instead of just using something really simple based on thresholds. But I might not be understanding what you meant... On Mon, May 4, 2015 at 6:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote: > I think setting multixact_freeze_min_age to zero could be too aggresive > for I/O. Yes with this you can get maximum benefit, but at cost of > increased I/O. How would you justify setting it to zero as appropriate > w.r.t increased I/O? Changed back to ramped, see above. > Isn't it better to use minimum to already computed value of > multixact_freeze_max_age and max_multixact_age_to_avoid_member_wrap? Ok, done. On Tue, May 5, 2015 at 8:59 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > As a secondary point, I find variable-names-as-documentation bad > practice. 
Please don't use a long name such as > max_multixact_age_to_avoid_member_wrap; code becomes unwieldy. A short > name such as safe_mxact_age preceded by a comment /* this variable is > the max that avoids member wrap */ seems more palatable; side-by-side > merges and all that! I don't think long function names are as > problematic (though the name of your new function is still a bit too > long). I changed the function name to MultiXactCheckMemberUsage and moved into multixact.c, and changed the variable name in various places to safe_multixact_age. There were already a few comments explaining its purpose. > Please note that 9.4 and earlier do not have ExecVacuum; the > determination of freeze ages is done partly in gram.y (yuck). Not sure > what will the patch look like in those branches. Autovacuum seems to parameterize vacuum via parser statements (!) so why not just do the same if we detect that we arrived in vacuum via a user command, and it wasn't a VACUUM FREEZE? In that case I think the values arrive as -1. Like in the attached patch against 9.3 (not tested). -- Thomas Munro http://www.enterprisedb.com
Attachment
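The attached patch is not reproduced here, but the ramp described in this message can be sketched in a few lines of standalone C. The 25%/75% thresholds, the scaling from the current number of active multixacts, and the min age at half the table age are as described above; the function and variable names are illustrative, not the patch's.

/* Standalone sketch of the ramp described above: leave the configured values
 * alone below 25% member usage, then scale the effective freeze table age
 * from the current multixact count down to zero at 75% usage, with the
 * freeze min age at half of that.  Illustrative only. */
#include <stdint.h>
#include <stdio.h>

#define SAFE_FRACTION       0.25
#define DANGEROUS_FRACTION  0.75

static void
sketch_ramp(double member_usage_fraction, uint32_t active_multixacts,
            int *freeze_table_age, int *freeze_min_age)
{
    double  progress;

    if (member_usage_fraction <= SAFE_FRACTION)
    {
        *freeze_table_age = -1;     /* -1 means: keep the configured values */
        *freeze_min_age = -1;
        return;
    }

    progress = (member_usage_fraction - SAFE_FRACTION) /
               (DANGEROUS_FRACTION - SAFE_FRACTION);
    if (progress > 1.0)
        progress = 1.0;

    *freeze_table_age = (int) (active_multixacts * (1.0 - progress));
    *freeze_min_age = *freeze_table_age / 2;
}

int
main(void)
{
    int     table_age, min_age;

    /* e.g. 50% member usage, 40 million active multixacts */
    sketch_ramp(0.50, 40000000u, &table_age, &min_age);
    printf("table_age=%d min_age=%d\n", table_age, min_age);
    return 0;
}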
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Alvaro Herrera
Date:
Amit Kapila wrote: > On Tue, May 5, 2015 at 2:29 AM, Alvaro Herrera <alvherre@2ndquadrant.com> > wrote: > > > > > > Please note that 9.4 and earlier do not have ExecVacuum; the > > determination of freeze ages is done partly in gram.y (yuck). Not sure > > what will the patch look like in those branches. > > One way to make fix back-patchable is to consider doing the changes > for Vacuum and AutoVacuum in one common path (vacuum_set_xid_limits())? > However, I think we might need to distinguish whether the call is from > Vacuum or AutoVacuum path. I think it's easier if we just adjust the patch in older branches to affect the code that now lives in ExecVacuum. Trying to make all branches the same will probably make the whole thing more complicated, for no real purpose. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Robert Haas
Date:
On Tue, May 5, 2015 at 3:58 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > Ok, the new patch uses 25% as the safe threshold, and then scales > multixact_freeze_table_age down from the current number of active > multixacts (ie to select the minimum number of tables) progressively > to 0 (to select all tables) when you reach 75% usage. I definitely think that 25% is better than 50%. But see below. > Ok, so if you have autovacuum_freeze_max_age = 400 million multixacts > before wraparound vacuum, which is ~10% of 2^32, we would interpret > that to mean 400 million multixacts OR ~10% * some_constant of member > space, in other words autovacuum_freeze_max_age * some_constant > members, whichever comes first. But what should some_constant be? some_constant should be all the member space there is. So we trigger autovac if we've used more than ~10% of the offsets OR more than ~10% of the members. Why is autovacuum_multixact_freeze_max_age configurable in the first place? It's configurable so that you can set it low enough that wraparound scans complete and advance the minmxid before you hit the wall, but high enough to avoid excessive scanning. The only problem is that it only lets you configure the amount of headroom you need for offsets, not members. If you squint at what I'm proposing the right way, it's essentially that that GUC should control both of those things. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Thomas Munro
Date:
On Wed, May 6, 2015 at 9:26 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Tue, May 5, 2015 at 3:58 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: >> Ok, so if you have autovacuum_freeze_max_age = 400 million multixacts >> before wraparound vacuum, which is ~10% of 2^32, we would interpret >> that to mean 400 million multixacts OR ~10% * some_constant of member >> space, in other words autovacuum_freeze_max_age * some_constant >> members, whichever comes first. But what should some_constant be? > > some_constant should be all the member space there is. So we trigger > autovac if we've used more than ~10% of the offsets OR more than ~10% > of the members. Why is autovacuum_multixact_freeze_max_age > configurable in the first place? It's configurable so that you can set it > low enough that wraparound scans complete and advance the minmxid > before you hit the wall, but high enough to avoid excessive scanning. > The only problem is that it only lets you configure the amount of > headroom you need for offsets, not members. If you squint at what I'm > proposing the right way, it's essentially that that GUC should control > both of those things. But member space *always* grows at least twice as fast as offset space (aka active multixact IDs), because multixacts always have at least 2 members (except in some rare cases IIUC), don't they? So if we do what you just said, then we'll trigger wraparound vacuums twice as soon as we do now for everybody, even people who don't have any problem with member space management. We don't want this patch to change anything for most people, let alone everyone. So I think that some_constant should be at least 2, if we try to do it this way, in other words if you set the GUC for 10% of offset space, we also start triggering wraparounds at 20% of member space. The code in MultiXactCheckMemberSpace would just say safe_member_count = autovacuum_multixact_freeze_max_age * 2, where 2 is some_constant (this number is the average number of multixact members below which your workload will be unaffected by the new autovac behaviour). -- Thomas Munro http://www.enterprisedb.com
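Spelled out as a standalone sketch (the value 2 and the name safe_member_count come from the message above; everything else is illustrative, not the patch):

/* Sketch of the proposal above: derive safe_member_count from the GUC and a
 * constant average-multixact-size factor, and only intervene once member
 * usage exceeds it.  Illustrative only. */
#include <stdint.h>
#include <stdio.h>

#define SOME_CONSTANT  2    /* average members per multixact below which the
                             * new autovacuum logic changes nothing */

static int
member_wrap_vacuum_needed(uint64_t members_in_use,
                          uint32_t autovacuum_multixact_freeze_max_age)
{
    uint64_t safe_member_count =
        (uint64_t) autovacuum_multixact_freeze_max_age * SOME_CONSTANT;

    return members_in_use > safe_member_count;
}

int
main(void)
{
    /* default GUC of 400 million => safe_member_count of 800 million */
    printf("%d\n",
           member_wrap_vacuum_needed(UINT64_C(1000000000), 400000000u));
    return 0;
}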
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Kevin Grittner
Date:
Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Wed, May 6, 2015 at 9:26 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Tue, May 5, 2015 at 3:58 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: >>> Ok, so if you have autovacuum_freeze_max_age = 400 million multixacts >>> before wraparound vacuum, which is ~10% of 2^32, we would interpret >>> that to mean 400 million multixacts OR ~10% * some_constant of member >>> space, in other words autovacuum_freeze_max_age * some_constant >>> members, whichever comes first. But what should some_constant be? >> >> some_constant should be all the member space there is. So we trigger >> autovac if we've used more than ~10% of the offsets OR more than ~10% >> of the members. Why is autovacuum_multixact_freeze_max_age >> configurable in the first place? It's configurable so that you can set it >> low enough that wraparound scans complete and advance the minmxid >> before you hit the wall, but high enough to avoid excessive scanning. >> The only problem is that it only lets you configure the amount of >> headroom you need for offsets, not members. If you squint at what I'm >> proposing the right way, it's essentially that that GUC should control >> both of those things. > > But member space *always* grows at least twice as fast as offset space > (aka active multixact IDs), because multixacts always have at least 2 > members (except in some rare cases IIUC), don't they? So if we do > what you just said, then we'll trigger wraparound vacuums twice as > soon as we do now for everybody, even people who don't have any > problem with member space management. We don't want this patch to > change anything for most people, let alone everyone. That, I think, is what has been driving this patch away from just considering the *_multixact_* settings as applying to both the members SLRU and the offsets SLRU; that would effectively simply change the monitored resource from one to the other. (We would probably want to actually use the max of the two, just to be safe, though offsets might never actually be the trigger.) As Thomas says, that would be a big change for everyone, and not everyone necessarily *wants* their existing settings to have new and different meanings. > So I think that > some_constant should be at least 2, if we try to do it this way, in > other words if you set the GUC for 10% of offset space, we also start > triggering wraparounds at 20% of member space. But what if they configure it to start at 80% (which I *have* seen people do)? The early patches were a heuristic to attempt to allow current behavior for those not getting into trouble, and gradually ramp up aggressiveness as needed to prevent hitting the hard ERROR that now prevents wraparound. Perhaps, rather than reducing the threshold gradually, as the members SLRU approaches wraparound we could gradually shift from using offsets to members as the number we compare the thresholds to. Up to 25% of maximum members, or if offset is somehow larger, we just use offsets; else above 75% maximum members we use members; else we use a weighted average based on how far we are between 25% and 75%. It's kinda weird, but I think it gives us a reasonable way to ramp up vacuum aggressiveness from what we currently do toward what Robert proposed based on whether the workload is causing things to head for trouble. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
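A minimal sketch of the weighted-average idea just described; the 25%/75% bounds and the blend are as stated above, and the rest (names, the clamping rule) is illustrative.

/* Sketch of the blended metric described above: below 25% member usage,
 * compare thresholds against offset consumption; above 75%, against member
 * consumption; in between, against a weighted average of the two.  Both
 * inputs are fractions of their respective 2^32 address spaces. */
#include <stdio.h>

static double
blended_consumption(double offset_fraction, double member_fraction)
{
    double  weight;
    double  blended;

    if (member_fraction <= 0.25)
        weight = 0.0;
    else if (member_fraction >= 0.75)
        weight = 1.0;
    else
        weight = (member_fraction - 0.25) / 0.50;

    blended = (1.0 - weight) * offset_fraction + weight * member_fraction;

    /* Never report less than the plain offset consumption. */
    return blended > offset_fraction ? blended : offset_fraction;
}

int
main(void)
{
    /* 5% of offsets used, 50% of members used -> halfway blend */
    printf("%.3f\n", blended_consumption(0.05, 0.50));
    return 0;
}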
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Thomas Munro
Date:
On Wed, May 6, 2015 at 9:46 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Wed, May 6, 2015 at 9:26 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Tue, May 5, 2015 at 3:58 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: >>> Ok, so if you have autovacuum_freeze_max_age = 400 million multixacts >>> before wraparound vacuum, which is ~10% of 2^32, we would interpret >>> that to mean 400 million multixacts OR ~10% * some_constant of member >>> space, in other worlds autovacuum_freeze_max_age * some_constant >>> members, whichever comes first. But what should some_constant be? >> >> some_constant should be all the member space there is. So we trigger >> autovac if we've used more than ~10% of the offsets OR more than ~10% >> of the members. Why is autovacuum_multixact_freeze_max_age >> configurable in the place? It's configurable so that you can set it >> low enough that wraparound scans complete and advance the minmxid >> before you hit the wall, but high enough to avoid excessive scanning. >> The only problem is that it only lets you configure the amount of >> headroom you need for offsets, not members. If you squint at what I'm >> proposing the right way, it's essentially that that GUC should control >> both of those things. > > But member space *always* grows at least twice as fast as offset space > (aka active multixact IDs), because multixacts always have at least 2 > members (except in some rare cases IIUC), don't they? So if we do > what you just said, then we'll trigger wraparound vacuums twice as > soon as we do now for everybody, even people who don't have any > problem with member space management. We don't want this patch to > change anything for most people, let alone everyone. So I think that > some_constant should be at least 2, if we try to do it this way, in > other words if you set the GUC for 10% of offset space, we also start > triggering wraparounds at 20% of member space. The code in > MultiXactCheckMemberSpace would just say safe_member_count = > autovacum_multixact_freeze_max_age * 2, where 2 is some_constant (this > number is the average number of multixact members below which your > workload will be unaffected by the new autovac behaviour). Here is a version of the patch that uses the GUC to control where safe_member_count starts as you suggested. But it multiplies it by a scaling factor that I'm calling avg_multixact_size_threshold. The key point is this: if your average multixact size is below this number, you will see no change in autovacuum behaviour from this patch, because normal wraparound vacuums will occur before you ever exceed safe_member_count, but if your average multixact size is above this number, you'll see some extra wraparound vacuuming. As for its value, I start the bidding at 3. Unless I misunderstood, your proposal amounted to using a value of 1, but I think that is too small, because surely *everyone's* average multixact size is above 1. We don't want to change the autovacuum behaviour for everyone. I figure the value should be a smallish number of at least 2, and I speculate that that multixacts might have one of those kinds of distributions where there are lots of 2s, a few 3s, hardly any 4s etc, so that the average would be somewhere not far above 2. 3 seems to fit that description. I could be completely wrong about that, but even so, with the default GUC setting of 400 million, 3 gives you (400000000 * 3) / 2^32 = ~28%, pretty close to the 25% that people seemed to like when we were talking about a fixed constant. 
Higher numbers could work but would make us less aggressive and increase the wraparound risk, and 8 would definitely be too high, because then safe_member_count crashes into dangerous_member_count with the default GUC setting, so our ramping is effectively disabled giving us a panic mode/cliff. The new patch also tests the dangerous level *before* the safe level, so that our 75% threshold is still triggered even if you set the GUC insanely high so that safe_member_count finishes up higher than dangerous_member_count. BTW I think you can estimate the average number of members per multixact in a real database as (number of members) / (number of multixacts), where number of members = number of member segment files * 1636 * 32, and number of multixacts = number of offsets segment files * 2048 * 32. Those numbers are from these macros with a default page size: MULTIXACT_MEMBERS_PER_PAGE = 1636 MULTIXACT_OFFSETS_PER_PAGE = 2048 SLRU_PAGES_PER_SEGMENT = 32 -- Thomas Munro http://www.enterprisedb.com
Attachment
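A worked example of the estimate above; the two segment-file counts are made-up inputs (count the files under pg_multixact/members and pg_multixact/offsets in a real cluster), and the constants are the ones quoted in the message.

/* Worked example of the estimate above: average members per multixact
 * ~= (member segment files * 1636 * 32) / (offsets segment files * 2048 * 32).
 * The file counts are made-up inputs. */
#include <stdio.h>

int
main(void)
{
    double  member_segments = 480.0;     /* e.g. from ls pg_multixact/members */
    double  offset_segments = 60.0;      /* e.g. from ls pg_multixact/offsets */

    double  members    = member_segments * 1636.0 * 32.0;
    double  multixacts = offset_segments * 2048.0 * 32.0;

    printf("estimated members per multixact: %.1f\n", members / multixacts);
    return 0;
}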
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Robert Haas
Date:
On Tue, May 5, 2015 at 5:46 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > But member space *always* grows at least twice as fast as offset space > (aka active multixact IDs), because multixacts always have at least 2 > members (except in some rare cases IIUC), don't they? Oh. *facepalm* All right, so maybe the way you had it is best after all. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Robert Haas
Date:
On Tue, May 5, 2015 at 6:36 PM, Kevin Grittner <kgrittn@ymail.com> wrote: >> So I think that >> some_constant should be at least 2, if we try to do it this way, in >> other words if you set the GUC for 10% of offset space, we also start >> triggering wraparounds at 20% of member space. > > But what if they configure it to start at 80% (which I *have* seen > people do)? I might be confused here, but the upper limit for autovacuum_multixact_freeze_max_age is 2 billion, so I don't think this can ever be higher than 50%. Well, 46.5%, really, since 2^32 > 4 billion. autovacuum_freeze_max_age is similarly limited. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Robert Haas
Date:
On Tue, May 5, 2015 at 3:58 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > Here's a new patch, with responses to several reviews. Going back to this version... + * Based on the assumption that there is no reasonable way for an end user to + * configure the thresholds for this, we define the safe member count to be + * half of the member address space, and the dangerous level to be but: + const MultiXactOffset safe_member_count = MaxMultiXactOffset / 4; Those don't match. Also, we usually use #define rather than const for constants. I suggest we do that here, too. + int safe_multixact_age = MultiXactCheckMemberUsage(); + if (safe_multixact_age >= 0) Project style is to leave a blank line between these, I think. I think you need to update the comments for relation_needs_vacanalyze(). The documentation in section 23.1.5.1, "Multixacts and Wraparound", also needs updating. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Thomas Munro
Date:
On Wed, May 6, 2015 at 2:29 PM, Robert Haas <robertmhaas@gmail.com> wrote: > + * Based on the assumption that there is no reasonable way for an end user to > + * configure the thresholds for this, we define the safe member count to be > + * half of the member address space, and the dangerous level to be > > but: > > + const MultiXactOffset safe_member_count = MaxMultiXactOffset / 4; > > Those don't match. [...] Fixed/obsoleted in the attached patch. It has a dynamic safe_member_count based on scaling the GUC as described in my earlier email with the v7 patch; the behaviour with the default GUC value works out to a similar safe_member_count value, but this way it can be changed if needed, and we don't introduce any new GUCs. Also, since the GUC used in determining safe_member_count is either autovacuum_multixact_freeze_max_age or vacuum_multixact_freeze_max_age depending on which kind of vacuum it is, that is now a parameter passed into MultiXactCheckMemberUsage, so safe_member_count is no longer a constant. > [...] Also, we usually use #define rather than const for > constants. I suggest we do that here, too. Done for DANGEROUS_MEMBER_COUNT and AVG_MULTIXACT_SIZE_THRESHOLD which are still constants. > + int safe_multixact_age = MultiXactCheckMemberUsage(); > + if (safe_multixact_age >= 0) > > Project style is to leave a blank line between these, I think. Done. > I think you need to update the comments for relation_needs_vacanalyze(). Done. > The documentation in section 23.1.5.1, "Multixacts and Wraparound", > also needs updating. Done. I also rewrote some of the commentary in MultiXactCheckMemberUsage, to try to make the theory behind the ramp up algorithm clearer. I will do some more testing. Does anyone have any feedback on the choice of value for AVG_MULTIXACT_SIZE_THRESHOLD, or real world data on average multixact sizes, or see more general problems? The idea here is that you should only see autovacuum behaviour change if it's more than AVG_MULTIXACT_SIZE_THRESHOLD. -- Thomas Munro http://www.enterprisedb.com
Attachment
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Robert Haas
Date:
On Wed, May 6, 2015 at 6:26 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Wed, May 6, 2015 at 2:29 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> + * Based on the assumption that there is no reasonable way for an end user to >> + * configure the thresholds for this, we define the safe member count to be >> + * half of the member address space, and the dangerous level to be >> >> but: >> >> + const MultiXactOffset safe_member_count = MaxMultiXactOffset / 4; >> >> Those don't match. [...] > > Fixed/obsoleted in the attached patch. It has a dynamic > safe_member_count based on scaling the GUC as described in my earlier > email with the v7 patch; the behaviour with the default GUC value > works out to a similar safe_member_count value, but this way it can be > changed if needed, and we don't introduce any new GUCs. Also, since > the GUC used in determining safe_member_count is either > autovacuum_multixact_freeze_max_age or vacuum_multixact_freeze_max_age > depending on which kind of vacuum it is, that is now a parameter > passed into MultiXactCheckMemberUsage, so safe_member_count is no > longer a constant. To be honest, now that you've pointed out that the fraction of the multixact members space that is in use will always be larger, generally much larger, than the fraction of the offset space that is in use, I've kind of lost all enthusiasm for making the safe_member_count stuff dependent on autovacuum_multixact_freeze_max_age. I'm inclined to go back to 25%, the way you had it before. We could think about adding a new GUC in master, but I'm actually leaning toward the view that we should just hard-code 25% for now and consider revising it later if that proves inadequate. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Amit Kapila
Date:
On Wed, May 6, 2015 at 3:56 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > > On Wed, May 6, 2015 at 2:29 PM, Robert Haas <robertmhaas@gmail.com> wrote: A few comments: 1. + /* + * Override the multixact freeze settings if we are running out of + * member address space. + */ + if (safe_multixact_age >= 0) + { + multixact_freeze_table_age = Min(safe_multixact_age, + multixact_freeze_table_age); + /* Special settings if we are running out of member address space. */ + if (safe_multixact_age >= 0) + multixact_freeze_max_age = Min(multixact_freeze_max_age, safe_multixact_age); + Some places use safe_multixact_age as the first parameter and some places use it as the second. I think it is better to use the same order for the sake of consistency. 2. in the hope + * that different tables will be vacuumed at different times due to their + * varying relminmxid values. Does the above line in the comment on top of MultiXactCheckMemberUsage() make much sense? 3. + * we know the age of the oldest multixact in the system, so that's the + * value we want to when members is near safe_member_count. There is a typo here; it should read: so that's the value we want to *use* when .. With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Robert Haas
Date:
On Wed, May 6, 2015 at 6:45 AM, Robert Haas <robertmhaas@gmail.com> wrote: > To be honest, now that you've pointed out that the fraction of the > multixact members space that is in use will always be larger, > generally much larger, than the fraction of the offset space that is > in use, I've kind of lost all enthusiasm for making the > safe_member_count stuff dependent on > autovacuum_multixact_freeze_max_age. I'm inclined to go back to 25%, > the way you had it before. > > We could think about adding a new GUC in master, but I'm actually > leaning toward the view that we should just hard-code 25% for now and > consider revising it later if that proves inadequate. So here's a new patch, based on your latest version, which looks reasonably committable to me. I made a lot of revisions to the comments. I changed some variable and function names, too. I also reworked the function that determines the accelerated freeze threshold because I was afraid that the integer return value might not be large enough to accommodate the uint32 result of the derating calculation, which would then turn negative. The new logic is also, I believe, less vulnerable to floating-point roundoff dangers. And I tightened up the documentation. Some residual concerns: 1. Should we be installing one or more GUCs to control this behavior? I've gone back to hard-coding things so that at 25% we start triggering autovacuum and by 75% we zero out the freeze ages, because the logic you proposed in your last version looks insanely complicated to me. (I do realize that I suggested the approach, but that was before I realized the full complexity of the problem.) I now think that if we want to make this tunable, we need to create and expose GUCs for it. I'm hoping we can get by without that, but I'm not sure. 2. Doesn't the code that sets MultiXactState->multiVacLimit also need to use what I'm now calling MultiXactMemberFreezeThreshold() - or some similar logic? Otherwise, a user with autovacuum=off won't get emergency autovacuums for member exhaustion, even though they will get them for offset exhaustion. Despite those concerns I think we're headed in the right direction here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
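The reworked function in the attached patch isn't reproduced here, but the overflow concern can be illustrated with a sketch along these lines. The name MultiXactMemberFreezeThreshold comes from the message above; the body below is an assumed implementation of the 25%-to-75% derating, not the committed code.

/* Sketch of a threshold function in the spirit described above: return the
 * effective multixact freeze max age, derated from the configured GUC value
 * as member usage climbs from 25% to 75% of the 2^32 space.  Doing the
 * scaling in double and returning an unsigned value avoids the signed-int
 * overflow mentioned above.  Assumed implementation, illustrative only. */
#include <stdint.h>
#include <stdio.h>

#define MEMBER_SAFE_THRESHOLD    (UINT32_MAX / 4)        /* 25% */
#define MEMBER_DANGER_THRESHOLD  (UINT32_MAX / 4 * 3)    /* 75% */

static uint32_t
sketch_member_freeze_threshold(uint32_t members, uint32_t freeze_max_age)
{
    double  fraction;

    if (members <= MEMBER_SAFE_THRESHOLD)
        return freeze_max_age;                   /* no derating needed */

    fraction = (double) (members - MEMBER_SAFE_THRESHOLD) /
               (double) (MEMBER_DANGER_THRESHOLD - MEMBER_SAFE_THRESHOLD);
    if (fraction >= 1.0)
        return 0;                                /* freeze everything */

    return (uint32_t) (freeze_max_age * (1.0 - fraction));
}

int
main(void)
{
    /* ~70% of member space used, default GUC of 400 million */
    printf("%u\n",
           (unsigned) sketch_member_freeze_threshold(3000000000u, 400000000u));
    return 0;
}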
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Alvaro Herrera
Date:
I haven't read your patch, but I wonder if we should decrease the default value of multixact_freeze_table_age (currently 150 million). The freeze_min_age is 5 million; if freeze_table_age were a lot lower, the problem would be less pronounced. Additionally, I will backpatch commit 27846f02c176. The average size of multixacts decreases with that fix in many common cases, which greatly reduces the need for any of this in the first place. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Alvaro Herrera
Date:
Robert Haas wrote: > So here's a new patch, based on your latest version, which looks > reasonably committable to me. I think this code should also reduce the multixact_freeze_min_age value at the same time as multixact_freeze_table_age. If the table age is reduced but freeze_min_age remains high, old multixacts might still remain in the table. The default value for freeze min age is 5 million, but users may change it. Perhaps freeze min age should be set to Min(modified freeze table age, freeze min age) so that old multixacts are effectively frozen whenever a full table scan is requested. > 1. Should we be installing one or more GUCs to control this behavior? > I've gone back to hard-coding things so that at 25% we start > triggering autovacuum and by 75% we zero out the freeze ages, because > the logic you proposed in your last version looks insanely complicated > to me. (I do realize that I suggested the approach, but that was > before I realized the full complexity of the problem.) I now think > that if we want to make this tunable, we need to create and expose > GUCs for it. I'm hoping we can get by without that, but I'm not sure. I think things are complicated enough; I vote for no additional GUCs at this point. > 2. Doesn't the code that sets MultiXactState->multiVacLimit also need > to use what I'm now calling MultiXactMemberFreezeThreshold() - or some > similar logic? Otherwise, a user with autovacuum=off won't get > emergency autovacuums for member exhaustion, even though they will get > them for offset exhaustion. Yeah, it looks like it does. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Kevin Grittner
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Robert Haas wrote: >> So here's a new patch, based on your latest version, which looks >> reasonably committable to me. > > I think this code should also reduce the multixact_freeze_min_age value > at the same time as multixact_freeze_table_age. If the table age is > reduced but freeze_min_age remains high, old multixacts might still > remain in the table. The default value for freeze min age is 5 million, > but users may change it. Perhaps freeze min age should be set to > Min(modified freeze table age, freeze min age) so that old multixacts > are effectively frozen whenever a full table scan is requested. I would rather see min age reduced proportionally to table age, or at least ensure that min age is some percentage below table age. >> 1. Should we be installing one or more GUCs to control this behavior? >> I've gone back to hard-coding things so that at 25% we start >> triggering autovacuum and by 75% we zero out the freeze ages, because >> the logic you proposed in your last version looks insanely complicated >> to me. (I do realize that I suggested the approach, but that was >> before I realized the full complexity of the problem.) I now think >> that if we want to make this tunable, we need to create and expose >> GUCs for it. I'm hoping we can get by without that, but I'm not sure. > > I think things are complicated enough; I vote for no additional GUCs at > this point. +1 For one thing, we should try to have something we can back-patch, and new GUCs in a minor release seems like something to avoid, if possible. For another thing, we've tended not to put in GUCs if there is no reasonable way for a user to determine a good value, and that seems to be the case here. >> 2. Doesn't the code that sets MultiXactState->multiVacLimit also need >> to use what I'm now calling MultiXactMemberFreezeThreshold() - or some >> similar logic? Otherwise, a user with autovacuum=off won't get >> emergency autovacuums for member exhaustion, even though they will get >> them for offset exhaustion. > > Yeah, it looks like it does. +1 -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Robert Haas
Date:
On Wed, May 6, 2015 at 10:34 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >> So here's a new patch, based on your latest version, which looks >> reasonably committable to me. > > I think this code should also reduce the multixact_freeze_min_age value > at the same time as multixact_freeze_table_age. I think it does that. It sets the min age to half the value it sets for the table age, which I think is consistent with what we do elsewhere. >> 1. Should we be installing one or more GUCs to control this behavior? >> I've gone back to hard-coding things so that at 25% we start >> triggering autovacuum and by 75% we zero out the freeze ages, because >> the logic you proposed in your last version looks insanely complicated >> to me. (I do realize that I suggested the approach, but that was >> before I realized the full complexity of the problem.) I now think >> that if we want to make this tunable, we need to create and expose >> GUCs for it. I'm hoping we can get by without that, but I'm not sure. > > I think things are complicated enough; I vote for no additional GUCs at > this point. That's fine with me for now. >> 2. Doesn't the code that sets MultiXactState->multiVacLimit also need >> to use what I'm now calling MultiXactMemberFreezeThreshold() - or some >> similar logic? Otherwise, a user with autovacuum=off won't get >> emergency autovacuums for member exhaustion, even though they will get >> them for offset exhaustion. > > Yeah, it looks like it does. OK, I'm not clear how to do that correctly, exactly, but hopefully one of us can figure that out. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Thomas Munro
Date:
On Thu, May 7, 2015 at 4:23 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, May 6, 2015 at 10:34 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >>> 2. Doesn't the code that sets MultiXactState->multiVacLimit also need >>> to use what I'm now calling MultiXactMemberFreezeThreshold() - or some >>> similar logic? Otherwise, a user with autovacuum=off won't get >>> emergency autovacuums for member exhaustion, even though they will get >>> them for offset exhaustion. >> >> Yeah, it looks like it does. > > OK, I'm not clear how to do that correctly, exactly, but hopefully one > of us can figure that out. MultiXactState->multiVacLimit holds the multixact ID at which an age threshold will be crossed, so that GetNewMultiXactId can check it cheaply. But we can't predict the future multixact ID at which our member usage threshold will be crossed. We could try to estimate it based on past multixact sizes, but (as I think we already covered somewhere else) we shouldn't be trying to do that because it wouldn't handle the situation where your member space consumption rate suddenly went up, among other problems. How about this: we add oldestOffset to MultiXactState, to be set by DetermineSafeOldestOffset, and then at the place where GetNewMultiXactId checks if (!MultiXactIdPrecedes(result, MultiXactState->multiVacLimit) it could also check whether (nextOffset - MultiXactState->oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD). ReadMultiXactCounts should also use the oldestOffset value directly from shmem instead of calling find_multixact_start. -- Thomas Munro http://www.enterprisedb.com
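A rough sketch of the check proposed above, assuming the oldest member offset is cached in shared state alongside the allocation head; the struct, names, and 25% threshold are illustrative, not the actual backend code.

/*
 * Hedged sketch of the proposed cheap check: with oldestOffset kept in
 * shared state, the allocation path can measure member-space pressure
 * with plain unsigned arithmetic.  Names and the 25% threshold are
 * assumptions for illustration.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MULTIXACT_MEMBER_SAFE_THRESHOLD  (UINT32_MAX / 4)   /* assumed 25% */

struct MultiXactStateSketch
{
    uint32_t nextOffset;    /* head of the members space, advanced on allocation */
    uint32_t oldestOffset;  /* tail, advanced when the oldest multixact moves up */
};

static bool
members_need_autovacuum(const struct MultiXactStateSketch *state)
{
    /* Wraparound-safe distance from tail to head in the 2^32 circular space. */
    uint32_t in_use = state->nextOffset - state->oldestOffset;

    return in_use > MULTIXACT_MEMBER_SAFE_THRESHOLD;
}

int
main(void)
{
    struct MultiXactStateSketch s = { 3000000000u, 500000000u };

    printf("%s\n", members_need_autovacuum(&s) ? "trigger autovacuum" : "ok");
    return 0;
}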
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Robert Haas
Date:
On May 6, 2015, at 9:48 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: >> On Thu, May 7, 2015 at 4:23 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Wed, May 6, 2015 at 10:34 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >>>> 2. Doesn't the code that sets MultiXactState->multiVacLimit also need >>>> to use what I'm now calling MultiXactMemberFreezeThreshold() - or some >>>> similar logic? Otherwise, a user with autovacuum=off won't get >>>> emergency autovacuums for member exhaustion, even though they will get >>>> them for offset exhaustion. >>> >>> Yeah, it looks like it does. >> >> OK, I'm not clear how to do that correctly, exactly, but hopefully one >> of us can figure that out. > > MultiXactState->multiVacLimit holds the multixact ID at which an age > threshold will be crossed, so that GetNewMultiXactId can check it > cheaply. But we can't predict the future multixact ID at which our > member usage threshold will be crossed. We could try to estimate it > based on past multixact sizes, but (as I think we already covered > somewhere else) we shouldn't be trying to do that because it wouldn't > handle the situation where your member space consumption rate suddenly > went up, among other problems. > > How about this: we add oldestOffset to MultiXactState, to be set by > DetermineSafeOldestOffset, and then at the place where > GetNewMultiXactId checks if (!MultiXactIdPrecedes(result, > MultiXactState->multiVacLimit) it could also check whether (nextOffset > - MultiXactState->oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD). > ReadMultiXactCounts should also use the oldestOffset value directly > from shmem instead of calling find_multixact_start. That sounds pretty good. ...Robert
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Thomas Munro
Date:
On Thu, May 7, 2015 at 2:44 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On May 6, 2015, at 9:48 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: >>> On Thu, May 7, 2015 at 4:23 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Wed, May 6, 2015 at 10:34 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >>>>> 2. Doesn't the code that sets MultiXactState->multiVacLimit also need >>>>> to use what I'm now calling MultiXactMemberFreezeThreshold() - or some >>>>> similar logic? Otherwise, a user with autovacuum=off won't get >>>>> emergency autovacuums for member exhaustion, even though they will get >>>>> them for offset exhaustion. >>>> >>>> Yeah, it looks like it does. >>> >>> OK, I'm not clear how to do that correctly, exactly, but hopefully one >>> of us can figure that out. >> >> MultiXactState->multiVacLimit holds the multixact ID at which an age >> threshold will be crossed, so that GetNewMultiXactId can check it >> cheaply. But we can't predict the future multixact ID at which our >> member usage threshold will be crossed. We could try to estimate it >> based on past multixact sizes, but (as I think we already covered >> somewhere else) we shouldn't be trying to do that because it wouldn't >> handle the situation where your member space consumption rate suddenly >> went up, among other problems. >> >> How about this: we add oldestOffset to MultiXactState, to be set by >> DetermineSafeOldestOffset, and then at the place where >> GetNewMultiXactId checks if (!MultiXactIdPrecedes(result, >> MultiXactState->multiVacLimit) it could also check whether (nextOffset >> - MultiXactState->oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD). >> ReadMultiXactCounts should also use the oldestOffset value directly >> from shmem instead of calling find_multixact_start. > > That sounds pretty good. See attached patch, based on your multixact-av.patch. With autovacuum set to off, it vacuums as expected. I wonder if DetermineSafeOldestOffset is being called in all the right places to guarantee that the state is initialised. This patch will change it anyway, but I noticed that oldestOffset's computation to find the start of the segment seems wrong in master; I think it should be like this, no? @@ -2495,7 +2495,7 @@ DetermineSafeOldestOffset(MultiXactId oldestMXact) */ oldestOffset = find_multixact_start(oldestMXact); /* move back to start of the corresponding segment */ - oldestOffset -= oldestOffset / MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT; + oldestOffset -= oldestOffset % (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT); LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE); /* always leave one segment before the wraparound point */ -- Thomas Munro http://www.enterprisedb.com
Attachment
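A standalone illustration of the arithmetic in the one-line change above: rounding an offset down to the start of its segment means subtracting the remainder modulo the segment size, while the pre-fix quotient expression lands on a value with no particular meaning. The constants below are placeholders rather than the real MULTIXACT_MEMBERS_PER_PAGE and SLRU_PAGES_PER_SEGMENT values.

/*
 * Standalone demonstration of the segment-start rounding.  The constants
 * are placeholders standing in for MULTIXACT_MEMBERS_PER_PAGE and
 * SLRU_PAGES_PER_SEGMENT; only the shape of the arithmetic matters here.
 */
#include <stdint.h>
#include <stdio.h>

#define MEMBERS_PER_PAGE     1636                    /* placeholder */
#define PAGES_PER_SEGMENT    32                      /* placeholder */
#define MEMBERS_PER_SEGMENT  (MEMBERS_PER_PAGE * PAGES_PER_SEGMENT)

int
main(void)
{
    uint32_t oldestOffset = 123456789;

    /* Correct: round down to the first member offset of the segment. */
    uint32_t segment_start = oldestOffset - oldestOffset % MEMBERS_PER_SEGMENT;

    /* The pre-fix expression subtracts a quotient-derived value instead. */
    uint32_t pre_fix = oldestOffset - oldestOffset / MEMBERS_PER_PAGE * PAGES_PER_SEGMENT;

    printf("segment start: %u, pre-fix expression: %u\n",
           (unsigned) segment_start, (unsigned) pre_fix);
    return 0;
}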
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Kevin Grittner
Date:
Thomas Munro <thomas.munro@enterprisedb.com> wrote: > I noticed that oldestOffset's computation to find the start of > the segment seems wrong in master, I think it should be like > this, no? The change you show looks to me like the right way to take oldestOffset and reduce to the first on a page. Without your suggested change the calculation lands the value at a location that has no real meaning. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Robert Haas
Date:
On Thu, May 7, 2015 at 12:23 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > This patch will change it anyway, but I noticed that oldestOffset's > computation to find the start of the segment seems wrong in master, I > think it should be like this, no? > > @@ -2495,7 +2495,7 @@ DetermineSafeOldestOffset(MultiXactId oldestMXact) > */ > oldestOffset = find_multixact_start(oldestMXact); > /* move back to start of the corresponding segment */ > - oldestOffset -= oldestOffset / MULTIXACT_MEMBERS_PER_PAGE * > SLRU_PAGES_PER_SEGMENT; > + oldestOffset -= oldestOffset % (MULTIXACT_MEMBERS_PER_PAGE * > SLRU_PAGES_PER_SEGMENT); > > LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE); > /* always leave one segment before the wraparound point */ This should be committed and back-patched separately, I think. I'll go do that now. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Kevin Grittner
Date:
Kevin Grittner <kgrittn@ymail.com> wrote: > The change you show looks to me like the right way to take > oldestOffset and reduce to the first on a page. Of course, I meant the first offset on the first page of a segment. -- Kevin Grittner EDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Alvaro Herrera
Date:
Robert Haas wrote: > On Thu, May 7, 2015 at 12:23 AM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: > > This patch will change it anyway, but I noticed that oldestOffset's > > computation to find the start of the segment seems wrong in master, I > > think it should be like this, no? > This should be committed and back-patched separately, I think. I'll > go do that now. I was going to say the same. Thanks. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Robert Haas
Date:
On Thu, May 7, 2015 at 11:18 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Robert Haas wrote: >> On Thu, May 7, 2015 at 12:23 AM, Thomas Munro >> <thomas.munro@enterprisedb.com> wrote: >> > This patch will change it anyway, but I noticed that oldestOffset's >> > computation to find the start of the segment seems wrong in master, I >> > think it should be like this, no? > >> This should be committed and back-patched separately, I think. I'll >> go do that now. > > I was going to say the same. Thanks. OK, so here is what's in Thomas's latest version that's not already committed and different from my last version. It should apply on top of my multixact-av.patch, but I attach it here as a separate extract for ease of review. I think what we need to determine is whether this is the right fix for the problem of starting emergency autovacuums for member space exhaustion even when autovacuum=off. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Robert Haas
Date:
On Thu, May 7, 2015 at 11:30 AM, Robert Haas <robertmhaas@gmail.com> wrote: > OK, so here is what's in Thomas's latest version that's not already > committed and different from my last version. It should apply on top > of my multixact-av.patch, but I attach it here as a separate extract > for ease of review. I think what we need to determine is whether this > is the right fix for the problem of starting emergency autovacuums for > member space exhaustion even when autovacuum=off. Nope, I extracted that incorrectly. Second try attached. I think Thomas's question upthread is a very good one: > See attached patch, based on your multixact-av.patch. With autovacuum > set to off, it vacuums as expected. I wonder if > DetermineSafeOldestOffset is being called in all the right places to > guarantee that the state is initialised. It seems to me that the most obvious places where DetermineSafeOldestOffset() should be called are (1) at startup or after recovery, to initialize the value; and (2) each time we truncate the SLRU, to update the value. Other than that, this doesn't change. The startup calls are there, in apparently reasonable places, but it's not obvious to me how this gets called in the TruncateMultiXact path. Instead it seems to get set via the SetMultiXactIdLimit path. Maybe that's OK, but it would seem to imply that we're OK with overwriting old members information if that information was slated for truncation at the next checkpoint anyway, which seems scary. Wouldn't it be better to only advance the threshold once the checkpoint completes and the truncation is done? Maybe I'm misreading something here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Alvaro Herrera
Date:
Robert Haas wrote: > It seems to me that the most obvious places where > DetermineSafeOldestOffset() should be called are (1) at startup or > after recovery, to initialize the value; and (2) each time we truncate > the SLRU, to update the value. Other than that, this doesn't change. > The startup calls are there, in apparently reasonable places, but it's > not obvious to me how this gets called in the TruncateMultiXact path. > Instead it seems to get set via the SetMultiXactIdLimit path. Maybe > that's OK, but it would seem to imply that we're OK with overwriting > old members information if that information was slated for truncation > at the next checkpoint anyway, which seems scary. I considered this question in the previous commit: is it okay to overwrite a file that is no longer used (per the limits set by vacuum) but not yet removed (by checkpoint)? It seems to me that there is no data-loss issue with doing that -- which is why the advance-oldest code is called during vacuum and not during checkpoint. > Wouldn't it be better to only advance the threshold once the > checkpoint completes and the truncation is done? *shrug* There's also a question about this function's "else" branch: void MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB) { if (MultiXactIdPrecedes(MultiXactState->oldestMultiXactId, oldestMulti)) SetMultiXactIdLimit(oldestMulti, oldestMultiDB); else DetermineSafeOldestOffset(oldestMulti); } The reason the "else" is there is that I chickened out about not calling DetermineSafeOldestOffset() (which SetMultiXactIdLimit calls) when the oldestMulti does not advance; what if in the previous run we failed to get to this point for whatever reason? The calls to SetMultiXactIdLimit are seldom enough that it seemed better to be protected against previous runs not finishing. Then again, I might be worrying over nothing and this is just dead code. Also, if we have extra calls in other places after Thomas' patch, perhaps this one is not necessary. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Robert Haas
Date:
On Thu, May 7, 2015 at 2:00 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Robert Haas wrote: >> It seems to me that the most obvious places where >> DetermineSafeOldestOffset() should be called are (1) at startup or >> after recovery, to initialize the value; and (2) each time we truncate >> the SLRU, to update the value. Other than that, this doesn't change. >> The startup calls are there, in apparently reasonable places, but it's >> not obvious to me how this gets called in the TruncateMultiXact path. >> Instead it seems to get set via the SetMultiXactIdLimit path. Maybe >> that's OK, but it would seem to imply that we're OK with overwriting >> old members information if that information was slated for truncation >> at the next checkpoint anyway, which seems scary. > > I considered this question in the previous commit: is it okay to > overwrite a file that is no longer used (per the limits set by vacuum) > but not yet removed (by checkpoint)? It seems to me that there is no > data-loss issue with doing that -- which is why the advance-oldest code > is called during vacuum and not during checkpoint. I think the main question is whether truncation will be smart enough to zap the non-overwritten part of the old stuff but not the part that did get overwritten. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Alvaro Herrera
Date:
Robert Haas wrote: > On Thu, May 7, 2015 at 2:00 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > I considered this question in the previous commit: is it okay to > > overwrite a file that is no longer used (per the limits set by vacuum) > > but not yet removed (by checkpoint)? It seems to me that there is no > > data-loss issue with doing that -- which is why the advance-oldest code > > is called during vacuum and not during checkpoint. > > I think the main question is whether truncation will be smart enough > to zap the non-overwritten part of the old stuff but not the part that > did get overwritten. TruncateMultiXact bases its removal of member files on the point set by checkpoint, MultiXactState->lastCheckpointedOldest. I tested this back then and there was no issue. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Robert Haas
Date:
On Wed, May 6, 2015 at 9:32 AM, Robert Haas <robertmhaas@gmail.com> wrote: > So here's a new patch, based on your latest version, which looks > reasonably committable to me. I made a lot of revisions to the > comments. I changed some variable and function names, too. I also > reworked the function that determines the accelerated freeze threshold > because I was afraid that the integer return value might not be large > enough to accommodate the uint32 result of the derating calculation, > which would then turn negative. The new logic is also, I believe, > less vulnerable to floating-point roundoff dangers. And I tightened > up the documentation. Here's a new revision of this patch. When I looked into back-patching this, it conflicted in ExecVacuum(). And the more I thought about that, the less I liked it, because I think ExecVacuum() is really intended to translate from the parse tree representation to whatever internal representation we're going to pass to vacuum, NOT to make policy decisions. So in the attached version, what I did instead is modify vacuum_set_xid_limits() to call MultiXactMemberFreezeThreshold() itself instead of trying to pass that value down via the vacuum options. It seems to me that this greatly reduces the chances of inconsistency between member-wrap vacuums and offset-wrap vacuums. The code seems more elegant, too: MultiXactMemberFreezeThreshold() now returns autovacuum_multixact_freeze_max_age rather than -1 when no special handling is needed to prevent member wraparound, and all of the callers now find this convention convenient. As a bonus, it now back-patches to 9.4 without conflicts. There are some conflicts on 9.3 but they don't look bad. This will need some testing to make sure it still works; I haven't done that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
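A minimal sketch of the calling convention described above, under assumed names rather than the committed signatures: because the threshold function falls back to the ordinary freeze max age when there is no member-space pressure, callers can simply clamp with Min() and never need a -1 special case.

/*
 * Minimal sketch (assumed names, not the committed code) of the calling
 * convention: the threshold function returns the regular freeze max age
 * when no member-space pressure exists, so callers can clamp with Min()
 * and never branch on a -1 sentinel.
 */
#include <stdint.h>

#define Min(a, b)  ((a) < (b) ? (a) : (b))

static uint32_t
member_freeze_threshold(uint32_t freeze_max_age)
{
    uint32_t effective = freeze_max_age;

    /* ... derate 'effective' here when member-space usage is high ... */

    return effective;           /* never -1; at worst, unchanged */
}

static uint32_t
effective_freeze_table_age(uint32_t multixact_freeze_table_age,
                           uint32_t autovacuum_multixact_freeze_max_age)
{
    /* Callers just take the minimum; no special-case branch is needed. */
    return Min(multixact_freeze_table_age,
               member_freeze_threshold(autovacuum_multixact_freeze_max_age));
}

int
main(void)
{
    /* With no member pressure, the configured table age wins. */
    return effective_freeze_table_age(150000000, 400000000) == 150000000 ? 0 : 1;
}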
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Thomas Munro
Date:
On Fri, May 8, 2015 at 6:00 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Robert Haas wrote: > >> It seems to me that the most obvious places where >> DetermineSafeOldestOffset() should be called are (1) at startup or >> after recovery, to initialize the value; and (2) each time we truncate >> the SLRU, to update the value. Other than that, this doesn't change. >> The startup calls are there, in apparently reasonable places, but it's >> not obvious to me how this gets called in the TruncateMultiXact path. >> Instead it seems to get set via the SetMultiXactIdLimit path. Maybe >> that's OK, but it would seem to imply that we're OK with overwriting >> old members information if that information was slated for truncation >> at the next checkpoint anyway, which seems scary. > > I considered this question in the previous commit: is it okay to > overwrite a file that is no longer used (per the limits set by vacuum) > but not yet removed (by checkpoint)? It seems to me that there is no > data-loss issue with doing that -- which is why the advance-oldest code > is called during vacuum and not during checkpoint. Given three concurrent backends that are checkpointing, vacuuming and creating multixacts, isn't there some ordering of events that could cause SlruScanDirCbRemoveMembers to delete a segment file that it judges to be the tail segment, but which has become the new head segment just before the file gets unlinked? The comment in SlruScanDirCbRemoveMembers says "... and we never increase rangeStart during any one run", but I think that a concurrent vacuum could advance the oldest multixact and therefore the oldest offset (which is what range->rangeStart was derived from), and if you're extra unlucky, a third backend that is creating multixacts could then advance nextOffset so that it points to a page *past* range->rangeStart, and we'd start blowing away the wrong data. I may have the details wrong there, but in general I can't see how you can delete just the right segment files while the head and tail pointers of this circular buffer are both moving, unless you hold the lock to stop that while you make the decision and unlink the file (which I assume is out of the question), or say 'well if we keep the head and tail pointers at least 20 pages apart, that's unlikely to happen' (which I assume is also out of the question). Now I understand the suggestion that the checkpoint code could be in charge of advancing the oldest multixact + offset. But if we did that, our new autovacuum code would get confused and keep trying to vacuum stuff that it's already dealt with, until the next checkpoint... so maybe we'd need to track both (1) the offset of the oldest multixact: the one that we use to trigger autovacuums, even though there might be even older segments still on disk, and (2) offset_of_start_of_oldest_segment_on_disk (suitably compressed): the one that we use to block wraparound in GetNewMultiXactId, so that we never write into a file that checkpoint might decide to delete. > Also, if we have extra calls in other places after Thomas' patch, > perhaps this one is not necessary. No extra calls added. -- Thomas Munro http://www.enterprisedb.com
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Robert Haas
Date:
On May 7, 2015, at 6:21 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > I may have the details wrong there, but in general I can't see how you > can delete just the right segment files while the head and tail > pointers of this circular buffer are both moving, unless you hold the > lock to stop that while you make the decision and unlink the file > (which I assume is out of the question), or say 'well if we keep the > head and tail pointers at least 20 pages apart, that's unlikely to > happen' (which I assume is also out of the question). That is exactly the sort of thing I was worried about, but I couldn't put my finger on it. I think your analysis is right. > Now I > understand the suggestion that the checkpoint code could be in charge > of advancing the oldest multixact + offset. Yeah. I think we need to pursue that angle unless somebody has a better idea. It would also address another issue I am concerned about: the current system seems to advance the oldest multixact information in shared memory differently on the primary and the standby. I'm not sure the logic actually works on the standby at all at this point, but even if it does, it seems unlikely to be right to rely on redo to do on the standby what is being done by a completely different, not-WAL-logged operation on the master. Making the checkpoint do it in both cases would fix that. > But if we did that, our new autovacuum code would get confused and > keep trying to vacuum stuff that it's already dealt with, until the > next checkpoint... so maybe we'd need to track both (1) the offset of > the oldest multixact: the one that we use to trigger autovacuums, even > though there might be even older segments still on disk, and (2) > offset_of_start_of_oldest_segment_on_disk (suitably compressed): the > one that we use to block wraparound in GetNewMultiXactId, so that we > never write into a file that checkpoint might decide to delete. That sounds plausible. ...Robert
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Thomas Munro
Date:
On Fri, May 8, 2015 at 8:32 AM, Robert Haas <robertmhaas@gmail.com> wrote: > Here's a new revision of this patch. When I looked into back-patching > this, it conflicted in ExecVacuum(). And the more I thought about > that, the less I liked it, because I think ExecVacuum() is really > intended to translate from the parse tree representation to whatever > internal representation we're going to pass to vacuum, NOT to make > policy decisions. So in the attached version, what I did instead is > modify vacuum_set_xid_limits() to call > MultiXactMemberFreezeThreshold() itself instead of trying to pass that > value down via the vacuum options. It seems to me that this greatly > reduces the chances of inconsistency between member-wrap vacuums and > offset-wrap vacuums. The code seems more elegant, too: > MultiXactMemberFreezeThreshold() now returns > autovacuum_multixact_freeze_max_age rather than -1 when no special > handling is needed to prevent member wraparound, and all of the > callers now find this convention convenient. As a bonus, it now > back-patches to 9.4 without conflicts. There are some conflicts on > 9.3 but they don't look bad. This is definitely tidier. > This will need some testing to make sure it still works; I haven't done that. I tested it with autovacuum on, and it automatically vacuumed after 25% of member space was used. I tested it with autovacuum off, and I ran a VACUUM manually every few minutes and checked that it didn't advance my database's datminmxid, until after 25% of member space was used, after which a full scan was triggered and datminmxid moved. As expected, that didn't advance the cluster-wide oldest multixact, because for that you'd need to vacuum ALL databases including non-connectable ones. (This patch doesn't update the forced-autovacuum-even-if-autovacuum-is-off feature to handle member space wraparound in GetNewMultiXactId, which is coming in a separate patch[1].) So it works as expected. [1] http://www.postgresql.org/message-id/CA+TgmobbaQpE6sNqT30+rz4UMH5aPraq20gko5xd2ZGajz1-Jg@mail.gmail.com -- Thomas Munro http://www.enterprisedb.com
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Robert Haas
Date:
On Thu, May 7, 2015 at 2:00 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > There's also a question about this function's "else" branch: > > void > MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB) > { > if (MultiXactIdPrecedes(MultiXactState->oldestMultiXactId, oldestMulti)) > SetMultiXactIdLimit(oldestMulti, oldestMultiDB); > else > DetermineSafeOldestOffset(oldestMulti); > } > > The reason the "else" is there is that I chickened out about not calling > DetermineSafeOldestOffset() (which SetMultiXactIdLimit calls) when the > oldestMulti does not advance; what if in the previous run we failed to > get to this point for whatever reason? The calls to SetMultiXactIdLimit > are seldom enough that it seemed better to be protected against previous > runs not finishing. Then again, I might be worrying over nothing and > this is just dead code. It's dead code for more pedestrian reasons: MultiXactAdvanceOldest() is called only from xlog_redo(), but does nothing if InRecovery. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Michael Paquier
Date:
On Fri, May 8, 2015 at 5:32 AM, Robert Haas wrote: > because I think ExecVacuum() is really > intended to translate from the parse tree representation to whatever > internal representation we're going to pass to vacuum, NOT to make > policy decisions. Yep. That's more or less the idea behind it. -- Michael
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Robert Haas
Date:
On Thu, May 7, 2015 at 7:58 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> Now I >> understand the suggestion that the checkpoint code could be in charge >> of advancing the oldest multixact + offset. > > Yeah. I think we need to pursue that angle unless somebody has a better idea. So here's a patch that does that. It turns out to be pretty simple: I just moved the DetermineSafeOldestOffset() calls around. As far as I can see, and that may not be far enough at this hour of the morning, we just need two: one in StartupMultiXact, so that we initialize the values correctly after reading the control file; and then another at the very end of TruncateMultiXact, so we update it after each checkpoint or restartpoint. This leaves the residual problem that autovacuum doesn't directly advance the stop point - the following checkpoint does. We could handle that by requesting a checkpoint if oldestMultiXactId is ahead of lastCheckpointedOldest by enough that a checkpoint would free up some space, although one might hope that people won't be living on the edge to quite that degree. As things are, I believe this sequence is possible: 1. The members SLRU is full all the way up to offsetStopLimit. 2. A checkpoint occurs, reaching MultiXactSetSafeTruncate(), which sets lastCheckpointedOldest. 3. Vacuum runs, calling SetMultiXactIdLimit(), calling DetermineSafeOldestOffset(), advancing MultiXactState->offsetStopLimit. 4. Since offsetStopLimit > lastCheckpointedOffset, it's now possible for someone to consume an MXID greater than offsetStopLimit, making MultiXactState->nextOffset > lastCheckpointedOffset. 5. The checkpoint from step 2, continuing on its merry way, now calls TruncateMultiXact(), which sets rangeEnd > rangeStart and blows away nearly every file in the SLRU. I haven't confirmed this yet, so I might still be all wet, especially since it is late here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Thomas Munro
Date:
On Fri, May 8, 2015 at 6:25 PM, Robert Haas <robertmhaas@gmail.com> wrote: > 1. The members SLRU is full all the way up to offsetStopLimit. > 2. A checkpoint occurs, reaching MultiXactSetSafeTruncate(), which > sets lastCheckpointedOldest. > 3. Vacuum runs, calling SetMultiXactIdLimit(), calling > DetermineSafeOldestOffset(), advancing > MultiXactState->offsetStopLimit. > 4. Since offsetStopLimit > lastCheckpointedOffset, it's now possible > for someone to consume an MXID greater than offsetStopLimit, making > MultiXactState->nextOffset > lastCheckpointedOffset. > 5. The checkpoint from step 2, continuing on its merry way, now calls > TruncateMultiXact(), which sets rangeEnd > rangeStart and blows away > nearly every file in the SLRU. I am still working on reproducing this race scenario in various different ways, including the way you described, but at step 4 I kept getting stuck, unable to create new multixacts despite having vacuum-frozen all databases (including template0) and advanced the cluster minimum mxid. I think I see why, and I think it's a bug: if you vacuum freeze all your databases, MultiXactState->oldestMultiXactId finishes up equal to MultiXactState->nextMXact. But that's not actually a multixact that exists yet, so when DetermineSafeOldestOffset calls find_multixact_start, it reads a garbage offset (all zeros in practice since pages start out zeroed) and produces a garbage value for offsetStopLimit which might incorrectly stop you from creating any more multixacts even though member space is entirely empty (but it depends on where your nextOffset happens to be at the time). I think the fix is something like "if nextMXact == oldestMultiXactId, then there are no active multixacts, so the offsetStopLimit should be set to nextOffset - (a segment's worth)". -- Thomas Munro http://www.enterprisedb.com
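A rough sketch of the fix Thomas describes, with the relevant fields collapsed into one illustrative struct and a placeholder segment size; in the real code this logic would belong in DetermineSafeOldestOffset, with the normal-case offset coming from find_multixact_start.

/*
 * Rough sketch of the proposed fix (illustrative names and a placeholder
 * segment size): when no multixact exists yet, derive the stop limit from
 * the current head instead of reading a garbage offset for a
 * not-yet-created multixact.
 */
#include <stdint.h>
#include <stdio.h>

#define MEMBERS_PER_SEGMENT  (1636 * 32)    /* placeholder segment size */

struct SketchState
{
    uint32_t nextMXact;
    uint32_t oldestMultiXactId;
    uint32_t nextOffset;
    uint32_t offsetStopLimit;
};

static void
set_offset_stop_limit(struct SketchState *s, uint32_t oldest_offset)
{
    if (s->nextMXact == s->oldestMultiXactId)
    {
        /*
         * The member space is logically empty, so leave one segment of
         * headroom behind the current head rather than trusting an offset
         * looked up for a multixact that does not exist yet.
         */
        s->offsetStopLimit = s->nextOffset - MEMBERS_PER_SEGMENT;
    }
    else
    {
        /* Normal case: stop one segment short of the oldest live offset. */
        s->offsetStopLimit = oldest_offset - MEMBERS_PER_SEGMENT;
    }
}

int
main(void)
{
    struct SketchState s = { 100, 100, 5000000, 0 };

    set_offset_stop_limit(&s, 0);
    printf("stop limit: %u\n", (unsigned) s.offsetStopLimit);
    return 0;
}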
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Alvaro Herrera
Date:
Thomas Munro wrote: > I think I see why, and I think it's a bug: if you vacuum freeze all > your databases, MultiXactState->oldestMultiXactId finishes up equal to > MultiXactState->nextMXact. Uhm, that's rather silly. > But that's not actually a multixact that exists yet, so when > DetermineSafeOldestOffset calls find_multixact_start, it reads a > garbage offset (all zeros in practice since pages start out zeroed) > and produces a garbage value for offsetStopLimit which might > incorrectly stop you from creating any more multixacts even though > member space is entirely empty (but it depends on where your > nextOffset happens to be at the time). Right. > I think the fix is something like "if nextMXact == oldestMultiXactId, > then there are no active multixacts, so the offsetStopLimit should be > set to nextOffset - (a segment's worth)". Makes sense. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Robert Haas
Date:
On Fri, May 8, 2015 at 9:55 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Thomas Munro wrote: >> I think I see why, and I think it's a bug: if you vacuum freeze all >> your databases, MultiXactState->oldestMultiXactId finishes up equal to >> MultiXactState->nextMXact. > > Uhm, that's rather silly. > >> But that's not actually a multixact that exists yet, so when >> DetermineSafeOldestOffset calls find_multixact_start, it reads a >> garbage offset (all zeros in practice since pages start out zeroed) >> and produces a garbage value for offsetStopLimit which might >> incorrectly stop you from creating any more multixacts even though >> member space is entirely empty (but it depends on where your >> nextOffset happens to be at the time). > > Right. > >> I think the fix is something like "if nextMXact == oldestMultiXactId, >> then there are no active multixacts, so the offsetStopLimit should be >> set to nextOffset - (a segment's worth)". > > Makes sense. Here's a patch that attempts to implement this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Thomas Munro
Date:
On Sat, May 9, 2015 at 2:46 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, May 8, 2015 at 9:55 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >> Thomas Munro wrote: >>> I think the fix is something like "if nextMXact == oldestMultiXactId, >>> then there are no active multixacts, so the offsetStopLimit should be >>> set to nextOffset - (a segment's worth)". >> >> Makes sense. > > Here's a patch that attempts to implement this. Thanks. I think I have managed to reproduce something like the data loss race that we were speculating about. 0. initdb, autovacuum = off, set up explode_mxact_members.c as described elsewhere in the thread. 1. Fill up the members SLRU completely (ie reach state where you can no longer create a new multixact of any size). pg_multixact/members contains 82040 files and the last one is named 14077. 2. Issue CHECKPOINT, but use a debugger to stop inside TruncateMultiXact after it has read MultiXactState->lastCheckpointedOldest and released the lock, but before it calls SlruScanDirectory to delete files... 3. Run VACUUM FREEZE in all databases (including template0). datminmxid moves. 4. Create lots of new multixacts. pg_multixact/members now contains 82041 files and the last one is named 14078 (ie one extra segment, with the highest possible segment number, which couldn't be created before vacuuming because of the one segment gap enforced by DetermineSafeOldestOffset). Segments 0000-0016 have new modified times. 5. ... allow the checkpoint started in step 2 to continue. It deletes segments, keeping only 0000-0016. The segment 14078 which contained active member data has been incorrectly deleted. -- Thomas Munro http://www.enterprisedb.com
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Robert Haas
Date:
On May 9, 2015, at 8:00 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: >> On Sat, May 9, 2015 at 2:46 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Fri, May 8, 2015 at 9:55 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >>> Thomas Munro wrote: >>>> I think the fix is something like "if nextMXact == oldestMultiXactId, >>>> then there are no active multixacts, so the offsetStopLimit should be >>>> set to nextOffset - (a segment's worth)". >>> >>> Makes sense. >> >> Here's a patch that attempts to implement this. > > Thanks. I think I have managed to reproduce something like the data > loss race that we were speculating about. > > 0. initdb, autovacuum = off, set up explode_mxact_members.c as > described elsewhere in the thread. > 1. Fill up the members SLRU completely (ie reach state where you can > no longer create a new multixact of any size). pg_multixact/members > contains 82040 files and the last one is named 14077. > 2. Issue CHECKPOINT, but use a debugger to stop inside > TruncateMultiXact after it has read > MultiXactState->lastCheckpointedOldest and released the lock, but > before it calls SlruScanDirectory to delete files... > 3. Run VACUUM FREEZE in all databases (including template0). datminmxid moves. > 4. Create lots of new multixacts. pg_multixact/members now contains > 82041 files and the last one is named 14078 (ie one extra segment, > with the highest possible segment number, which couldn't be created > before vacuuming because of the one segment gap enforced by > DetermineSafeOldestOffset). Segments 0000-0016 have new modified > times. > 5. ... allow the checkpoint started in step 2 to continue. It > deletes segments, keeping only 0000-0016. The segment 14078 which > contained active member data has been incorrectly deleted. OK. So the next question is: if you then apply the other patch, does that prevent step 4 and thereby avoid catastrophe? ...Robert
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Thomas Munro
Date:
On Sun, May 10, 2015 at 12:43 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On May 9, 2015, at 8:00 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: >>> On Sat, May 9, 2015 at 2:46 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>>> On Fri, May 8, 2015 at 9:55 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >>>> Thomas Munro wrote: >>>>> I think the fix is something like "if nextMXact == oldestMultiXactId, >>>>> then there are no active multixacts, so the offsetStopLimit should be >>>>> set to nextOffset - (a segment's worth)". >>>> >>>> Makes sense. >>> >>> Here's a patch that attempts to implement this. >> >> Thanks. I think I have managed to reproduce something like the data >> loss race that we were speculating about. >> >> 0. initdb, autovacuum = off, set up explode_mxact_members.c as >> described elsewhere in the thread. >> 1. Fill up the members SLRU completely (ie reach state where you can >> no longer create a new multixact of any size). pg_multixact/members >> contains 82040 files and the last one is named 14077. >> 2. Issue CHECKPOINT, but use a debugger to stop inside >> TruncateMultiXact after it has read >> MultiXactState->lastCheckpointedOldest and released the lock, but >> before it calls SlruScanDirectory to delete files... >> 3. Run VACUUM FREEZE in all databases (including template0). datminmxid moves. >> 4. Create lots of new multixacts. pg_multixact/members now contains >> 82041 files and the last one is named 14078 (ie one extra segment, >> with the highest possible segment number, which couldn't be created >> before vacuuming because of the one segment gap enforced by >> DetermineSafeOldestOffset). Segments 0000-0016 have new modified >> times. >> 5. ... allow the checkpoint started in step 2 to continue. It >> deletes segments, keeping only 0000-0016. The segment 14078 which >> contained active member data has been incorrectly deleted. > > OK. So the next question is: if you then apply the other patch, does that prevent step 4 and thereby avoid catastrophe? Yes, in a quick test, at step 4 I couldn't proceed. I need to prod this some more on Monday, and also see how it interacts with autovacuum's view of what work needs to be done. Here is my attempt at a summary. In master, we have 3 arbitrarily overlapping processes: 1. VACUUM advances oldest multixact and member tail. 2. CHECKPOINT observes member tail and head (separately) and then deletes storage. 3. Regular transaction observes tail, checks boundary and advances head. Information flows from 1 to 2, from 3 to 2, and from 1 to 3. 2 doesn't have a consistent view of head and tail, and doesn't prevent them moving while deleting storage, so the effect is that we can delete the wrong range of storage. With the patch, we have 3 arbitrarily overlapping processes: 1. VACUUM advances oldest multixact. 2. CHECKPOINT observes oldest multixact, deletes storage and then advances member tail. 3. Regular transaction observes member tail, checks boundary and advances member head. Information flows from 1 to 2 and from 2 to 3. Although 2 works with a snapshot of the oldest multixact which may move before it deletes storage, 2 knows that the member tail isn't moving (that is its job), and that 3 can't move the head past the tail (or rather the stop limit which is the tail minus a gap), so the effect of using an out of date oldest multixact is that we err on the side of being too conservative with our allocation of member space, which is good.
I suppose you could have a workload that eats member space really fast and checkpoints too infrequently so that you run out of space before a checkpoint advances the tail. I think that is why you were suggesting triggering checkpoints automatically in some cases. But I think that would be a pretty insane workload (I can't convince my computer to consume 2^32 member elements in under a couple of hours using the pathological explode_mxact_members.c workload, and you can't set checkpoint time above 1 hour). -- Thomas Munro http://www.enterprisedb.com
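As a back-of-envelope check of that claim (assuming the one-hour ceiling on the checkpoint interval mentioned above):

/*
 * Back-of-envelope arithmetic: to wrap the 2^32 member space within a
 * single one-hour checkpoint interval you would need to allocate roughly
 * 1.2 million member entries per second, sustained.
 */
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    uint64_t member_space = UINT64_C(1) << 32;  /* 2^32 member elements */
    uint64_t seconds = 60 * 60;                 /* one-hour checkpoint interval */

    printf("required rate: %llu members/second\n",
           (unsigned long long) (member_space / seconds));    /* ~1193046 */
    return 0;
}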
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Robert Haas
Date:
On Fri, May 8, 2015 at 10:46 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> I think the fix is something like "if nextMXact == oldestMultiXactId, >>> then there are no active multixacts, so the offsetStopLimit should be >>> set to nextOffset - (a segment's worth)". >> >> Makes sense. > > Here's a patch that attempts to implement this. I've now committed and back-patched this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Thomas Munro
Date:
On Sun, May 10, 2015 at 9:41 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Sun, May 10, 2015 at 12:43 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> OK. So the next question is: if you then apply the other patch, does that prevent step 4 and thereby avoid catastrophe? > > Yes, in a quick test, at step 4 I couldn't proceed. I need to prod > this some more on Monday, and also see how it interacts with > autovacuum's view of what work needs to be done. The code in master which handles regular autovacuums seems correct with this patch, because it measures member space usage by calling find_multixact_start itself with the oldest multixact ID (it's not dependent on anything that is updated at checkpoint time). The code in the patch at http://www.postgresql.org/message-id/CA+TgmobbaQpE6sNqT30+rz4UMH5aPraq20gko5xd2ZGajz1-Jg@mail.gmail.com would become wrong though, because it would use the (new) variable MultiXactState->oldestOffset (set at checkpoint) to measure the used member space. That means it would repeatedly launch autovacuums, even after clearing away the problem and advancing the oldest multixact ID, until the next checkpoint updates that value. In other words, it can't see its own progress immediately (which is the right approach for blocking new multixact generation, ie defer until checkpoint/truncation, but the wrong approach for triggering autovacuums). I think vacuum (SetMultiXactIdLimit?) needs to update oldestOffset, not checkpoint (DetermineSafeOldestOffset). (The reason for wanting this new value in shared memory is because GetNewMultiXactId needs to be able to check it cheaply for every call, so calling find_multixact_start every time would presumably not fly). -- Thomas Munro http://www.enterprisedb.com
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Thomas Munro
Date:
On Mon, May 11, 2015 at 2:45 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Sun, May 10, 2015 at 9:41 AM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> On Sun, May 10, 2015 at 12:43 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> OK. So the next question is: if you then apply the other patch, does that prevent step 4 and thereby avoid catastrophe? >> >> Yes, in a quick test, at step 4 I couldn't proceed. I need to prod >> this some more on Monday, and also see how it interacts with >> autovacuum's view of what work needs to be done. > > The code in master which handles regular autovacuums seems correct > with this patch, because it measures member space usage by calling > find_multixact_start itself with the oldest multixact ID (it's not > dependent on anything that is updated at checkpoint time). > > The code in the patch at > http://www.postgresql.org/message-id/CA+TgmobbaQpE6sNqT30+rz4UMH5aPraq20gko5xd2ZGajz1-Jg@mail.gmail.com > would become wrong though, because it would use the (new) variable > MultiXactState->oldestOffset (set at checkpoint) to measure the used > member space. That means it would repeatedly launch autovacuums, even > after clearing away the problem and advancing the oldest multixact ID, > until the next checkpoint updates that value. In other words, it > can't see its own progress immediately (which is the right approach > for blocking new multixact generation, ie defer until > checkpoint/truncation, but the wrong approach for triggering > autovacuums). > > I think vacuum (SetMultiXactIdLimit?) needs to update oldestOffset, > not checkpoint (DetermineSafeOldestOffset). (The reason for wanting > this new value in shared memory is because GetNewMultiXactId needs to > be able to check it cheaply for every call, so calling > find_multixact_start every time would presumably not fly). Here's a new version of the patch to do that. As before, it tracks the oldest offset in shared memory, but now that is updated in SetMultiXactIdLimit, so it is always updated at the same time as MultiXactState->oldestMultiXactId (at startup and after full scan vacuums). The value is used in the following places: 1. GetNewMultiXactId uses it to see if it needs to send PMSIGNAL_START_AUTOVAC_LAUNCHER to request autovacuums even if autovacuum is set to off. That is the main purpose of this patch. (GetNewMultiXactId *doesn't* use it for member wraparound prevention: that is based on offsetStopLimit, set by checkpoint code after truncation of physical storage.) 2. SetMultiXactIdLimit itself also uses it to send a PMSIGNAL_START_AUTOVAC_LAUNCHER signal to the postmaster (according to comments this allows immediately doing some more vacuuming upon completion if necessary). 3. ReadMultiXactCounts, called by regular vacuum and autovacuum, rather than doing its own call to find_multixact_start, now also reads it from shared memory. (Incidentally the code this replaces suffers from the problem fixed elsewhere: it can call find_multixact_start for a multixact that doesn't exist yet). Vacuum runs as expected with autovacuum off. Do you think we should be using MULTIXACT_MEMBER_DANGER_THRESHOLD as the trigger level for forced vacuums instead of MULTIXACT_MEMBER_SAFE_THRESHOLD, or something else? -- Thomas Munro http://www.enterprisedb.com
Attachment
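A minimal sketch of the flow described in points 1 and 2 above; the names, the 25% trigger, and request_autovacuum_launcher() are stand-ins for the real shared state and postmaster-signal machinery, not the patch itself.

/*
 * Minimal sketch (not the patch itself): vacuum publishes the oldest
 * member offset and, when usage has crossed the assumed 25% trigger,
 * pokes the autovacuum launcher even if autovacuum is configured off.
 * request_autovacuum_launcher() is a stand-in for the real
 * postmaster-signal call.
 */
#include <stdint.h>
#include <stdio.h>

#define MEMBER_SAFE_THRESHOLD  (UINT32_MAX / 4)   /* assumed 25% trigger */

struct SketchState
{
    uint32_t nextOffset;     /* advanced by multixact creation */
    uint32_t oldestOffset;   /* advanced here, by vacuum */
};

static void
request_autovacuum_launcher(void)
{
    /* placeholder for signalling the postmaster to start the launcher */
    printf("requesting autovacuum launcher start\n");
}

static void
set_oldest_offset(struct SketchState *s, uint32_t new_oldest_offset)
{
    s->oldestOffset = new_oldest_offset;

    /* Trigger member-exhaustion autovacuums as soon as the new tail is known. */
    if (s->nextOffset - s->oldestOffset > MEMBER_SAFE_THRESHOLD)
        request_autovacuum_launcher();
}

int
main(void)
{
    struct SketchState s = { 4000000000u, 0 };

    set_oldest_offset(&s, 100000000u);   /* still ~3.9 billion members in use */
    return 0;
}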
Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
From
Robert Haas
Date:
On Mon, May 11, 2015 at 7:56 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote: > On Mon, May 11, 2015 at 2:45 PM, Thomas Munro > <thomas.munro@enterprisedb.com> wrote: >> On Sun, May 10, 2015 at 9:41 AM, Thomas Munro >> <thomas.munro@enterprisedb.com> wrote: >>> On Sun, May 10, 2015 at 12:43 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>>> OK. So the next question is: if you then apply the other patch, does that prevent step 4 and thereby avoid catastrophe? >>> >>> Yes, in a quick test, at step 4 I couldn't proceed. I need to prod >>> this some more on Monday, and also see how it interacts with >>> autovacuum's view of what work needs to be done. >> >> The code in master which handles regular autovacuums seems correct >> with this patch, because it measures member space usage by calling >> find_multixact_start itself with the oldest multixact ID (it's not >> dependent on anything that is updated at checkpoint time). >> >> The code in the patch at >> http://www.postgresql.org/message-id/CA+TgmobbaQpE6sNqT30+rz4UMH5aPraq20gko5xd2ZGajz1-Jg@mail.gmail.com >> would become wrong though, because it would use the (new) variable >> MultiXactState->oldestOffset (set at checkpoint) to measure the used >> member space. That means it would repeatedly launch autovacuums, even >> after clearing away the problem and advancing the oldest multixact ID, >> until the next checkpoint updates that value. In other words, it >> can't see its own progress immediately (which is the right approach >> for blocking new multixact generation, ie defer until >> checkpoint/truncation, but the wrong approach for triggering >> autovacuums). >> >> I think vacuum (SetMultiXactIdLimit?) needs to update oldestOffset, >> not checkpoint (DetermineSafeOldestOffset). (The reason for wanting >> this new value in shared memory is because GetNewMultiXactId needs to >> be able to check it cheaply for every call, so calling >> find_multixact_start every time would presumably not fly). > > Here's a new version of the patch to do that. As before, it tracks > the oldest offset in shared memory, but now that is updated in > SetMultiXactIdLimit, so it is always updated at the same time as > MultiXactState->oldestMultiXactId (at startup and after full scan > vacuums). > > The value is used in the following places: > > 1. GetNewMultiXactId uses it to see if it needs to send > PMSIGNAL_START_AUTOVAC_LAUNCHER to request autovacuums even if > autovacuum is set to off. That is the main purpose of this patch. > (GetNewMultiXactId *doesn't* use it for member wraparound prevention: > that is based on offsetStopLimit, set by checkpoint code after > truncation of physical storage.) > > 2. SetMultiXactIdLimit itself also uses it to send a > PMSIGNAL_START_AUTOVAC_LAUNCHER signal to the postmaster (according to > comments this allows immediately doing some more vacuuming upon > completion if necessary). > > 3. ReadMultiXactCounts, called by regular vacuum and autovacuum, > rather than doing its own call to find_multixact_start, now also reads > it from shared memory. (Incidentally the code this replaces suffers > from the problem fixed elsewhere: it can call find_multixact_start for > a multixact that doesn't exist yet). > > Vacuum runs as expected with autovacuum off. Great. I've committed this and back-patched it to 9.3, after making your code look a little more like what I already committed for the same task, and whacking the comments around.
> Do you think we > should be using MULTIXACT_MEMBER_DANGER_THRESHOLD as the trigger level > for forced vacuums instead of MULTIXACT_MEMBER_SAFE_THRESHOLD, or > something else? No, I think you have it right. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company