BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)

From: tgarnett@panjiva.com
The following bug has been logged on the website:

Bug reference:      12990
Logged by:          Timothy Garnett
Email address:      tgarnett@panjiva.com
PostgreSQL version: 9.3.5
Operating system:   Ubuntu Linux x86_64 12.04.5 LTS
Description:

At 5:10 pm on 4/1 our production database started throwing errors like the
following (all of our hot spares were similarly corrupted):

ERROR: could not access status of transaction 303450738
DETAIL: Could not open file "pg_multixact/members/7B49": No such file or
directory.

This is possibly related to bug #8673 (though from the mailing list it sounds
like most issues there were resolved by 9.3.4, and we are on 9.3.5). Also of
note, this db cluster was pg_upgraded from 9.2 at one point (to 9.3.x, then
further upgraded to 9.3.5; see
http://www.postgresql.org/message-id/CAAS3ty+2ynCyf_YmRn6WuqSF8EmJMDypAkc7uD_EXTZJ7usOSg@mail.gmail.com
), though it doesn't seem like that's involved here.

We have a file-system level snapshot of a hot spare of the db from about 28
hours before the errors started and sufficient wal files to do point in time
recovery. We recovered a copy of the database to about 1 hour before the
errors started and promoted that to our production db (losing a couple of
hours of commits).

Walking through the point in time recovery we see the pg_multixact/members
folder fill up and eventually suffer a massive truncation when the newest
file overtakes the oldest file. The range from oldest to newest (through the
wrap point 14078) appears to be continuous.

Time   Newest-File  Oldest-File
1 am   72BA         98A4 (last Oct.)
10 am  72BC         98A4
11 am  72BD         98A4
12 pm  7C4E         98A4
1 pm   7E88         98A4
2 pm   7FFF         98A4
3 pm   884E         98A4
4 pm   905F         98A4  * here we forked our new production
4:30   94AF         98D2
5 pm   984D         98D2
5:15   98D2         9900  * now errors; all files outside of 98D2-9900 are gone

During the rapid growth we were generating 10-12 pg_multixact/members files
per minute.

After we forked off a new production we identified the workload that was
causing this rapid growth and stopped it.  We then used vacuumdb -a -F to
vacuum freeze the whole database cluster. Our presumption was that this
would free up the members files, but it only freed up a few. The current
oldest file is 9E30 (~Jan. 1st), so it moved up from 98A4, and the newest file
is 906A.  We're concerned this might be a time bomb waiting for us in the
future (though at the slower growth rate of ~16 files a day, potentially a
while in the future), as the members file namespace is still 95+% consumed
post vacuum freeze.  We do plan to upgrade to 9.4 sometime in the next
couple of months and are curious if we can use pg_upgrade or if we will need
to dump / restore the full (multi TiB) cluster.

As for the workload causing the rapid growth, it involved something like:
while [many millions of things to update / insert]
  BEGIN;
  SELECT state FROM table_A WHERE id = 1 FOR SHARE;
  if state != 'blocked'
    update / insert 2000-10000 rows in table_B, other stuff
    COMMIT;
  else
    COMMIT;
    sleep wait
being run in a bunch of connections (26) to the db.  Row 1 of table_A was
effectively being used as a share / exclusive lock: a different process would
update the state to 'blocked' in order to block the first process and manage
some state shared outside the db.
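
For illustration, that pattern can be sketched as a small libpq client (a
sketch only: table_A/table_B come from the description above, while the
column list for table_B, the batch size, and the connection settings are
placeholders):

    /*
     * Sketch of the reported workload: each of ~26 such clients repeatedly
     * takes a FOR SHARE lock on the same row of table_A before writing to
     * table_B, so that row's xmax becomes a multixact with many members.
     * Error handling is minimal; build with -lpq.
     */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <libpq-fe.h>

    int
    main(void)
    {
        PGconn     *conn = PQconnectdb("");     /* settings from PG* environment */

        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            return 1;
        }

        for (long i = 0; i < 1000000; i++)      /* "many millions of things" */
        {
            PGresult   *res;

            PQclear(PQexec(conn, "BEGIN"));
            res = PQexec(conn, "SELECT state FROM table_A WHERE id = 1 FOR SHARE");
            if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) == 1 &&
                strcmp(PQgetvalue(res, 0, 0), "blocked") != 0)
            {
                /* stand-in for the real 2000-10000 row update / insert batch */
                PQclear(PQexec(conn, "INSERT INTO table_B SELECT generate_series(1, 5000)"));
                PQclear(PQexec(conn, "COMMIT"));
            }
            else
            {
                PQclear(PQexec(conn, "COMMIT"));
                sleep(1);                       /* "sleep wait" */
            }
            PQclear(res);
        }

        PQfinish(conn);
        return 0;
    }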

We've retained the pre-failure snapshot and wal files for now (though we
will need to free them up at some point) so we can point in time recover to
any point from 28 hours before to several hours after the problem surfaced
if that's helpful at all.

Bugs / Questions:
 - the members files wrapped over themselves leading to corruption
 - why didn't vacuumdb -a -F free up more members files?
"tgarnett@panjiva.com" <tgarnett@panjiva.com> wrote:

> - why didn't vacuumdb -a -F free up more members files?

Are you sure that while the problem was developing the primary
cluster didn't have any long-running transactions (including those
left "idle in transaction" or prepared transactions)?  Was there a
checkpoint following the vacuumdb -a -F run?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
>
>
> > - why didn't vacuumdb -a -F free up more members files?
>
> Are you sure that while the problem was developing the primary
> cluster didn't have any long-running transactions (including those
> left "idle in transaction" or prepared transactions)?  Was there a
> checkpoint following the vacuumdb -a -F run?


The vacuumdb was run after we recovered from backup (which involved
restarting the server), so at the time the vacuumdb started there were no
open transactions.  There have been checkpoints since the vacuumdb
finished as well.

In the lead-up to the problem there wouldn't have been any transactions
open more than a couple of hours (the oldest members file was over 6 months
old).

Tim
On Mon, Apr  6, 2015 at 07:21:30PM +0000, tgarnett@panjiva.com wrote:
> We've retained the pre-failure snapshot and wal files for now (though we
> will need to free them up at some point) so we can point in time recover to
> any point from 28 hours before to several hours after the problem surfaced
> if that's helpful at all.
>
> Bugs / Questions:
>  - the members files wrapped over themselves leading to corruption
>  - why didn't vacuumdb -a -F free up more members files?

Wow, you did a lot of research on this and I am glad you seem to have
found a solution.  Frankly, there were so many multixact bugs in 9.3.X
that I am unable to track exactly which bugs caused which failures, when
they were fixed, and whether the fixes fixed the problem for good or
just prevented it from happening in the future.  Yes, very
discouraging.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +
tgarnett@panjiva.com wrote:

> ERROR: could not access status of transaction 303450738
> DETAIL: Could not open file "pg_multixact/members/7B49": No such file or
> directory.

Bruce and Kevin both pinged me about this issue recently.  Turns out
that I have an incomplete patch to close the problem.  Just to clarify,
this is a completely new problem, not related to #8673.

The fix is to raise an ERROR when generating a new multixact, if we
detect that doing so would get close to the oldest multixact that the
system knows about.  If that happens, the solution is to vacuum so that
the "oldest" point is advanced a bit more and you have room to generate
more multixacts.  In production, you would typically adjust the
multixact freeze parameters so that "oldest multixact" is advanced more
aggressively and you don't hit the ERROR.
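
In outline (a sketch pieced together from the messages later in this thread,
not the committed code: offsetStopLimit, MultiXactOffsetWouldWrap() and the
message texts are taken from those messages, and "remaining" is a
hypothetical local variable), the guard at multixact-creation time looks
something like this inside GetNewMultiXactId():

    /* Sketch only: refuse to allocate member space that would overrun the
     * members of the oldest multixact the system still knows about. */
    if (MultiXactOffsetWouldWrap(MultiXactState->offsetStopLimit,
                                 nextOffset, nmembers))
        ereport(ERROR,
                (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                 errmsg("multixact \"members\" limit exceeded"),
                 errdetail("This command would create a multixact with %u members, "
                           "but the remaining space is only enough for %u members.",
                           nmembers, remaining),
                 errhint("Execute a database-wide VACUUM in that database with reduced "
                         "vacuum_multixact_freeze_min_age and "
                         "vacuum_multixact_freeze_table_age settings.")));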

A fix I pushed to master (for a performance regression reported as bug
#8470) would make the problem less common, by having typical multixact
sizes be smaller.  I didn't backpatch that fix due to lack of feedback,
but since it is connected to this data-eating bug, maybe we should look
into doing so.  This problem only arises if your multixacts are larger
than 24 members on average (or something like that; I don't recall the
exact number).  That should be atypical, except that prior to the #8470
fix the multixact size is related to the number of subtransactions doing
certain operations in a loop.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Apr 13, 2015 at 5:08 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> tgarnett@panjiva.com wrote:
>
>> ERROR: could not access status of transaction 303450738
>> DETAIL: Could not open file "pg_multixact/members/7B49": No such file or
>> directory.
>
> Bruce and Kevin both pinged me about this issue recently.  Turns out
> that I have an incomplete patch to close the problem.  Just to clarify,
> this is a completely new problem, not related to #8673.
>
> The fix is to raise an ERROR when generating a new multixact, if we
> detect that doing so would get close to the oldest multixact that the
> system knows about.

I think we definitely need to do that ASAP.  And possibly then force
an immediate minor release.  Bugs that eat your data are bad, and we
have a customer hitting this completely independently of this report,
which makes this look like more than a theoretical problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Apr 15, 2015 at 12:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:

> I think we definitely need to do that ASAP.  And possibly then force
> an immediate minor release.  Bugs that eat your data are bad, and we
> have a customer hitting this completely independently of this report,
> which makes this look like more than a theoretical problem.
>
>
I defer to others on what a good timeline would be, but failing harder here
would be good.  We were somewhat lucky in discovering the issue as fast as
we did.  Post-wrap, 99.9+% of the queries to our database were returning
fine without error.  We only observed the failure when accessing two
specific rows in two different tables (there were possibly more; it's a
multi-TiB db cluster).  Had we not hit those rows for a few days instead of
a few hours, recovery would have been extremely difficult, as rolling back to
a known good state wouldn't really have been an option.

Tim
Alvaro Herrera wrote:

> The fix is to raise an ERROR when generating a new multixact, if we
> detect that doing so would get close to the oldest multixact that the
> system knows about.  If that happens, the solution is to vacuum so that
> the "oldest" point is advanced a bit more and you have room to generate
> more multixacts.  In production, you would typically adjust the
> multixact freeze parameters so that "oldest multixact" is advanced more
> aggressively and you don't hit the ERROR.

Here's a patch.  I have tested locally and it closes the issue for me.
If those affected can confirm that it stops the file removal from
happening, I'd appreciate it.

It would be much better to avoid that additional file reading, but it
seems difficult (read: I don't see how) without changing pg_control.

Note: in bootstrap.c, I had to move setting the bootstrap mode a bit earlier.
Otherwise, the is-in-bootstrap-mode test returned false while
bootstrapping multixact, so initdb would fail in the initial phase
because it tried to read pg_multixact/offsets/0000, which hasn't been
created at that point.  The amount of code that runs in the bootstrap
mode after this change that wasn't running in bootstrap mode previously
is pretty tiny and shouldn't cause any problem -- it's essentially
the whole of BootStrapXLOG().

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment
Hi Alvaro

On Tue, Apr 21, 2015 at 7:04 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Here's a patch.  I have tested locally and it closes the issue for me.
> If those affected can confirm that it stops the file removal from
> happening, I'd appreciate it.

I was also starting to look at this problem.  For what it's worth,
here's a client program that I used to generate a lot of multixact
members.  The patch seems to work correctly so far: as the offset
approached wraparound, I saw the warnings first with appropriate OID
and members remaining, and then I was blocked from creating new
multixacts.

Best regards,

--
Thomas Munro
http://www.enterprisedb.com

Attachment
On Tue, Apr 21, 2015 at 12:34 AM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:
>
> Alvaro Herrera wrote:
>
> > The fix is to raise an ERROR when generating a new multixact, if we
> > detect that doing so would get close to the oldest multixact that the
> > system knows about.  If that happens, the solution is to vacuum so that
> > the "oldest" point is advanced a bit more and you have room to generate
> > more multixacts.  In production, you would typically adjust the
> > multixact freeze parameters so that "oldest multixact" is advanced more
> > aggressively and you don't hit the ERROR.
>
> Here's a patch.  I have tested locally and it closes the issue for me.
> If those affected can confirm that it stops the file removal from
> happening, I'd appreciate it.
>

1. Do you think it makes sense to give a warning in SetMultiXactIdLimit()
if we have already reached offsetWarnLimit, as we do for multiWarnLimit?

2.
 void
 MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB)
 {
 	if (MultiXactIdPrecedes(MultiXactState->oldestMultiXactId, oldestMulti))
 		SetMultiXactIdLimit(oldestMulti, oldestMultiDB);
+	else
+		DetermineSafeOldestOffset(oldestMulti);
 }

Why do we need to update the offset stop/warn limits in the above case?
Won't it make the warning/error (pertaining to wraparound data loss)
appear before it is required?

3.
@@ -1921,6 +1957,12 @@ StartupMultiXact(void)
  */
  pageno = MXOffsetToMemberPage(offset);
  MultiXactMemberCtl->shared->latest_page_number = pageno;
+
+ /*
+ * compute the oldest member we need to keep around to avoid old member
+ * data overrun.
+ */
+ DetermineSafeOldestOffset(MultiXactState->oldestMultiXactId);
 }

Do we need to try determining the safe oldest offset during startup, considering
that we don't allow it in recovery (StartupMultiXact() seems to be called
only during recovery)?

4.
AuxiliaryProcessMain()
{
..
	/*
	 * XLOG operations
	 */
	SetProcessingMode(NormalProcessing);

	switch (MyAuxProcType)
	{
		case CheckerProcess:
			/* don't set signals, they're useless here */
			CheckerModeMain();
			proc_exit(1);		/* should never return */

		case BootstrapProcess:
			SetProcessingMode(BootstrapProcessing);
			bootstrap_signals();
			BootStrapXLOG();
			BootstrapModeMain();
			proc_exit(1);		/* should never return */

..
}

Looking at the above code, the new setting of processing mode for
BootstrapProcessing looks slightly unclear: basically, we first set the
processing mode to Normal and then set it to BootstrapProcessing.
Maybe we can add a comment there.

This solution seems like a good workaround for the problem reported;
however, ideally there should be some way (via vacuum/checkpoint) to
prevent this growth in the number of files.  I think that after your patch
gets committed, we can try to devise a better design to overcome this
problem.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> Here's a patch. I have tested locally and it closes the issue
> for me. If those affected can confirm that it stops the file
> removal from happening, I'd appreciate it.

Based on initial testing, it seems to stop file removal from
happening rather too well.  Before applying the patch I ran a test
that generated files 0000 to 1185D in the members directory.
Even setting vacuum_multixact_freeze_min_age and
vacuum_multixact_freeze_table_age very low, none of the members
files would go away with VACUUM (with and without FREEZE) followed
by CHECKPOINT. After applying the patch and starting with a fresh
initdb, with very low settings of the vacuum_multixact_* GUCs I get
the new error almost immediately, while the only file in the
members subdirectory is 0000 and it is 8kB in size.

I think the calculations might be wrong, but I'm not sure what would
make sense.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Apr 21, 2015 at 12:25 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Hi Alvaro
>
> On Tue, Apr 21, 2015 at 7:04 AM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> Here's a patch.  I have tested locally and it closes the issue for me.
>> If those affected can confirm that it stops the file removal from
>> happening, I'd appreciate it.
>
> I was also starting to look at this problem.  For what it's worth,
> here's a client program that I used to generate a lot of multixact
> members.  The patch seems to work correctly so far: as the offset
> approached wraparound, I saw the warnings first with appropriate OID
> and members remaining, and then I was blocked from creating new
> multixacts.

One thing I noticed about your patch is that it effectively halves the
amount of multixact members you can have on disk.  Sure, I'd rather
hit an error at 2^31 members than a corrupt database at 2^32 members,
but I wondered if we should try to allow the full range to be used.
I'm not sure whether there is a valid use case for such massive
amounts of pg_multixact/members data (or at least one that won't go
away if autovacuum heuristics are changed in a later patch, also I
understand that there are other recent patches that reduce member
traffic), but if the plan is to backpatch this patch then I suppose
it should ideally not halve the amount of an important resource you
can use in existing systems when people do a point upgrade.

Here's a small patch (that applies after your patch) to show how this
could be done, using three-way comparisons with an explicit boundary
to detect wraparound.  There may be other technical problems (for
example, MultiXactAdvanceNextMXact still uses
MultiXactOffsetPrecedes), or this may be a bad idea just because it
breaks with the well-established convention for wraparound detection
used by xids.

Also, I wanted to make sure I could reproduce the original
bug/corruption in unpatched master with the client program I posted.
Here are my notes on doing that (sorry if they belabour the obvious,
partly this is just me learning how SLRUs and multixacts work...):

========

Member wraparound happens after segment file "14078" (assuming default
page size, you get 32 pages per segment, and 1636 members per page
(409 groups of 4 + some extra data), and our max member offset wraps
after 0xffffffff, and 0xffffffff / 1636 / 32 = 82040 = 0x14078;
incidentally that final segment is a shorter one).
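
Written out (a standalone sketch of the same arithmetic; the constants are the
ones given above for the default 8kB page size, and the macro names mirror the
ones used in multixact.c):

    #include <stdio.h>
    #include <stdint.h>

    #define MULTIXACT_MEMBERS_PER_PAGE  1636    /* 409 groups of 4, as above */
    #define SLRU_PAGES_PER_SEGMENT      32

    int
    main(void)
    {
        uint32_t    offset = 0xFFFFFFFF;        /* last addressable member offset */
        uint32_t    pageno = offset / MULTIXACT_MEMBERS_PER_PAGE;
        uint32_t    segment = pageno / SLRU_PAGES_PER_SEGMENT;

        /* prints: members SLRU wraps in page 2625285, segment file 14078 */
        printf("members SLRU wraps in page %u, segment file %04X\n", pageno, segment);
        return 0;
    }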

Using my test client with 500 sessions and 35k loops I observed this:
it wrapped back around to writing to member file "0000" after creating
"14078", which is obviously broken, because the start of member
segment "0000" holds members for multixact ID 1, which was still in
play (it was datminmxid for template0).

Looking at the members of multixact ID 1 I see recent xids:

postgres=# select pg_get_multixact_members('1'::xid);
 pg_get_multixact_members
--------------------------
 (34238661,sh)
 (34238662,sh)
(2 rows)

Note that pg_get_multixact_members knows the correct *number* of
members for multixact ID 1, it's just that it's looking at members
from some much later multixact.  By a tedious binary search I found
it:

postgres=# select pg_get_multixact_members('17094780'::xid);
 pg_get_multixact_members
--------------------------
 ... snip ...
 (34238660,sh)
 (34238661,sh) <-- here they are!
 (34238662,sh) <--
 (34238663,sh)
 ... snip ...

After a checkpoint, I saw that all the files got deleted except for a
few consecutively named files starting at "0000", which would be
correct behavior in general, if we hadn't allowed the member offset to
wrap.  It had correctly kept the segments starting with the one
holding the members of multixact ID 1 (the cluster-wide oldest) up
until the one that corresponds to MultiXactState->nextOffset.  My test
program had blown right past member offset 0xffffffff and back to 0
and then kept going.  The truncation code isn't the problem per se.

To produce the specific error message seen by the bug reporter via
normal interactions from a test program, I think we need some magic
that I can't figure out how to do yet: we need to run a query that
accesses a multixact that has a member offset from before offset
wraparound, e.g. 0xffffffff or similar, but whose members are not on a
page that is still in memory, after a checkpoint that has unlinked the
segment file, so it can try to load it and discover that the segment
file is missing!  So a pretty complex interaction of concurrent
processes, timing and caches.

We can more artificially stimulate the error by explicitly asking for
multixact members like this though:

postgres=# select pg_get_multixact_members('10000000'::xid);
ERROR:  could not access status of transaction 10000000
DETAIL:  Could not open file "pg_multixact/members/BB55": No such file
or directory.

That's a totally valid multixact ID, obviously since it's been able to
figure out which segment to look in for its members.

Here's one that tries to open the segment that comes immediately
before "0000" in modulo numbering:

postgres=# select pg_get_multixact_members('17094770'::xid);
ERROR:  could not access status of transaction 17094770
DETAIL:  Could not open file "pg_multixact/members/14078": No such
file or directory.

If I try it with 17094779, the multixact ID immediately before the
one that has overwritten "0000", it does actually work, presumably
because its pages happen to be buffered for me so it doesn't try to
open the file (guessing here).

I don't currently believe it's necessary to reproduce that step via a
test program anyway; the root problem is clear enough just from
watching the thing wrap.

--
Thomas Munro
http://www.enterprisedb.com

Attachment
On Tue, Apr 21, 2015 at 5:16 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> Here's a patch. I have tested locally and it closes the issue
>> for me. If those affected can confirm that it stops the file
>> removal from happening, I'd appreciate it.
>
> Based on initial testing, it seems to stop file removal from
>> happening rather too well. Before applying the patch I ran a test
>> that generated files 0000 to 1185D in the members directory.
> Even setting vacuum_multixact_freeze_min_age and
> vacuum_multixact_freeze_table_age very low, none of the members
> files would go away with VACUUM (with and without FREEZE) followed
> by CHECKPOINT. After applying the patch and starting with a fresh
> initdb, with very low settings of the vacuum_multixact_* GUCs I get
> the new error almost immediately, while the only file in the
> members subdirectory is 0000 and it is 8kB in size.
>
>> I think the calculations might be wrong, but I'm not sure what would
>> make sense.

Can anyone else reproduce this?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Apr 24, 2015 at 5:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Apr 21, 2015 at 5:16 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
>> Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>> Here's a patch. I have tested locally and it closes the issue
>>> for me. If those affected can confirm that it stops the file
>>> removal from happening, I'd appreciate it.
>>
>> Based on initial testing, it seems to stop file removal from
>> happening rather too well. Before applying the patch I ran a test
>> that generated files 0000 to 1185D in the members directory.
>> Even setting vacuum_multixact_freeze_min_age and
>> vacuum_multixact_freeze_table_age very low, none of the members
>> files would go away with VACUUM (with and without FREEZE) followed
>> by CHECKPOINT. After applying the patch and starting with a fresh
>> initdb, with very low settings of the vacuum_multixact_* GUCs I get
>> the new error almost immediately, while the only file in the
>> members subdirectory is 0000 and it is 8kB in size.
>>
>> I think the calculations might be wrong, but I'm not sure what would
>> make sense.
>
> Can anyone else reproduce this?

Yes.  This happens in a fresh initdb-ed database.  We start out with
oldestOffset = 0, oldestOffsetStopLimit = 4294914944 (that's 0 -
safety margin), nextOffset = 0. Then the first attempt to generate a
new multixact ID fails, because
MultiXactOffsetPrecedes(oldestOffsetStopLimit, nextOffset + nmembers)
is true.  I guess the patch works correctly when you start out with an
unpatched database server and generate more than 2.whatever billion
members, and then you restart with the patch, and then start adding
more members until MultiXactOffsetPrecedes(...) returns true and you
get the error.

That's why I proposed not using xid-like logic, and instead using a
type of three-way comparison that allows you to see when nextOffset
would 'cross' oldestOffsetStopLimit, instead of the two-way comparison
that considers half the number-space to be in the past and half in the
future, in my earlier message.
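
Thomas's three-way test can be written as a standalone sketch like this (the
core comparison matches the excerpt quoted later in the thread; MultiXactOffset
is typedef'd locally here, and the committed backend function may differ in
detail):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t MultiXactOffset;

    /* Would advancing 'start' by 'distance' members reach or cross 'boundary',
     * in modulo-2^32 arithmetic?  Unlike an xid-style precedes() test, this
     * does not split the number space into fixed "past" and "future" halves. */
    static bool
    MultiXactOffsetWouldWrap(MultiXactOffset boundary, MultiXactOffset start,
                             uint32_t distance)
    {
        MultiXactOffset finish = start + distance;      /* wraps modulo 2^32 */

        if (start < boundary)
            return finish >= boundary || finish < start;
        else
            return finish >= boundary && finish < start;
    }

    int
    main(void)
    {
        /* fresh cluster: nextOffset 0, stop limit just below UINT32_MAX -> safe (0) */
        printf("%d\n", MultiXactOffsetWouldWrap(4294914944u, 0, 3));
        /* near the stop limit: allocating 100000 more members would cross it (1) */
        printf("%d\n", MultiXactOffsetWouldWrap(4294914944u, 4294900000u, 100000));
        return 0;
    }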

--
Thomas Munro
http://www.enterprisedb.com
Thomas Munro wrote:

> That's why I proposed not using xid-like logic, and instead using a
> type of three-way comparison that allows you to see when nextOffset
> would 'cross' oldestOffsetStopLimit, instead of the two-way comparison
> that considers half the number-space to be in the past and half in the
> future, in my earlier message.

Yeah, that bit made sense to me.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Apr 23, 2015 at 9:59 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Thomas Munro wrote:
>> That's why I proposed not using xid-like logic, and instead using a
>> type of three-way comparison that allows you to see when nextOffset
>> would 'cross' oldestOffsetStopLimit, instead of the two-way comparison
>> that considers half the number-space to be in the past and half in the
>> future, in my earlier message.
>
> Yeah, that bit made sense to me.

In addition to preventing the corruption, I think we also need a
back-patchable fix for AV to try to keep this situation from happening
in the first place.  We do not want the system to get stuck in a
situation where multixact ID (offsets) usage is low, so autovacuum does nothing,
but members usage is high, so the user just keeps getting an error whenever
they do anything that could cause a new MXID to be created.  Perhaps
if we'd thought of it we would have added yet another column to
pg_class to track the oldest offset referenced by any MXID in a
particular relation, but that sounds kind of icky on general principle
and certainly wouldn't be back-patchable.

What I think we should do is notice when members utilization exceeds
offset utilization and progressively ramp back the effective value of
autovacuum_multixact_freeze_max_age (and maybe also
vacuum_multixact_freeze_table_age and vacuum_multixact_freeze_min_age)
so that autovacuum (and maybe also manual vacuums) get progressively
more aggressive about trying to advance relminmxid.  Suppose we decide
that when the "members" space is 75% used, we've got a big problem and
want to treat autovacuum_multixact_freeze_max_age to effectively be
zero.  Conversely, suppose that when the members space is at a lower
usage percentage than the offsets space, or when it's anyway less than
autovacuum_multixact_freeze_max_age, we define that as completely
acceptable.  If we're somewhere above the "no problem" threshold but
below the "big problem" threshold, then we calculate what percentage
of the way we are from the "no problem" threshold to the "big problem"
threshold and reduce autovacuum_multixact_freeze_max_age by that
percentage.

Example: autovacuum_multixact_freeze_max_age is 25% of 2^32 and we're
using 20% of the offsets space but 40% of the members space.   We're
(40 - 20) / (75 - 20) = 36% of the way to the 75% threshold we never
want to exceed, so we reduce the effective value of
autovacuum_multixact_freeze_max_age by 36%.  In this case, that means
treating the configured value of 1073741824 as if it were 1073741824 *
(35/55) = 683290251.  Hopefully that's enough to trigger enough
vacuuming to cause some relminmxid advancement, but if not, and the
situation continues to worsen, we'll get more and more aggressive.
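
As a rough sketch of that calculation (an illustration of the proposal only,
with a hypothetical helper name; the thresholds and numbers are the ones from
the example above):

    #include <stdio.h>

    static int autovacuum_multixact_freeze_max_age = 1073741824;   /* 25% of 2^32 */

    /* Scale the effective freeze max age toward zero as members-space usage
     * climbs from the "no problem" level (offsets usage) toward the 75% cap. */
    static int
    adjusted_freeze_max_age(double offsets_used, double members_used)
    {
        const double big_problem = 0.75;
        double       no_problem = offsets_used;
        double       fraction_of_the_way;

        if (members_used <= no_problem)
            return autovacuum_multixact_freeze_max_age;     /* nothing special to do */
        if (members_used >= big_problem)
            return 0;                                       /* freeze as hard as possible */

        fraction_of_the_way = (members_used - no_problem) / (big_problem - no_problem);
        return (int) (autovacuum_multixact_freeze_max_age * (1.0 - fraction_of_the_way));
    }

    int
    main(void)
    {
        /* the worked example above: 20% of offsets used, 40% of members used */
        printf("%d\n", adjusted_freeze_max_age(0.20, 0.40));    /* ~683290251 */
        return 0;
    }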

This may not be the right proposal in detail, but I think we should do
something.  I don't like the idea of telling users that they can no
longer count on autovacuum to prevent wraparound, and that if they
don't manually tune their vacuum settings correctly, their system may
just stop working at some point.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Apr 21, 2015 at 5:16 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
>> Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>> Here's a patch. I have tested locally and it closes the issue
>>> for me. If those affected can confirm that it stops the file
>>> removal from happening, I'd appreciate it.
>>
>> Based on initial testing, it seems to stop file removal from
>> happening rather too well. Before applying the patch I ran a test
>> that generated files 0000 to 1185D in the members directory.
>> Even setting vacuum_multixact_freeze_min_age and
>> vacuum_multixact_freeze_table_age very low, none of the members
>> files would go away with VACUUM (with and without FREEZE) followed
>> by CHECKPOINT. After applying the patch and starting with a fresh
>> initdb, with very low settings of the vacuum_multixact_* GUCs I get
>> the new error almost immediately, while the only file in the
>> members subdirectory is 0000 and it is 8kB in size.
>>
>> I think the calculations might be wrong, but I'm not sure what would
>> make sense.

> Can anyone else reproduce this?

I think I see why I was seeing this and nobody else was -- I was
testing the cleanup on an otherwise quiescent cluster.  It seems
that no amount of VACUUM and CHECKPOINT will clean up the members
subdirectory in the absence of processes adding more members.  But
after performing the VACUUMs and CHECKPOINT if I start the test
program to add more multi-transactions with lots of members, *then*
the members subdirectory gets cleaned up.  That seems confusing and
a bit annoying, but is otherwise benign.  I would put "allow VACUUM
followed by CHECKPOINT to delete unneeded files from the members
subdirectory" on the "nice to have" list, but would not want to
delay a fix for the corruption issues for it.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Apr 24, 2015 at 5:34 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> I think I see why I was seeing this and nobody else was

Thomas said he reproduced it.  No?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Apr 24, 2015 at 5:34 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
>> I think I see why I was seeing this and nobody else was
>
> Thomas said he reproduced it.  No?

I should have been more clear about what I meant by "this".  Thomas
said he reproduced the immediate errors with Álvaro's patch, but if
he said anything about files in the members subdirectory not going
away with VACUUM followed by CHECKPOINT, regardless of
configuration, I missed it.  It turns out that these steps only
"prime the pump" for the files to be deleted on subsequent access
to the members SLRU.  That doesn't contribute to database
corruption, but it sure can be confusing for someone trying to
clean things up.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:
> On Thu, Apr 23, 2015 at 9:59 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > Thomas Munro wrote:
> >> That's why I proposed not using xid-like logic, and instead using a
> >> type of three-way comparison that allows you to see when nextOffset
> >> would 'cross' oldestOffsetStopLimit, instead of the two-way comparison
> >> that considers half the number-space to be in the past and half in the
> >> future, in my earlier message.
> >
> > Yeah, that bit made sense to me.
>
> In addition to preventing the corruption, I think we also need a
> back-patchable fix for AV to try to keep this situation from happening
> in the first place.

Let me push a patch to fix the corruption, and then we can think of ways
to teach autovacuum about the problem.  I'm not optimistic about that,
honestly: like all GUC settings, these are individual to each process,
and there's no way for one process to affect the values that are seen by
other processes (autovac workers).  The only idea that comes to mind is
to publish values in shared memory, and autovac workers would read them
from there instead of using normal GUC values.

> What I think we should do is notice when members utilization exceeds
> offset utilization and progressively ramp back the effective value of
> autovacuum_multixact_freeze_max_age (and maybe also
> vacuum_multixact_freeze_table_age and vacuum_multixact_freeze_min_age)
> so that autovacuum (and maybe also manual vacuums) get progressively
> more aggressive about trying to advance relminmxid.  Suppose we decide
> that when the "members" space is 75% used, we've got a big problem and
> want to treat autovacuum_multixact_freeze_max_age to effectively be
> zero.

I think we can easily determine the rate of multixact member space
consumption and compare to the rate of multixact ID consumption;
considering the historical multixact size (number of members per
multixact) it would be possible to change the freeze ages by the same
fraction, so that autovac effectively behaves as if the members
consumption rate is what is driving the freezing instead of ID
consumption rate.  That way, we don't have to jump suddenly from
"normal" to "emergency" behavior as some fixed threshold.

> This may not be the right proposal in detail, but I think we should do
> something.

No disagreement on that.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Kevin Grittner wrote:
> Robert Haas <robertmhaas@gmail.com> wrote:
> > On Fri, Apr 24, 2015 at 5:34 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> >> I think I see why I was seeing this and nobody else was
> >
> > Thomas said he reproduced it.  No?
>
> I should have been more clear about what I meant by "this".  Thomas
> said he reproduced the immediate errors with Álvaro's patch, but if
> he said anything about files in the members subdirectory not going
> away with VACUUM followed by CHECKPOINT, regardless of
> configuration, I missed it.  It turns out that these steps only
> "prime the pump" for the files to be deleted on subsequent access
> to the members SLRU.  That doesn't contribute to database
> corruption, but it sure can be confusing for someone trying to
> clean things up.

The whole matter of truncating multixact is a longish trip.  It starts
when autovacuum completes a round or VACUUM finishes processing a table;
these things call vac_update_datfrozenxid.  That routine scans pg_class
and sees if datfrozenxid or datminmxid can be advanced from their
current points; only if any of these can, vac_truncate_clog is called.
That routine calls SetMultiXactIdLimit(), which determines a new
MultiXactState->oldestMultiXactId (saved in shared memory).  The
involvement of vacuum stops here; following steps happen at checkpoint.

At checkpoint, oldestMultiXactId is saved to pg_control as part of a
checkpoint (MultiXactGetCheckptMulti); the checkpointed value is passed
back to multixact by MultiXactSetSafeTruncate, which saves it in shmem
as lastCheckpointedOldest.  The same checkpoint later calls
TruncateMultiXact which can remove files.

Note that if vac_update_datfrozenxid finds that the pg_database values
cannot be changed (during the vacuum phase), the multixact truncation
point is not advanced and checkpoint has nothing to do.  But note that
the clog counter advancing enough will also trigger multixact
truncation.
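
Condensed into a call-chain sketch (names as given above):

    VACUUM / autovacuum finishes a table or a database round
      -> vac_update_datfrozenxid()   scan pg_class; maybe advance
                                     datfrozenxid / datminmxid
      -> vac_truncate_clog()         only if something could be advanced
      -> SetMultiXactIdLimit()       new oldestMultiXactId into shared memory

    ... later, at the next checkpoint ...

    checkpoint
      -> MultiXactGetCheckptMulti()  oldestMultiXactId written to pg_control
      -> MultiXactSetSafeTruncate()  saved in shmem as lastCheckpointedOldest
      -> TruncateMultiXact()         old segment files removed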

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Apr 27, 2015 at 10:59 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Let me push a patch to fix the corruption, and then we can think of ways
> to teach autovacuum about the problem.

Sounds good to me.  Are you going to do that today?

> I'm not optimistic about that,
> honestly: like all GUC settings, these are individual to each process,
> and there's no way for one process to affect the values that are seen by
> other processes (autovac workers).  The only idea that comes to mind is
> to publish values in shared memory, and autovac workers would read them
> from there instead of using normal GUC values.

I don't think we could store values for the parameters directly in
shared memory, because I think that at least some of those GUCs are
per-session changeable.  But we might be able to store weighting
factors in shared memory that get applied to whatever the values in
the current session are.  Or else maybe each backend can just
recompute the information for itself when it needs it.

>> What I think we should do is notice when members utilization exceeds
>> offset utilization and progressively ramp back the effective value of
>> autovacuum_multixact_freeze_max_age (and maybe also
>> vacuum_multixact_freeze_table_age and vacuum_multixact_freeze_min_age)
>> so that autovacuum (and maybe also manual vacuums) get progressively
>> more aggressive about trying to advance relminmxid.  Suppose we decide
>> that when the "members" space is 75% used, we've got a big problem and
>> want to treat autovacuum_multixact_freeze_max_age to effectively be
>> zero.
>
> I think we can easily determine the rate of multixact member space
> consumption and compare to the rate of multixact ID consumption;
> considering the historical multixact size (number of members per
> multixact) it would be possible to change the freeze ages by the same
> fraction, so that autovac effectively behaves as if the members
> consumption rate is what is driving the freezing instead of ID
> consumption rate.  That way, we don't have to jump suddenly from
> "normal" to "emergency" behavior as some fixed threshold.

Right.  I think that not jumping from normal mode to emergency mode is
quite important, and was trying to describe a system that would
gradually ramp up the pressure rather than a system that would do
nothing for a while and then suddenly go ballistic.

With regard to what you've outlined here, we need to make sure that if
the multixact rate varies widely, we still clean up before we hit
autovac wraparound.  That's why I think it should be driven off of the
fraction of the available address space which is currently consumed,
not some kind of short term measure of mxact size or generation rate.
I'm not sure exactly what you have in mind here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, 27 Apr 2015 11:59:10 -0300
Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> I think we can easily determine the rate of multixact member space
> consumption and compare to the rate of multixact ID consumption;
> considering the historical multixact size (number of members per
> multixact) it would be possible to change the freeze ages by the same
> fraction, so that autovac effectively behaves as if the members
> consumption rate is what is driving the freezing instead of ID
> consumption rate.  That way, we don't have to jump suddenly from
> "normal" to "emergency" behavior as some fixed threshold.

I would like to add a data point: one of my clients has a plpgsql function
that manages to use 10 to 30 thousand multixact IDs per invocation.  It
interacts with a remote resource and sets an exception handler on a
per-item basis to catch errors on the remote call.

-dg


--
David Gould              510 282 0869         daveg@sonic.net
If simplicity worked, the world would be overrun with insects.
On Tue, Apr 28, 2015 at 4:40 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Apr 27, 2015 at 10:59 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> I think we can easily determine the rate of multixact member space
>> consumption and compare to the rate of multixact ID consumption;
>> considering the historical multixact size (number of members per
>> multixact) it would be possible to change the freeze ages by the same
>> fraction, so that autovac effectively behaves as if the members
>> consumption rate is what is driving the freezing instead of ID
>> consumption rate.  That way, we don't have to jump suddenly from
>> "normal" to "emergency" behavior as some fixed threshold.
>
> Right.  I think that not jumping from normal mode to emergency mode is
> quite important, and was trying to describe a system that would
> gradually ramp up the pressure rather than a system that would do
> nothing for a while and then suddenly go ballistic.
>
> With regard to what you've outlined here, we need to make sure that if
> the multixact rate varies widely, we still clean up before we hit
> autovac wraparound.  That's why I think it should be driven off of the
> fraction of the available address space which is currently consumed,
> not some kind of short term measure of mxact size or generation rate.
> I'm not sure exactly what you have in mind here.

Here is a work-in-progress-patch that:

(1) Fixes the boundary bug already discussed (as before, except that I
fixed up the comments in MultiXactOffsetWouldWrap() to survive
pgindent based on feedback from Kevin).

(2) Makes autovacuum adjust the effective value of
autovacuum_multixact_freeze_max_age down to zero as the fraction of
used addressable member offset range approaches 75%, and also makes
vacuum_multixact_freeze_min_age use the same value if it is lower than
the configured min age.

When I run this with my explode_mxact_members.c program, autovacuum
moves datminmxid forwards, avoiding the error.  The algorithm in
autovacuum_multixact_freeze_max_age_adjusted is possibly too simple,
but it should at least spread out wraparound autovacuums rather than
suddenly going ballistic.

Do you think it's OK that MultiXactAdvanceNextMXact still uses
MultiXactOffsetPrecedes to compare two member offsets?  Can the value
from an XLog record being replayed be more than 2.2 billion away from
MultiXactState->nextOffset?

(The attachment memberswrap-2.patch is a diff against master, and
memberswrap-2-incremental.patch is a diff against master with Alvaro's
patch applied).

--
Thomas Munro
http://www.enterprisedb.com

Attachment
Thomas Munro wrote:

> One thing I noticed about your patch is that it effectively halves the
> amount of multixact members you can have on disk.  Sure, I'd rather
> hit an error at 2^31 members than a corrupt database at 2^32 members,
> but I wondered if we should try to allow the full range to be used.

Ah, yeah, we do want the full range; that's already built in the code
elsewhere.

In this version, I used your WouldWrap function, but there was a bug in
your formulation of the call site: after the WARNING has been issued
once, it is never issued again for that wraparound cycle, because the
second time around the nextOffset has already crossed the boundary and
your routine returns false.  IMO this is wrong and the warning should be
issued every time.  To fix that problem I removed the offsetWarnLimit
altogether, and instead do WouldWrap() of the value against
offsetStopLimit minus the 20 segments.  That way, the warning is issued
continuously until the offsetStopLimit is reached (once there,
obviously, only the error is thrown, not the warning, which is correct.)

I also added a call to DetermineSafeOldestOffset() in TrimMultiXact:
as far as I can tell, this is necessary for the time when a standby
exits recovery, because when InRecovery we return early from
DetermineSafeOldestOffset() so the safe point would never get set.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment
On Tue, Apr 28, 2015 at 6:23 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Thomas Munro wrote:
>
>> One thing I noticed about your patch is that it effectively halves the
>> amount of multixact members you can have on disk.  Sure, I'd rather
>> hit an error at 2^31 members than a corrupt database at 2^32 members,
>> but I wondered if we should try to allow the full range to be used.
>
> Ah, yeah, we do want the full range; that's already built in the code
> elsewhere.
>
> In this version, I used your WouldWrap function, but there was a bug in
> your formulation of the call site: after the WARNING has been issued
> once, it is never issued again for that wraparound cycle, because the
> second time around the nextOffset has already crossed the boundary and
> your routine returns false.  IMO this is wrong and the warning should be
> issued every time.  To fix that problem I removed the offsetWarnLimit
> altogether, and instead do WouldWrap() of the value against
> offsetStopLimit minus the 20 segments.  That way, the warning is issued
> continuously until the offsetStopLimit is reached (once there,
> obviously, only the error is thrown, not the warning, which is correct.)

+1

Tomorrow I will send a separate patch for the autovacuum changes that
I sent earlier.  Let's discuss and hopefully eventually commit that
separately.

--
Thomas Munro
http://www.enterprisedb.com
On Tue, Apr 28, 2015 at 2:23 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Ah, yeah, we do want the full range; that's already built in the code
> elsewhere.
>
> In this version, I used your WouldWrap function, but there was a bug in
> your formulation of the call site: after the WARNING has been issued
> once, it is never issued again for that wraparound cycle, because the
> second time around the nextOffset has already crossed the boundary and
> your routine returns false.  IMO this is wrong and the warning should be
> issued every time.  To fix that problem I removed the offsetWarnLimit
> altogether, and instead do WouldWrap() of the value against
> offsetStopLimit minus the 20 segments.  That way, the warning is issued
> continuously until the offsetStopLimit is reached (once there,
> obviously, only the error is thrown, not the warning, which is correct.)
>
> I also added a call to DetermineSafeOldestOffset() in TrimMultiXact:
> as far as I can tell, this is necessary for the time when a standby
> exits recovery, because when InRecovery we return early from
> DetermineSafeOldestOffset() so the safe point would never get set.

Putting the period inside the parentheses here looks weird?

+                                  "This command would create a multixact with %u members, which exceeds remaining space (%u members.)",

Maybe rephrase as: "This command would create a multixact with %u
members, but the remaining space is only enough for %u members."

I don't think this should have a comma:

+                 errhint("Execute a database-wide VACUUM in that database, with reduced vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age settings.")));

This looks like excess brace-ification:

+    if (start < boundary)
+    {
+        return finish >= boundary || finish < start;
+    }
+    else
+    {
+        return finish >= boundary && finish < start;
+    }

I think this is confusing:


--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Apr 28, 2015 at 10:33 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think this is confusing:

Oops, hit send too soon.

+/*
+ * Read the offset of the first member of the given multixact.
+ */

This is confusing to me because the two subdirectories of pg_multixact
are called "members" and "offsets".  Here you are talking about the
offset of the first member.  Maybe I'm just slow, but that seems like
conflating terminology.  You end up with a function called
read_offset_for_multi() that is actually looking up information about
members.  Ick.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Thomas Munro wrote:

> > In this version, I used your WouldWrap function, [...]
>
> +1

Pushed.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I sure wish this had arrived two minutes earlier ...

Robert Haas wrote:

> Putting the period inside the parentheses here looks weird?
>
> +                                  "This command would create a multixact with %u members, which exceeds remaining space (%u members.)",
>
> Maybe rephrase as: "This command would create a multixact with %u
> members, but the remaining space is only enough for %u members."

WFM, will change.

> I don't think this should have a comma:
>
> +                 errhint("Execute a database-wide VACUUM in that database, with reduced vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age settings.")));

Ditto.

> This looks like excess brace-ification:
>
> +    if (start < boundary)
> +    {
> +        return finish >= boundary || finish < start;
> +    }
> +    else
> +    {
> +        return finish >= boundary && finish < start;
> +    }

Yeah, agreed.  Will undo that change.  (I disliked the comment above the
indented single-statement, so added braces, but then moved the comment.
I should have removed the braces at that point.)

> I think this is confusing:
>
> +/*
> + * Read the offset of the first member of the given multixact.
> + */
>
> This is confusing to me because the two subdirectories of pg_multixact
> are called "members" and "offsets".  Here you are talking about the
> offset of the first member.  Maybe I'm just slow, but that seems like
> conflating terminology.  You end up with a function called
> read_offset_for_multi() that is actually looking up information about
> members.  Ick.

Yeah, I introduced the confusing terminology while inventing multixacts
initially and have regretted it many times.  I will think about a better
name for this.  (Meanwhile, on IM Robert suggested
find_start_of_first_multi_member)

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Apr 28, 2015 at 10:56 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> I sure wish this had arrived two minutes earlier ...

Sorry about that.  :-)

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Alvaro Herrera wrote:

> > I think this is confusing:
> >
> > +/*
> > + * Read the offset of the first member of the given multixact.
> > + */
> >
> > This is confusing to me because the two subdirectories of pg_multixact
> > are called "members" and "offsets".  Here you are talking about the
> > offset of the first member.  Maybe I'm just slow, but that seems like
> > conflating terminology.  You end up with a function called
> > read_offset_for_multi() that is actually looking up information about
> > members.  Ick.
>
> Yeah, I introduced the confusing terminology while inventing multixacts
> initially and have regretted it many times.  I will think about a better
> name for this.  (Meanwhile, on IM Robert suggested
> find_start_of_first_multi_member)

Pushed.  I chose find_multixact_start() as a name for this function.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Apr 28, 2015 at 10:54 AM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:

> Alvaro Herrera wrote:
>
> > > I think this is confusing:
> > >
> > > +/*
> > > + * Read the offset of the first member of the given multixact.
> > > + */
> > >
> > > This is confusing to me because the two subdirectories of pg_multixact
> > > are called "members" and "offsets".  Here you are talking about the
> > > offset of the first member.  Maybe I'm just slow, but that seems like
> > > conflating terminology.  You end up with a function called
> > > read_offset_for_multi() that is actually looking up information about
> > > members.  Ick.
> >
> > Yeah, I introduced the confusing terminology while inventing multixacts
> > initially and have regretted it many times.  I will think about a better
> > name for this.  (Meanwhile, on IM Robert suggested
> > find_start_of_first_multi_member)
>
> Pushed.  I chose find_multixact_start() as a name for this function.
>


Starting with commit b69bf30b9bfacafc733a9ba7 and continuing to this
just-described commit, I can no longer upgrade from a 9.2.10 database using
pg_upgrade.

I can reproduce it from a clean 9.2 install which has never even been
started up.

Deleting files from new pg_multixact/offsets                ok
Setting oldest multixact ID on new cluster                  ok
Resetting WAL archives                                      ok

*failure*
Consult the last few lines of "pg_upgrade_server.log" for
the probable cause of the failure.

The last few lines are:

command: "../bisect/bin/pg_ctl" -w -l "pg_upgrade_server.log" -D
"../data2/" -o "-p 50432 -b -c synchronous_commit=off -c fsync=off -c
full_page_writes=off  -c listen_addresses='' -c
unix_socket_permissions=0700 -c
unix_socket_directories='/home/jjanes/pgsql/git'" start >>
"pg_upgrade_server.log" 2>&1
waiting for server to start....LOG:  database system was shut down at
2015-04-28 11:08:18 PDT
FATAL:  could not access status of transaction 1
DETAIL:  Could not open file "pg_multixact/offsets/0000": No such file or
directory.
LOG:  startup process (PID 3977) exited with exit code 1
LOG:  aborting startup due to startup process failure


Cheers,

Jeff
Jeff Janes wrote:

> Starting with  commit b69bf30b9bfacafc733a9ba7 and continuing to this
> just-described commit, I can no longer upgrade from a 9.2.10 database using
> pg_upgrade.

How annoying, thanks for the report.  I reproduced it here.  The problem
is that the upgrade process removes the files from pg_multixact/offsets,
which is what we now want to read on startup.  Not yet sure how to fix
it.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Jeff Janes wrote:
> Starting with  commit b69bf30b9bfacafc733a9ba7 and continuing to this
> just-described commit, I can no longer upgrade from a 9.2.10 database using
> pg_upgrade.

Here's a patch, but I don't like it too much.  Will think more about it,
probably going to push something tomorrow.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment
On Tue, Apr 28, 2015 at 6:30 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Tomorrow I will send a separate patch for the autovacuum changes that
> I sent earlier.  Let's discuss and hopefully eventually commit that
> separately.

Here is a separate patch which makes autovacuum start a wrap-around
vacuum sooner if the member space is running out, by adjusting
autovacuum_multixact_freeze_max_age using a progressive scaling
factor.  This version includes a clearer implementation of
autovacuum_multixact_freeze_max_age_adjusted provided by Kevin
Grittner off-list.

--
Thomas Munro
http://www.enterprisedb.com

Attachment
On Tue, Apr 21, 2015 at 5:12 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Apr 21, 2015 at 12:34 AM, Alvaro Herrera <alvherre@2ndquadrant.com>
> wrote:
>>
>> Alvaro Herrera wrote:
>>
>> > The fix is to raise an ERROR when generating a new multixact, if we
>> > detect that doing so would get close to the oldest multixact that the
>> > system knows about.  If that happens, the solution is to vacuum so that
>> > the "oldest" point is advanced a bit more and you have room to generate
>> > more multixacts.  In production, you would typically adjust the
>> > multixact freeze parameters so that "oldest multixact" is advanced more
>> > aggressively and you don't hit the ERROR.
>>
>> Here's a patch.  I have tested locally and it closes the issue for me.
>> If those affected can confirm that it stops the file removal from
>> happening, I'd appreciate it.
>>
>
> 1. Do you think it makes sense to give a warning in SetMultiXactIdLimit()
> if we have already reached offsetWarnLimit, as we do for multiWarnLimit?

Amit and I discussed this offline.  Yes, we could include a warning
message here, for consistency with the warnings you get about xid
wraparound.  Concretely I think it means that you would also get
warnings about being near the member space limit from vacuums,
rather than just from attempts to allocate new multixact IDs.  The
test to detect an impending members-would-wrap ERROR would be similar
to what we do in GetNewMultiXactId, so something like:

    MultiXactOffsetWouldWrap(offsetStopLimit,
                             nextOffset,
                             MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT * OFFSET_WARN_SEGMENTS)

I'm not sure whether it's worth writing an extra patch for this
though, because if you're in this situation, your logs are already
overflowing with warnings from the regular backends that are
generating multixacts.  Thoughts anyone?
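
For what it's worth, a warning built on that test might look something like
the sketch below; this is only an illustration of the idea, not proposed
code, and OFFSET_WARN_SEGMENTS is the hypothetical constant from the snippet
above:

/* Sketch only: warn when member space is nearly exhausted, reusing the
 * members-would-wrap test quoted above. */
if (MultiXactOffsetWouldWrap(offsetStopLimit, nextOffset,
                             MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT *
                             OFFSET_WARN_SEGMENTS))
    ereport(WARNING,
            (errmsg("multixact member space is running out"),
             errhint("Execute a database-wide VACUUM with reduced "
                     "vacuum_multixact_freeze_min_age and "
                     "vacuum_multixact_freeze_table_age settings.")));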

--
Thomas Munro
http://www.enterprisedb.com
On Wed, Apr 29, 2015 at 5:44 AM, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:
>
> On Tue, Apr 28, 2015 at 6:30 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
> > Tomorrow I will send a separate patch for the autovacuum changes that
> > I sent earlier.  Let's discuss and hopefully eventually commit that
> > separately.
>
> Here is a separate patch which makes autovacuum start a wrap-around
> vacuum sooner if the member space is running out, by adjusting
> autovacuum_multixact_freeze_max_age using a progressive scaling
> factor.  This version includes a clearer implementation of
> autovacuum_multixact_freeze_max_age_adjusted provided by Kevin
> Grittner off-list.
>

Some comments:

1. It seems that you are using
autovacuum_multixact_freeze_max_age_adjusted()
in only a couple of places; for example, it is not used in the calculation below:

vacuum_set_xid_limits()
{
..
mxid_freezemin = Min(mxid_freezemin,
 autovacuum_multixact_freeze_max_age / 2);
..
}

What is the reason for using this calculation in some places and
not in others?

2.
@@ -2684,8 +2719,8 @@ relation_needs_vacanalyze(Oid relid,
  : autovacuum_freeze_max_age;

  multixact_freeze_max_age = (relopts && relopts->multixact_freeze_max_age
>= 0)
- ? Min(relopts->multixact_freeze_max_age,
autovacuum_multixact_freeze_max_age)
- : autovacuum_multixact_freeze_max_age;
+ ? Min(relopts->multixact_freeze_max_age,
autovacuum_multixact_freeze_max_age_adjusted())
+ : autovacuum_multixact_freeze_max_age_adjusted();


It seems that it will try to read from the offset file for each
relation, which might or might not be good; shall we try to
cache the oldestMultiXactMemberOffset?

3. Currently there is a minimum limit on autovacuum_multixact_freeze_age
(10000000) which might not be honored by this calculation, so I am not sure
whether that can impact system performance in some cases where it is
currently working sanely.

4. Can you please share results that can show improvement
with current patch versus un-patched master?

5.
+ /*
+ * TODO: In future, could oldestMultiXactMemberOffset be stored in shmem,
+ * pg_controdata, alongside oldestMultiXactId?
+ */

You might want to write the comment as:
XXX: We can store oldestMultiXactMemberOffset in shmem, pg_controldata
alongside oldestMultiXactId?

6.
+ * Returns vacuum_multixact_freeze_max_age, adjusted down to prevent excessive
+ * use of addressable multixact member space if required.

I think here you mean autovacuum_multixact_freeze_max_age?

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Apr 28, 2015 at 4:13 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:

> Jeff Janes wrote:
> > Starting with  commit b69bf30b9bfacafc733a9ba7 and continuing to this
> > just-described commit, I can no longer upgrade from a 9.2.10 database
> using
> > pg_upgrade.
>
> Here's a patch, but I don't like it too much.  Will think more about it,
> probably going to push something tomorrow.
>

It looks like that patch is targeted to 9.4 branch.  I couldn't readily get
it to apply on HEAD.  I tested it on 9.4, and it solved the problem there.

Thanks,

Jeff
On Tue, Apr 28, 2015 at 11:24 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:
>
> Alvaro Herrera wrote:
>
>
> Pushed.  I chose find_multixact_start() as a name for this function.
>

I have run a test to verify that the latest change fixes the
reported problem; the results are below, and to me it looks like the
reported problem is fixed.

I used the test (explode_mxact_members) developed by Thomas
to reproduce the problem, and started one transaction in a session.
After running the test for 3~4 hours with the parameters
explode_mxact_members 500 35000, I could see warning messages
like the ones below (before the fix there were no such messages; the test
completed, but it had corrupted the database):

WARNING:  database with OID 1 must be vacuumed before 358 more multixact
members are used
HINT:  Execute a database-wide VACUUM in that database, with reduced
vacuum_multixact_freeze_min_age and
vacuum_multixact_freeze_table_age settings.
WARNING:  database with OID 1 must be vacuumed before 310 more multixact
members are used
HINT:  Execute a database-wide VACUUM in that database, with reduced
vacuum_multixact_freeze_min_age and
vacuum_multixact_freeze_table_age settings.
WARNING:  database with OID 1 must be vacuumed before 261 more multixact
members are used
HINT:  Execute a database-wide VACUUM in that database, with reduced
vacuum_multixact_freeze_min_age and
vacuum_multixact_freeze_table_age settings.
WARNING:  database with OID 1 must be vacuumed before 211 more multixact
members are used
HINT:  Execute a database-wide VACUUM in that database, with reduced
vacuum_multixact_freeze_min_age and
vacuum_multixact_freeze_table_age settings.
WARNING:  database with OID 1 must be vacuumed before 160 more multixact
members are used
HINT:  Execute a database-wide VACUUM in that database, with reduced
vacuum_multixact_freeze_min_age and
vacuum_multixact_freeze_table_age settings.
explode_mxact_members: explode_mxact_members.c:38: main: Assertion
`PQresultStatus(res) == PGRES_TUPLES_OK'
failed.

After this I set vacuum_multixact_freeze_min_age and
vacuum_multixact_freeze_table_age to zero and then performed a
VACUUM FREEZE of template1 and postgres, followed by a
manual CHECKPOINT.  I could see the following values in pg_database:

postgres=# select oid,datname,datminmxid from pg_database;
  oid  |  datname  | datminmxid
-------+-----------+------------
     1 | template1 |   17111262
 13369 | template0 |   17111262
 13374 | postgres  |   17111262
(3 rows)

I then started the test again as ./explode_mxact_members 500 35000, but it
immediately failed:
500 sessions connected...
Loop 0...
WARNING:  database with OID 13369 must be vacuumed before 12 more multixact
members are used
HINT:  Execute a database-wide VACUUM in that database, with reduced
vacuum_multixact_freeze_min_age and
vacuum_multixact_freeze_table_age settings.
WARNING:  database with OID 13369 must be vacuumed before 11 more multixact
members are used
HINT:  Execute a database-wide VACUUM in that database, with reduced
vacuum_multixact_freeze_min_age and
vacuum_multixact_freeze_table_age settings.
WARNING:  database with OID 13369 must be vacuumed before 9 more multixact
members are used
HINT:  Execute a database-wide VACUUM in that database, with reduced
vacuum_multixact_freeze_min_age and
vacuum_multixact_freeze_table_age settings.
explode_mxact_members: explode_mxact_members.c:38: main: Assertion
`PQresultStatus(res) == PGRES_TUPLES_OK'
failed.

It was confusing to me why it failed the next time even
though I had run VACUUM FREEZE and CHECKPOINT, but then I waited
for a minute or two and ran vacuum freeze with the command below:
./vacuumdb -a -F
vacuumdb: vacuuming database "postgres"
vacuumdb: vacuuming database "template1"

At this point I verified that all files except one had been deleted.

After that, when I restarted the test, it went perfectly fine and never
led to any warning messages, probably because the values for
vacuum_multixact_freeze_min_age and vacuum_multixact_freeze_table_age
were zero.

I am still not sure why it took some time to clean the members directory
and resume the test after running Vacuum Freeze and Checkpoint.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Apr 29, 2015 at 11:41 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> 1. It seems that you are using
> autovacuum_multixact_freeze_max_age_adjusted()
> only at couple of places, like it is not used in below calc:
>
> vacuum_set_xid_limits()
> {
> ..
> mxid_freezemin = Min(mxid_freezemin,
> autovacuum_multixact_freeze_max_age / 2);
> ..
> }
>
> What is the reason of using this calculation at some places and
> not at other places?

You're right, I will review the other places where that variable is
used and come back on this point.

> 2.
> @@ -2684,8 +2719,8 @@ relation_needs_vacanalyze(Oid relid,
>   : autovacuum_freeze_max_age;
>
>   multixact_freeze_max_age = (relopts && relopts->multixact_freeze_max_age
>>= 0)
> - ? Min(relopts->multixact_freeze_max_age,
> autovacuum_multixact_freeze_max_age)
> - : autovacuum_multixact_freeze_max_age;
> + ? Min(relopts->multixact_freeze_max_age,
> autovacuum_multixact_freeze_max_age_adjusted())
> + : autovacuum_multixact_freeze_max_age_adjusted();
>
>
> It seems that it will try to read from offset file for each
> relation which might or might not be good, shall we try to
> cache the oldestMultiXactMemberOffset?

True.  I will look into this.

> 3. currently there is some minimum limit of autovacuum_multixact_freeze_age
> (10000000)
> which might not be honored by this calculation, so not sure if that can
> impact the
> system performance in some cases where it is currently working sane.

The reason why we need to be able to set the effective freeze age
below that minimum in cases of high member data consumption rates is
that you could hit the new member space wraparound prevention error
before you consume anywhere near that many multixact IDs.  That
minimum may well be entirely reasonable if the only thing you're
worried about is multixact ID wraparound prevention.

For example, my test program eats an average of 250 members per
multixact ID when run with 500 sessions (each loop creates 500
multixact IDs having 1, 2, 3, ..., 500 members).  At that rate, you'll
run out of addressable member space after 2^32 / 250 = 17,179,869
multixact IDs.  To prevent an error condition using only the existing
multixact ID wraparound prevention machinery, we need to have an
effective max table age (so that autovacuum wakes up and scans all
tables) and min freeze age (so that it actually freezes the tuples)
below that number.  So we have to ignore the GUC minimum in this
situation.
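
As a back-of-the-envelope check of that arithmetic (purely illustrative,
nothing here is taken from the patch):

#include <stdio.h>

int
main(void)
{
    double member_space = 4294967296.0;   /* 2^32 addressable members */
    double members_per_mxid = 250.0;      /* average seen in the 500-session test */

    /* about 17.2 million multixact IDs before member space is exhausted */
    printf("%.0f\n", member_space / members_per_mxid);
    return 0;
}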

> 4. Can you please share results that can show improvement
> with current patch versus un-patched master?

Here is my test procedure: initdb, default settings except
max_connections = 600 and shared_buffers = 1280MB, "CREATE TABLE foo
AS SELECT 42 AS id;", start monitor.sh (attached) and then run
explode_mxact_members.c (attached to my first message in this thread)
with arguments 500 100000.
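
(For readers without the attachment: the real explode_mxact_members.c is
attached to my first message.  Purely as an illustration of the general
idea, with all details below being assumptions rather than the real
program, it does something along these lines: N sessions repeatedly
key-share lock the same row, so each additional locker creates a new
multixact with one more member.)

#include <stdio.h>
#include <stdlib.h>
#include <libpq-fe.h>

int
main(int argc, char **argv)
{
    int nsessions = (argc > 1) ? atoi(argv[1]) : 10;
    int nloops = (argc > 2) ? atoi(argv[2]) : 100;
    PGconn **conns = malloc(sizeof(PGconn *) * nsessions);

    for (int i = 0; i < nsessions; i++)
    {
        conns[i] = PQconnectdb("");     /* rely on PG* environment variables */
        if (PQstatus(conns[i]) != CONNECTION_OK)
        {
            fprintf(stderr, "connection %d failed: %s", i, PQerrorMessage(conns[i]));
            return 1;
        }
    }

    for (int loop = 0; loop < nloops; loop++)
    {
        for (int i = 0; i < nsessions; i++)
            PQclear(PQexec(conns[i], "BEGIN"));
        /* each additional locker widens the multixact on this row */
        for (int i = 0; i < nsessions; i++)
            PQclear(PQexec(conns[i], "SELECT id FROM foo WHERE id = 42 FOR KEY SHARE"));
        for (int i = 0; i < nsessions; i++)
            PQclear(PQexec(conns[i], "COMMIT"));
    }

    for (int i = 0; i < nsessions; i++)
        PQfinish(conns[i]);
    free(conns);
    return 0;
}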

Here are the edited highlights of the result for the patched server
(the complete results are attached):

time segments usage_fraction usage_kb oldest_mxid next_mxid next_offset
10:37:01 1 0 16 1 1 0
10:38:01 656 .0079 168368 1 1 0
10:39:01 1426 .0173 366112 1 190619 47845119
10:40:01 2201 .0268 565080 1 384730 96566980
10:41:01 2979 .0363 764760 1 578342 145163592
...
11:58:57 60455 .7368 15529256 1 12601248 3162912998
11:59:59 61184 .7457 15717760 1 12601248 3162912998
12:01:01 61918 .7547 15906288 1 12794860 3211509610
12:02:03 2178 .0265 559360 12610230 12988472 3260106222 <-- oldest_mxid moved
12:03:03 2936 .0357 753936 12610230 13182583 3308828083
12:04:03 3702 .0451 950704 12610230 13376694 3357549944
...
13:18:23 58236 .7098 14956680 12610230 24623156 1885444610
13:19:23 58965 .7187 15143520 12610230 24817267 1934166471
13:20:24 59690 .7275 15329864 12610230 25011378 1982888332
13:21:24 59709 .7278 15334664 12758932 25204990 2031484944 <-- oldest_mxid moved
13:22:25 60447 .7367 15524464 12758932 25204990 2031484944
13:23:25 60836 .7415 15624192 12833782 25399101 2080206805 <-- oldest_mxid moved
13:24:26 25068 .3055 6438312 20447005 25592713 2128803417
13:25:26 25826 .3147 6632888 20447005 25786824 2177525278
13:26:26 26583 .3240 6827456 20447005 25980935 2226247139
...
14:10:42 59115 .7205 15182368 20447005 32767834 3929758788
14:11:42 59836 .7293 15367912 20447005 32767834 3929758788
14:12:43 60571 .7383 15556680 20447005 32961446 3978355400
14:13:43 38557 .4699 9902472 25189521 33155557 4027077261 <-- oldest_mxid moved
14:14:44 39267 .4786 10084408 25189521 33349169 4075673873
14:15:44 39975 .4872 10266816 25189521 33349169 4075673873
...
14:43:56 59805 .7289 15359072 25189521 37615619 851585527
14:44:56 60492 .7373 15536024 25189521 37615619 851585527
14:45:57 60468 .7370 15529536 25342215 37809231 900182139
14:46:58 48309 .5888 12406904 28023821 38003342 948904000 <-- oldest_mxid moved
14:47:58 49010 .5973 12586784 28023821 38196954 997500612
14:48:58 49716 .6059 12768152 28023821 38391065 1046222473
...
15:02:04 58882 .7177 15122600 28023821 40136068 1484218226
15:03:04 59588 .7263 15303856 28023821 40330179 1532940087
15:04:05 60293 .7349 15484592 28023821 40523791 1581536699
15:05:06 24657 .3005 6332800 35601635 40717902 1630258560 <-- oldest_mxid moved
15:06:06 25397 .3095 6522664 35601635 40717902 1630258560
15:07:06 26140 .3186 6713296 35601635 40912013 1678980421
...
15:52:22 58685 .7153 15071912 35601635 47698912 3382492070
15:53:23 59403 .7240 15255936 35601635 47892524 3431088682
15:54:23 60118 .7327 15439648 35601635 48086635 3479810543
15:55:24 50415 .6145 12947712 37776297 48280247 3528407155 <-- oldest_mxid moved
15:56:24 51144 .6234 13135568 37776297 48280247 3528407155
15:57:25 51869 .6322 13321872 37776297 48474358 3577129016
...

Observations:

1.  Sometimes the values don't change from minute to minute,
presumably because there hasn't been a checkpoint to update
pg_controldata on disk, but hopefully we can still see what's going on
here despite the slight lag in the data.

2.  We get to somewhere in the 73-75% SLRU used range before
wraparound vacuums are triggered.  We probably need to spread things
out more than that.

3.  When the autovacuum runs, it advances oldest_mxid by different
amounts each time; that's because I'm using the adjusted freeze max
age (the max age of a table before it gets a wraparound vacuum) as our
freeze min age (the max age for individual tuples before they're
frozen) here:

@@ -1931,7 +1964,9 @@ do_autovacuum(void)
  {
  default_freeze_min_age = vacuum_freeze_min_age;
  default_freeze_table_age = vacuum_freeze_table_age;
- default_multixact_freeze_min_age = vacuum_multixact_freeze_min_age;
+ default_multixact_freeze_min_age =
+ Min(vacuum_multixact_freeze_min_age,
+ autovacuum_multixact_freeze_max_age_adjusted());
  default_multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
  }

Without that change, autovacuum would trigger repeatedly as we got
near 75% SLRU usage but not freeze anything, because
default_multixact_freeze_min_age was higher than the age of any tuples
(which had only made it to an age of around ~12 million; actually it's
not exactly the tuple age per se... I don't fully understand the
treatment of locker and updater multixact IDs in the vacuum code,
HeapTupleSatisfiesVacuum and heap_freeze_tuple etc yet so I'm not sure
exactly how that value translates into vacuum work, but I can see
experimentally that a low multixact freeze min age is needed to get
relminmxid moved forward).

It's good that freeze table age ramps down so that the autovacuum
launcher trigger point jumps around a bit and we spread the autovacuum
launches over time, but it's not great that we finish up truncating
different amounts of multixacts and associated SLRU each time.  We
could instead use a freeze min age of 0 to force freezing of *all*
tuples if this is a member-space-wraparound-prevention vacuum (that
is, if autovacuum_multixact_freeze_max_age !=
autovacuum_multixact_freeze_max_age_adjusted()).  IIUC that'd be like
running VACUUM FREEZE and would space out our wraparound vacuums as
far apart as possible since it would trim SLRU space as much as possible
when a wraparound vacuum runs, but it would presumably generate more
writes.  Maybe it doesn't matter that we drop different amounts of
multixact history each time, as long as the long term behaviour works.
Thoughts?
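
(In code, the alternative sketched above would look roughly like this;
a sketch of the idea only, not the patch as posted:)

/* Sketch: force full freezing only for member-space-wraparound vacuums. */
if (autovacuum_multixact_freeze_max_age !=
    autovacuum_multixact_freeze_max_age_adjusted())
    default_multixact_freeze_min_age = 0;   /* behave like VACUUM FREEZE */
else
    default_multixact_freeze_min_age = vacuum_multixact_freeze_min_age;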

There is less to say about the results with an unpatched server: it
drives in a straight line for a while, and then crashes into a wall
(ie the new error preventing member wraparound), which I see you have
also reproduced.  It's used up all of the circular member space, but
only has around 17 million multixacts so autovacuum can't help you
(it's not even possible to set autovacuum_multixact_freeze_max_age
below 100 million), so to get things moving again you need to manually
VACUUM FREEZE all databases including template databases.

> 5.
> + /*
> + * TODO: In future, could oldestMultiXactMemberOffset be stored in shmem,
> + *
> pg_controdata, alongside oldestMultiXactId?
> + */
>
> You might want to write the comment as:
> XXX: We can store oldestMultiXactMemberOffset in shmem, pg_controldata
> alongside oldestMultiXactId?

Thanks, done.

> 6.
> + * Returns vacuum_multixact_freeze_max_age, adjusted down to prevent
> excessive use
> + * of addressable
> multixact member space if required.
>
> I think here you mean autovacuum_multixact_freeze_max_age?

Thanks, done.

FWIW, in some future release, I think we should consider getting a
bigger multixact member address space that wraps around at 2^48 or
2^64 instead of 2^32, so that we can sidestep the whole business and
go back to having just xid and mxid wraparounds to worry about.
pg_multixact/offsets would be 50% or 100% bigger (an extra byte or two
per multixact), but it's not very big.  pg_multixact/members would be
no bigger for any workload that currently works without hitting the
wraparound error, but could grow bigger if needed.

--
Thomas Munro
http://www.enterprisedb.com

Attachment
On Tue, Apr 28, 2015 at 7:13 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Jeff Janes wrote:
>> Starting with  commit b69bf30b9bfacafc733a9ba7 and continuing to this
>> just-described commit, I can no longer upgrade from a 9.2.10 database using
>> pg_upgrade.
>
> Here's a patch, but I don't like it too much.  Will think more about it,
> probably going to push something tomorrow.

What don't you like about it?  We should get something committed here;
it's not good for the back-branches to be in a state where pg_upgrade
will break.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:
> On Tue, Apr 28, 2015 at 7:13 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> > Jeff Janes wrote:
> >> Starting with  commit b69bf30b9bfacafc733a9ba7 and continuing to this
> >> just-described commit, I can no longer upgrade from a 9.2.10 database using
> >> pg_upgrade.
> >
> > Here's a patch, but I don't like it too much.  Will think more about it,
> > probably going to push something tomorrow.
>
> What don't you like about it?  We should get something committed here;
> it's not good for the back-branches to be in a state where pg_upgrade
> will break.

Yeah, I managed to find a real fix which I will push shortly.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Jeff Janes wrote:
> On Tue, Apr 28, 2015 at 4:13 PM, Alvaro Herrera <alvherre@2ndquadrant.com>
> wrote:
>
> > Jeff Janes wrote:
> > > Starting with  commit b69bf30b9bfacafc733a9ba7 and continuing to
> > > this just-described commit, I can no longer upgrade from a 9.2.10
> > > database using pg_upgrade.
> >
> > Here's a patch, but I don't like it too much.  Will think more about it,
> > probably going to push something tomorrow.
>
> It looks like that patch is targeted to 9.4 branch.  I couldn't readily get
> it to apply on HEAD.  I tested it on 9.4, and it solved the problem there.

Yeah, I wrote it in 9.3.  However, it was wrong; or at least there's a
better way to formulate it, and the new formulation applies without
conflict from 9.3 to master.  So I pushed that instead.

Thanks!

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Apr 30, 2015 at 5:17 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Wed, Apr 29, 2015 at 11:41 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> 1. It seems that you are using
>> autovacuum_multixact_freeze_max_age_adjusted()
>> only at couple of places, like it is not used in below calc:
>>
>> vacuum_set_xid_limits()
>> {
>> ..
>> mxid_freezemin = Min(mxid_freezemin,
>> autovacuum_multixact_freeze_max_age / 2);
>> ..
>> }
>>
>> What is the reason of using this calculation at some places and
>> not at other places?
>
> You're right, I will review the other places where that variable is
> used and come back on this point.

Those other places are for capping the effective table and tuple
multixact freeze ages for manual vacuums, so that manual vacuums (say
in nightly cronjobs) get a chance to run wraparound scans before
autovacuum kicks in at a less convenient time.  So, yeah, I think we
want to incorporate member wraparound prevention into that logic, and
I will add that in the next version of the patch.

>> 2.
>> @@ -2684,8 +2719,8 @@ relation_needs_vacanalyze(Oid relid,
>>   : autovacuum_freeze_max_age;
>>
>>   multixact_freeze_max_age = (relopts && relopts->multixact_freeze_max_age
>>>= 0)
>> - ? Min(relopts->multixact_freeze_max_age,
>> autovacuum_multixact_freeze_max_age)
>> - : autovacuum_multixact_freeze_max_age;
>> + ? Min(relopts->multixact_freeze_max_age,
>> autovacuum_multixact_freeze_max_age_adjusted())
>> + : autovacuum_multixact_freeze_max_age_adjusted();
>>
>>
>> It seems that it will try to read from offset file for each
>> relation which might or might not be good, shall we try to
>> cache the oldestMultiXactMemberOffset?
>
> True.  I will look into this.

The version attached reuses the value by passing it as a new parameter
to the functions relation_needs_vacanalyze and table_recheck_autovac.

[...]

>> 4. Can you please share results that can show improvement
>> with current patch versus un-patched master?
> [...]
> 2.  We get to somewhere in the 73-75% SLRU used range before
> wraparound vacuums are triggered.  We probably need to spread things
> out more that that.

The version attached should spread the work out a lot better.  Instead
of using a proportion of your autovacuum_multixact_freeze_max_age, it
uses a proportion of the number of active multixacts (see
compute_max_multixact_age_to_avoid_member_wrap).

> 3.  When the autovacuum runs, it advances oldest_mxid by different
> amounts each time; that's because I'm using the adjusted freeze max
> age (the max age of a table before it gets a wraparound vacuum) as our
> freeze min age (the max age for individual tuples before they're
> frozen) here:
>
> @@ -1931,7 +1964,9 @@ do_autovacuum(void)
>   {
>   default_freeze_min_age = vacuum_freeze_min_age;
>   default_freeze_table_age = vacuum_freeze_table_age;
> - default_multixact_freeze_min_age = vacuum_multixact_freeze_min_age;
> + default_multixact_freeze_min_age =
> + Min(vacuum_multixact_freeze_min_age,
> + autovacuum_multixact_freeze_max_age_adjusted());
>   default_multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
>   }

That was just completely wrong...  removed, and replaced with code
that sets multixact_freeze_min_age to zero in table_recheck_autovac in
this scenario.

I will post another version with some more test results soon, but
wanted to share this WIP patch and respond to the above questions.

--
Thomas Munro
http://www.enterprisedb.com

Attachment
On Fri, May 1, 2015 at 6:51 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Those other places are for capping the effective table and tuple
> multixact freeze ages for manual vacuums, so that manual vacuums (say
> in nightly cronjobs) get a chance to run a wraparound scans before
> autovacuum kicks in at a less convenient time.  So, yeah, I think we
> want to incorporate member wraparound prevention into that logic, and
> I will add that in the next version of the patch.

+1.  On a quick read-through of the patch, the biggest thing that
jumped out at me was that it only touches the autovacuum logic.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, May 2, 2015 at 7:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, May 1, 2015 at 6:51 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> Those other places are for capping the effective table and tuple
>> multixact freeze ages for manual vacuums, so that manual vacuums (say
>> in nightly cronjobs) get a chance to run a wraparound scans before
>> autovacuum kicks in at a less convenient time.  So, yeah, I think we
>> want to incorporate member wraparound prevention into that logic, and
>> I will add that in the next version of the patch.
>
> +1.  On a quick read-through of the patch, the biggest thing that
> jumped out at me was that it only touches the autovacuum logic.

Here's a new version which sets up the multixact parameters in
ExecVacuum for regular VACUUM commands just like it does for
autovacuum if needed.  When computing
max_multixact_age_to_avoid_member_wrap for a manual vacuum, it uses
lower constants, so that any manually scheduled vacuums get a chance
to deal with some of this problem before autovacuum has to.  Here are
the arbitrary constants currently used:  at 50% member address space
usage, autovacuum starts wraparound scan of tables with the oldest
active multixacts, and then younger ones as the usage increases, until
at 75% usage it vacuums with multixact_freeze_table_age = 0; for
manual VACUUM those numbers are halved so that it has a good head
start.
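
To illustrate the shape of that policy, here is a sketch of the behaviour
just described; the ramp, the types, and the parameter handling are all
simplifications rather than the attached patch:

/*
 * Sketch: returns -1 when member usage is below the "safe" threshold (no
 * special action needed), 0 at or beyond the "dangerous" threshold (vacuum
 * everything), and otherwise a maximum table age that ramps down from the
 * number of currently active multixacts.  Thresholds are halved for manual
 * VACUUM so it gets a head start.
 */
static int
compute_max_multixact_age_to_avoid_member_wrap_sketch(bool manual,
                                                      double member_space_used,   /* 0.0 .. 1.0 */
                                                      unsigned int active_multixacts)
{
    double safe = manual ? 0.25 : 0.50;
    double dangerous = manual ? 0.375 : 0.75;

    if (member_space_used <= safe)
        return -1;              /* no member-space pressure */
    if (member_space_used >= dangerous)
        return 0;               /* freeze_table_age = 0: vacuum everything */

    /* Linear ramp: start with only the tables holding the oldest multixacts. */
    return (int) (active_multixacts *
                  (1.0 - (member_space_used - safe) / (dangerous - safe)));
}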

Halving the thresholds for manual vacuums may be too
much of a head start, but if we give it only a small head start, it
seems to me that you'd finish up with slim odds of actually getting
the wraparound scans done by your scheduled vacuum job.  A head start
of 25% of the usage before autovacuum starts on tables with the oldest
relminmxids creates a target big enough that you might hit it with
your vacuum cronjob.  Also, as far as I can tell, manual vacuums will
only help you get the wraparound scans on your *connectable* databases
done at a time that suits you better; you'll still be dependent on
autovacuum to deal with non-connectable databases, in other words
template0.  In practice I guess if you run vacuumdb -a at midnight
you'll see all the pg_database.datminmxid values advance except for
template0's, and then some time later, most likely during busy load
time producing many multixact members, member space usage will finally
hit 50%, autovacuum will very quickly process template0, the
cluster-wide oldest mxid will finally advance, and then segment files
will be deleted at the next checkpoint.  Or am I missing something?

Also attached is the output of the monitor.sh script posted upthread,
while running explode_mxact_members.c.  It looks better than the last
results to me: whenever usage reaches 50%, autovacuum advances things
such that usage drops right back to 0% (because it now uses
multixact_freeze_min_age = 0), and the system will happily chug on
forever.  What this test doesn't really show adequately is that if you
had a lot of different tables and databases with different relminmxid
values, they'd be vacuumed at different times.  I should probably come
up with a way to demonstrate that...

--
Thomas Munro
http://www.enterprisedb.com

Attachment
On Thu, Apr 30, 2015 at 10:47 AM, Thomas Munro <
thomas.munro@enterprisedb.com> wrote:
>
> On Wed, Apr 29, 2015 at 11:41 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
>
> > 3. currently there is some minimum limit of
autovacuum_multixact_freeze_age
> > (10000000)
> > which might not be honored by this calculation, so not sure if that can
> > impact the
> > system performance in some cases where it is currently working sane.
>
> The reason why we need to be able to set the effective freeze age
> below that minimum in cases of high member data consumption rates is
> that you could hit the new member space wraparound prevention error
> before you consume anywhere near that many multixact IDs.  That
> minimum may well be entirely reasonable if the only thing you're
> worried about is multixact ID wraparound prevention.
>
> For example, my test program eats an average of 250 members per
> multixact ID when run with 500 sessions (each loop creates 500
> multixact IDs having 1, 2, 3, ..., 500 members).  At that rate, you'll
> run out of addressable member space after 2^32 / 250 = 17,179,869
> multixact IDs.  To prevent an error condition using only the existing
> multixact ID wraparound prevention machinery, we need to have an
> effective max table age (so that autovacuum wakes up and scans all
> tables) and min freeze age (so that it actually freezes the tuples)
> below that number.  So we have to ignore the GUC minimum in this
> situation.
>

I understand that point, but I mentioned it because if there is some specific
reason for keeping the current minimum value, then we should verify
that we have not broken it by not honouring the minimum value of the
GUC.  As far as I can see from the code, there seems to be one place
(refer to the code below) where that value is used to calculate the warning
limit for multixacts, and the current patch doesn't seem to have any impact
on the same.

SetMultiXactIdLimit()
{
..
multiWarnLimit = multiStopLimit - 10000000;
}


> ...
>
> Observations:
>
> 1.  Sometimes the values don't change from minute to minute,
> presumably because there hasn't been a checkpoint to update
> pg_controldata on disk, but hopefully we can still see what's going on
> here despite the slight lag in the data.
>

Yeah, and I think this means that there will be no advancement of the oldest
multixact ID and no deletion of files until the next checkpoint, e.g. when
checkpoints are only triggered by checkpoint_timeout.  I think there is no
harm in mentioning this in the documentation if it is not already there.

> 2.  We get to somewhere in the 73-75% SLRU used range before
> wraparound vacuums are triggered.  We probably need to spread things
> out more that that.
>
> 3.  When the autovacuum runs, it advances oldest_mxid by different
> amounts each time; that's because I'm using the adjusted freeze max
> age (the max age of a table before it gets a wraparound vacuum) as our
> freeze min age (the max age for individual tuples before they're
> frozen) here:
>
> @@ -1931,7 +1964,9 @@ do_autovacuum(void)
>   {
>   default_freeze_min_age = vacuum_freeze_min_age;
>   default_freeze_table_age = vacuum_freeze_table_age;
> - default_multixact_freeze_min_age = vacuum_multixact_freeze_min_age;
> + default_multixact_freeze_min_age =
> + Min(vacuum_multixact_freeze_min_age,
> + autovacuum_multixact_freeze_max_age_adjusted());
>   default_multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
>   }
>
> Without that change, autovacuum would trigger repeatedly as we got
> near 75% SLRU usage but not freeze anything, because
> default_multixact_freeze_min_age was higher than the age of any tuples
> (which had only made it to an age of around ~12 million; actually it's
> not exactly the tuple age per se... I don't fully understand the
> treatment of locker and updater multixact IDs in the vacuum code,
> HeapTupleSatisfiesVacuum and heap_freeze_tuple etc yet so I'm not sure
> exactly how that value translates into vacuum work, but I can see
> experimentally that a low multixact freeze min age is needed to get
> relminxmid moved forward).
>
> It's good that freeze table age ramps down so that the autovacuum
> launcher trigger point jumps around a bit and we spread the autovacuum
> launches over time, but it's not great that we finish up truncating
> different amounts of multixacts and associated SLRU each time.  We
> could instead use a freeze min age of 0 to force freezing of *all*
> tuples if this is a member-space-wraparound-prevention vacuum (that
> is, if autovacuum_multixact_freeze_max_age !=
> autovacuum_multixact_freeze_max_age_adjusted()).

We already cap vacuum_multixact_freeze_min_age at half of
autovacuum_multixact_freeze_max_age so that autovacuums to
prevent MultiXact wraparound won't occur too frequently, as per the
code below:

vacuum_set_xid_limits()
{
..
mxid_freezemin = Min(mxid_freezemin,
                     autovacuum_multixact_freeze_max_age / 2);
Assert(mxid_freezemin >= 0);
..
}

Now if we set it to zero, then I think it might lead to excessive
freezing and in turn more I/O without there actually being a need
for more multixact member space.

>
> There is less to say about the results with an unpatched server: it
> drives in a straight line for a while, and then crashes into a wall
> (ie the new error preventing member wraparound), which I see you have
> also reproduced.  It's used up all of the circular member space, but
> only has around 17 million multixacts so autovacuum can't help you
> (it's not even possible to set autovacuum_multixact_freeze_max_age
> below 100 million), so to get things moving again you need to manually
> VACUUM FREEZE all databases including template databases.
>

In my tests, on setting the vacuum multixact parameters
(vacuum_multixact_freeze_table_age and vacuum_multixact_freeze_min_age)
to zero, the test finished successfully (no warnings, and I could
see truncation of files in the members directory), so one might argue
that in many cases one could reclaim member space by
just setting appropriate values for the vacuum_multixact_* parameters.  But I
feel it is better to have some auto-adjustment algorithm like this patch is
trying to add, so that even if those values are not set appropriately, it can
avoid the wraparound error.  I think the only thing we might need to be
cautious about is that the new calculation should not make things worse (less
aggressive) in cases where lower values of the vacuum_multixact_* parameters
are already working fine.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, May 2, 2015 at 11:46 AM, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:
>
> On Sat, May 2, 2015 at 7:08 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Fri, May 1, 2015 at 6:51 AM, Thomas Munro
> > <thomas.munro@enterprisedb.com> wrote:
> >> Those other places are for capping the effective table and tuple
> >> multixact freeze ages for manual vacuums, so that manual vacuums (say
> >> in nightly cronjobs) get a chance to run a wraparound scans before
> >> autovacuum kicks in at a less convenient time.  So, yeah, I think we
> >> want to incorporate member wraparound prevention into that logic, and
> >> I will add that in the next version of the patch.
> >
> > +1.  On a quick read-through of the patch, the biggest thing that
> > jumped out at me was that it only touches the autovacuum logic.
>
>
> Also attached is the output of the monitor.sh script posted upthread,
> while running explode_mxact_members.c.  It looks better than the last
> results to me: whenever usage reaches 50%, autovacuum advances things
> such that usage drops right back to 0% (because it now uses
> multixact_freeze_min_age = 0) , and the system will happily chug on
> forever.  What this test doesn't really show adequately is that if you
> had a lot of different tables and databases with different relminmxid
> values, they'd be vacuumed at different times.  I should probably come
> up with a way to demonstrate that...
>

About data, I have extracted parts where there is a change in
oldest_mxid and segments

time segments usage_fraction usage_kb oldest_mxid next_mxid next_offset

13:48:36 1 0 16 1 1 0
13:49:36 369 .0044 94752 1 1 0
..
14:44:04 41703 .5083 10713400 1 8528909 2140755909

14:45:05 1374 .0167 352960 8573819 8722521 2189352521
..
15:37:16 41001 .4997 10529528 8573819 17060811 4282263311
..
15:38:16 709 .0086 182056 17132168 17254423 35892627
..
16:57:15 41440 .5051 10644712 17132168 25592713 2128803417
..
16:58:16 1120 .0136 287416 25695507 25786824 2177525278

Based on this data, it seems that truncation of member space
as well as advancement of the oldest multixact ID happens once
usage reaches 50%, and at that point the segment count drops to almost
zero.  This repeats roughly every hour, and in between there
is no progress, which indicates that all the work happens in
one go rather than being spread out.  Won't this choke the system
with I/O when it happens?  Isn't it better to design it in a way such
that the work is spread over a period of time rather than doing everything
in one go?

--
+int
+compute_max_multixact_age_to_avoid_member_wrap(bool manual)
{
..
+ if (members <= safe_member_count)
+ {
+ /*
+ * There is no danger of member wrap, so return a number that is not
+ * lower than autovacuum_multixact_freeze_max_age.
+ */
+ return -1;
+ }
..

The above code doesn't seem to match its comment.
The comment says "..not lower than autovacuum_multixact_freeze_max_age",
but it then returns -1.  It seems to me that we should return
autovacuum_multixact_freeze_max_age unchanged here, as it was coded in the
initial version of the patch.  Do you have any specific reason to change it?


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sun, May 3, 2015 at 4:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> About data, I have extracted parts where there is a change in
> oldest_mxid and segments
>
> time segments usage_fraction usage_kb oldest_mxid next_mxid next_offset
>
> 13:48:36 1 0 16 1 1 0
> 13:49:36 369 .0044 94752 1 1 0
> ..
> 14:44:04 41703 .5083 10713400 1 8528909 2140755909
>
> 14:45:05 1374 .0167 352960 8573819 8722521 2189352521
> ..
> 15:37:16 41001 .4997 10529528 8573819 17060811 4282263311
> ..
> 15:38:16 709 .0086 182056 17132168 17254423 35892627
> ..
> 16:57:15 41440 .5051 10644712 17132168 25592713 2128803417
> ..
> 16:58:16 1120 .0136 287416 25695507 25786824 2177525278
>
> Based on this data, it seems that truncation of member space
> as well as advancement of oldest multixact id happens once
> it reaches 50% usage and at that time segments drops down to almost
> zero.  This happens repeatedly after 1 hour and in-between there
> is no progress which indicates that all the work happens at
> one go rather than in spreaded way. Won't this choke the system
> when it happens due to I/O, isn't it better if we design it in a way such
> that it is spreaded over period of time rather than doing everything at
> one go?

At 50% usage it starts vacuuming tables that have the very oldest mxid
in the system.  That is, at 50% usage, we try to select the smallest
non-zero fraction of tables that can be selected based on relminmxid
alone, with any luck exactly one or close to it.  As usage increases,
we decrease the cutoff age slowly so we start selecting tables with
slightly newer relminmxid values for vacuuming too.  Only if it
reaches 75% usage will it vacuum everything it can and eat all your
IO.  In a real system, I suppose there would be lots of big tables
that need to be vacuumed, the vacuuming would take some time, and they
would tend to have different relminmxids so that vacuuming would be
effectively spread out by this selection algorithm.  I think.  Perhaps
we should devise a test procedure to try to see if that happens.  We'd
need to create a bunch of big tables and modify monitor.sh to show the
relminmxid for each of them so that you could see when they are being
processed -- I will look into that.

Restricting ourselves to selecting tables to vacuum using their
relminmxid alone makes this patch small since autovacuum already works
that way.  We *could* introduce code that would be able to spread out
the work of vacuuming tables that happen to have identical or very
close relminmxid (say by introducing some non-determinism or doing
something weird based on hashing table oids and the time to explicitly
spread the start of processing over time, or <your idea here>), but I
didn't want to propose anything too big/complicated/clever/stupid and
I suspect that the relminmxid values will tend to diverge over time
(but I could be wrong about that, if they all start at 1 and then move
forward in lockstep over long periods of time then what I propose is
not good enough... let's see if we can find out).

> --
> +int
> +compute_max_multixact_age_to_avoid_member_wrap(bool manual)
> {
> ..
> + if (members <= safe_member_count)
> + {
> + /*
> + * There is no danger of
> member wrap, so return a number that is not
> + * lower than autovacuum_multixact_freeze_max_age.
> +
> */
> + return -1;
> + }
> ..
>
> The above code doesn't seem to match its comments.
> Comment says "..not lower than autovacuum_multixact_freeze_max_age",
> but then return -1.  It seems to me here we should return unchanged
> autovacuum_multixact_freeze_max_age as it was coded in the initial
> version of patch.  Do you have any specific reason to change it?

Oops, the comment is fixed in the attached patch.

In an earlier version, I was only dealing with the autovacuum case.
Now that the VACUUM command also calls it, I didn't want this
compute_max_multixact_age_to_avoid_member_wrap function to assume that
it was being called by autovacuum code and return the
autovacuum-specific GUC in the case that no special action is needed.
Also, the function no longer computes a value by scaling
autovacuum_multixact_freeze_max_age, it now scales the current number
of active multixacts, so that we can begin selecting a small non-zero
number of tables to vacuum as soon as we exceed safe_member_count as
described above (whereas when we used a scaled down
autovacuum_multixact_freeze_max_age, we usually didn't select any
tables at all until we scaled it down a lot, ie until we got close to
dangerous_member_count).  Finally, I wanted a special value like -1
for 'none' so that table_recheck_autovac and ExecVacuum could use a
simple test >= 0 to know that they also need to set
multixact_freeze_min_age to zero in the case of a
member-space-triggered vacuum, so that we get maximum benefit from our
table scans by freezing all relevant tuples, not just some older ones
(that's why you see usage drop to almost zero each time, whereas the
monitor.sh results I showed from the earlier patch trimmed usage by
varying amounts, which meant that autovacuum wraparounds would need to
be done again sooner).  Does that make sense?
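
(For illustration, the way a caller such as table_recheck_autovac or
ExecVacuum might use that sentinel; a sketch of the idea described above,
not the patch itself:)

int max_age = compute_max_multixact_age_to_avoid_member_wrap(false);

if (max_age >= 0)
{
    /* member-space-triggered vacuum: clamp the table age and freeze everything */
    multixact_freeze_max_age = Min(multixact_freeze_max_age, max_age);
    multixact_freeze_min_age = 0;
}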

--
Thomas Munro
http://www.enterprisedb.com

Attachment
On Mon, May 4, 2015 at 5:19 AM, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:
>
> On Sun, May 3, 2015 at 4:40 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
> > --
> > +int
> > +compute_max_multixact_age_to_avoid_member_wrap(bool manual)
> > {
> > ..
> > + if (members <= safe_member_count)
> > + {
> > + /*
> > + * There is no danger of
> > member wrap, so return a number that is not
> > + * lower than autovacuum_multixact_freeze_max_age.
> > +
> > */
> > + return -1;
> > + }
> > ..
> >
> > The above code doesn't seem to match its comments.
> > Comment says "..not lower than autovacuum_multixact_freeze_max_age",
> > but then return -1.  It seems to me here we should return unchanged
> > autovacuum_multixact_freeze_max_age as it was coded in the initial
> > version of patch.  Do you have any specific reason to change it?
>
> Oops, the comment is fixed in the attached patch.
>
> In an earlier version, I was only dealing with the autovacuum case.
> Now that the VACUUM command also calls it, I didn't want this
> compute_max_multixact_age_to_avoid_member_wrap function to assume that
> it was being called by autovacuum code and return the
> autovacuum-specific GUC in the case that no special action is needed.
> Also, the function no longer computes a value by scaling
> autovacuum_multixact_freeze_max_age, it now scales the current number
> of active multixacts, so that we can begin selecting a small non-zero
> number of tables to vacuum as soon as we exceed safe_member_count as
> described above

I am slightly worried that if, for scaling, we don't consider the values of
the multixact_*_age parameters configured by the user, VACUUM/autovacuum might
behave totally differently from what the user is expecting.  Basically
it will be dominated by member space usage and will ignore
the values the user has set for the multixact_*_age parameters.  One way
could be to use the minimum of the value calculated based on member
space and the value specified by the user for the multixact-related parameters,
as suggested in points 1 and 2 (below in this mail).

One more thing: I think the current calculation considers member
usage; shouldn't we try to consider offset usage as well?


> (whereas when we used a scaled down
> autovaccum_multixact_freeze_max_age, we usually didn't select any
> tables at all until we scaled it down a lot, ie until we got close to
> dangerous_member_count).  Finally, I wanted a special value like -1
> for 'none' so that table_recheck_autovac and ExecVacuum could use a
> simple test >= 0 to know that they also need to set
> multixact_freeze_min_age to zero in the case of a
> member-space-triggered vacuum, so that we get maximum benefit from our
> table scans by freezing all relevant tuples, not just some older ones
>

I think setting multixact_freeze_min_age to zero could be too aggressive
for I/O.  Yes, with this you can get the maximum benefit, but at the cost of
increased I/O.  How would you justify setting it to zero as appropriate
w.r.t. the increased I/O?

Few more observations:

1.
@@ -2687,6 +2796,10 @@ relation_needs_vacanalyze(Oid relid,
  ? Min(relopts->multixact_freeze_max_age, autovacuum_multixact_freeze_max_age)
  : autovacuum_multixact_freeze_max_age;

+ /* Special settings if we are running out of member address space. */
+ if (max_multixact_age_to_avoid_member_wrap >= 0)
+ multixact_freeze_max_age = max_multixact_age_to_avoid_member_wrap;
+

Isn't it better to use the minimum of the already computed
multixact_freeze_max_age and max_multixact_age_to_avoid_member_wrap?

multixact_freeze_max_age = Min(multixact_freeze_max_age,
                               max_multixact_age_to_avoid_member_wrap);

Similar change needs to be done in table_recheck_autovac()

2.
@@ -1118,7 +1197,12 @@ do_start_worker(void)

  /* Also determine the oldest datminmxid we will consider. */
  recentMulti = ReadNextMultiXactId();
- multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;
+ max_multixact_age_to_avoid_member_wrap =
+ compute_max_multixact_age_to_avoid_member_wrap(false);
+ if (max_multixact_age_to_avoid_member_wrap >= 0)
+ multiForceLimit = recentMulti - max_multixact_age_to_avoid_member_wrap;
+ else
+ multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;

Here also, isn't it better to use the minimum of
autovacuum_multixact_freeze_max_age
and max_multixact_age_to_avoid_member_wrap?

3.
+int
+compute_max_multixact_age_to_avoid_member_wrap(bool manual)
+{
+ MultiXactOffset members;
+ uint32 multixacts;
+ double fraction;
+ MultiXactOffset safe_member_count = MaxMultiXactOffset / 2;

It is not completely clear what the more appropriate value
for safe_member_count is (25% or 50%).  Does anybody else have an
opinion on this value?

4. Once we settle on the final algorithm, we should update the
docs as well, probably in the description at the link below:
http://www.postgresql.org/docs/devel/static/routine-vacuuming.html#VACUUM-FOR-MULTIXACT-WRAPAROUND

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, May 4, 2015 at 11:49 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Sun, May 3, 2015 at 4:40 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Based on this data, it seems that truncation of member space
>> as well as advancement of oldest multixact id happens once
>> it reaches 50% usage and at that time segments drops down to almost
>> zero.  This happens repeatedly after 1 hour and in-between there
>> is no progress which indicates that all the work happens at
>> one go rather than in spreaded way. Won't this choke the system
>> when it happens due to I/O, isn't it better if we design it in a way such
>> that it is spreaded over period of time rather than doing everything at
>> one go?
>
> At 50% usage it starts vacuuming tables that have the very oldest mxid
> in the system.  That is, at 50% usage, we try to select the smallest
> non-zero fraction of tables that can be selected based on relminmxid
> alone, with any luck exactly one or close to it.  As usage increases,
> we decrease the cutoff age slowly so we start selecting tables with
> slightly newer relminmxid values for vacuuming too.  Only if it
> reaches 75% usage will it vacuum everything it can and eat all your
> IO.  In a real system, I suppose there would be lots of big tables
> that need to be vacuumed, the vacuuming would take some time, and they
> would tend to have different relminmxids so that vacuuming would be
> effectively spread out by this selection algorithm.  I think.  Perhaps
> we should devise a test procedure to try to see if that happens.  We'd
> need to create a bunch of big tables and modify monitor.sh to show the
> relminmxid for each of them so that you could see when they are being
> processed -- I will look into that.

After initdb, I did:

postgres=# create table foo as select 42 as id;
SELECT 1
postgres=# create table cat_a as select generate_series(1, 10000000);
SELECT 10000000
postgres=# create table cat_b as select generate_series(1, 10000000);
SELECT 10000000
postgres=# create table cat_c as select generate_series(1, 10000000);
SELECT 10000000
postgres=# create table cat_d as select generate_series(1, 10000000);
SELECT 10000000

Then I started monitor.sh (new version attached) and started
explode_mxact_members.c and recorded the attached file
monitor-output.txt.  The result is not great: since relminmxid starts
at 1 for all tables created in a brand new database, 3
(autovacuum_max_workers) vacuums started around the same time, and
later they didn't really seem to be diverging; a bit under an hour
later the same three were triggered again, and so on.

I have also attached monitor-output-fail.txt, a case where the
database size is too large to be vacuumed fast enough for that rate of
member space consumption.  It starts vacuuming with 3 workers at 50%,
and hasn't finished by 100%, so the new error is raised and no more
multixacts can be created.  Eventually the vacuuming completes and
progress can be made.  The database was set up as above, but the
tables have 10 times more rows.

> Restricting ourselves to selecting tables to vacuum using their
> relminmxid alone makes this patch small since autovacuum already works
> that way.  We *could* introduce code that would be able to spread out
> the work of vacuuming tables that happen to have identical or very
> close relminmxid (say by introducing some non-determinism or doing
> something weird based on hashing table oids and the time to explicitly
> spread the start of processing over time, or <your idea here>), but I
> didn't want to propose anything too big/complicated/clever/stupid and
> I suspect that the relminmxid values will tend to diverge over time
> (but I could be wrong about that, if they all start at 1 and then move
> forward in lockstep over long periods of time then what I propose is
> not good enough... let's see if we can find out).

Maybe we do need to consider something more radical, since tables that
have the same relminmxid are fairly easy to produce (for example by
restoring a database dump into a new database).  I will try to do
something that adds some noise to the signal to deal with this case
(off the top of my head, something along the lines of time_in_minutes %
16 == table_oid % 16..., or adjust the multixact cutoff age with a
recipe including table_oid % something, or ... not sure yet).  If
anyone has any better ideas, I am all ears.
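
(Just to make the "noise" idea concrete, here is a purely speculative
sketch; none of this is in any patch, and the function name and slot count
are made up for illustration:)

#include <time.h>

/* Speculative: only let a table become eligible for a member-space
 * wraparound vacuum in "its" time slot, to stagger tables that share the
 * same relminmxid (relid stands in for the table OID). */
static int
table_is_in_current_slot(unsigned int relid)
{
    int minutes = (int) (time(NULL) / 60);

    return (unsigned int) (minutes % 16) == (relid % 16);
}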

--
Thomas Munro
http://www.enterprisedb.com

Attachment
On Mon, May 4, 2015 at 6:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> [...]
> One more thing, I think the current calculation considers members
> usage, shouldn't we try to consider offset usage as well?

Offsets are indexed by multixact ID:

#define MultiXactIdToOffsetPage(xid) \
        ((xid) / (MultiXactOffset) MULTIXACT_OFFSETS_PER_PAGE)
#define MultiXactIdToOffsetEntry(xid) \
        ((xid) % (MultiXactOffset) MULTIXACT_OFFSETS_PER_PAGE)

The existing multixact wraparound prevention code is already managing
the 32 bit multixact ID space.  The problem with members comes about
because each one of those multixact IDs can have arbitrary numbers of
members, and yet the members are also addressed with a 32 bit index.
So we are trying to hijack the multixact ID wraparound prevention and
make it more aggressive if member space appears to be running out.
(Perhaps in future there should be a 64 bit index for member indexes
so that this problem disappears?)
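
(To spell out the difference: member usage is measured on its own circular
32-bit space, roughly as in the sketch below; the names and signature are
illustrative rather than taken from the patch:)

#include <stdint.h>

/* Fraction of the 2^32 member address space currently in use: the unsigned
 * distance from the oldest member offset still needed to the next offset to
 * be assigned.  Unsigned subtraction handles wraparound of the circular
 * space. */
static double
member_space_usage_fraction(uint32_t oldest_offset, uint32_t next_offset)
{
    uint32_t used = next_offset - oldest_offset;

    return (double) used / 4294967296.0;    /* 2^32 */
}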

>> (whereas when we used a scaled down
>> autovaccum_multixact_freeze_max_age, we usually didn't select any
>> tables at all until we scaled it down a lot, ie until we got close to
>> dangerous_member_count).  Finally, I wanted a special value like -1
>> for 'none' so that table_recheck_autovac and ExecVacuum could use a
>> simple test >= 0 to know that they also need to set
>> multixact_freeze_min_age to zero in the case of a
>> member-space-triggered vacuum, so that we get maximum benefit from our
>> table scans by freezing all relevant tuples, not just some older ones
>>
>
> I think setting multixact_freeze_min_age to zero could be too aggresive
> for I/O.  Yes with this you can get maximum benefit, but at cost of
> increased I/O.  How would you justify setting it to zero as appropriate
> w.r.t increased I/O?

I assumed that if you were already vacuuming all your tables to avoid
running out of member space, you would want to freeze any tuples you
possibly could to defer the next wraparound scan for as long as
possible, since wraparound scans are enormously expensive.

> Few more observations:
>
> 1.
> @@ -2687,6 +2796,10 @@ relation_needs_vacanalyze(Oid relid,
>   ? Min(relopts-
>>multixact_freeze_max_age, autovacuum_multixact_freeze_max_age)
>   :
> autovacuum_multixact_freeze_max_age;
>
> + /* Special settings if we are running out of member address space.
> */
> + if (max_multixact_age_to_avoid_member_wrap >= 0)
> + multixact_freeze_max_age =
> max_multixact_age_to_avoid_member_wrap;
> +
>
> Isn't it better to use minimum to already computed value of
> multixact_freeze_max_age and max_multixact_age_to_avoid_member_wrap?
>
> multixact_freeze_max_age = Min(multixact_freeze_max_age,
> max_multixact_age_to_avoid_member_wrap);

Except that I am using -1 as a special value.  But you're right, I
guess it should be like this:

if (max_multixact_age_to_avoid_member_wrap >= 0)
    multixact_freeze_max_age = Min(multixact_freeze_max_age,
                                   max_multixact_age_to_avoid_member_wrap);

> Similar change needs to be done in table_recheck_autovac()
>
> 2.
> @@ -1118,7 +1197,12 @@ do_start_worker(void)
>
>   /* Also determine the oldest datminmxid we will consider. */
>   recentMulti = ReadNextMultiXactId();
> - multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;
> + max_multixact_age_to_avoid_member_wrap =
> + compute_max_multixact_age_to_avoid_member_wrap(false);
> + if (max_multixact_age_to_avoid_member_wrap >= 0)
> + multiForceLimit = recentMulti - max_multixact_age_to_avoid_member_wrap;
> + else
> + multiForceLimit = recentMulti - autovacuum_multixact_freeze_max_age;
>
> Here also, isn't it better to use minimum of
> autovacuum_multixact_freeze_max_age
> and max_multixact_age_to_avoid_member_wrap.

Yeah, with the same proviso about -1.

> 3.
> +int
> +compute_max_multixact_age_to_avoid_member_wrap(bool manual)
> +{
> + MultiXactOffset members;
> + uint32 multixacts;
> + double fraction;
> + MultiXactOffset safe_member_count = MaxMultiXactOffset / 2;
>
> It is not completely clear what is more appropriate value
> for safe_member_count (25% or 50%).  Anybody else have any
> opinion on this value?
>
> 4. Once we conclude on final algorithm, we should update the
> same in docs as well, probably in description at below link:
> http://www.postgresql.org/docs/devel/static/routine-vacuuming.html#VACUUM-FOR-MULTIXACT-WRAPAROUND

Agreed.

--
Thomas Munro
http://www.enterprisedb.com
On Mon, May 4, 2015 at 12:42 PM, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:
>
> On Mon, May 4, 2015 at 6:25 PM, Amit Kapila <amit.kapila16@gmail.com>
wrote:
> > [...]
> > One more thing, I think the current calculation considers members
> > usage, shouldn't we try to consider offset usage as well?
>
> Offsets are indexed by multixact ID:
>
> #define MultiXactIdToOffsetPage(xid) \
>         ((xid) / (MultiXactOffset) MULTIXACT_OFFSETS_PER_PAGE)
> #define MultiXactIdToOffsetEntry(xid) \
>         ((xid) % (MultiXactOffset) MULTIXACT_OFFSETS_PER_PAGE)
>
> The existing multixact wraparound prevention code is already managing
> the 32 bit multixact ID space.  The problem with members comes about
> because each one of those multixact IDs can have arbitrary numbers of
> members, and yet the members are also addressed with a 32 bit index.
> So we are trying to hijack the multixact ID wraparound prevention and
> make it more aggressive if member space appears to be running out.
> (Perhaps in future there should be a 64 bit index for member indexes
> so that this problem disappears?)
>

Okay, that makes sense.

> >> (whereas when we used a scaled down
> >> autovaccum_multixact_freeze_max_age, we usually didn't select any
> >> tables at all until we scaled it down a lot, ie until we got close to
> >> dangerous_member_count).  Finally, I wanted a special value like -1
> >> for 'none' so that table_recheck_autovac and ExecVacuum could use a
> >> simple test >= 0 to know that they also need to set
> >> multixact_freeze_min_age to zero in the case of a
> >> member-space-triggered vacuum, so that we get maximum benefit from our
> >> table scans by freezing all relevant tuples, not just some older ones
> >>
> >
> > I think setting multixact_freeze_min_age to zero could be too aggressive
> > for I/O.  Yes with this you can get maximum benefit, but at cost of
> > increased I/O.  How would you justify setting it to zero as appropriate
> > w.r.t increased I/O?
>
> I assumed that if you were already vacuuming all your tables to avoid
> running out of member space,

I think here you mean all tables that have relminmxid less than the
newly computed age (compute_max_multixact_age_to_avoid_member_wrap).

> you would want to freeze any tuples you
> possibly could to defer the next wraparound scan for as long as
> possible, since wraparound scans are enormously expensive.
>

The point is valid to an extent, but if we go by this logic, then we
should currently also set multixact_freeze_min_age to zero for
wraparound vacuums.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, May 2, 2015 at 2:16 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Here's a new version which sets up the multixact parameters in
> ExecVacuum for regular VACUUM commands just like it does for
> autovacuum if needed.  When computing
> max_multixact_age_to_avoid_member_wrap for a manual vacuum, it uses
> lower constants, so that any manually scheduled vacuums get a chance
> to deal with some of this problem before autovacuum has to.  Here are
> the arbitrary constants currently used:  at 50% member address space
> usage, autovacuum starts wraparound scan of tables with the oldest
> active multixacts, and then younger ones as the usage increases, until
> at 75% usage it vacuums with multixact_freeze_table_age = 0; for
> manual VACUUM those numbers are halved so that it has a good head
> start.

I think the 75% threshold for reducing multixact_freeze_table_age to
zero is fine, but I don't agree with the 50% cutoff.  The purpose of
autovacuum_multixact_freeze_max_age is to control the fraction of the
2^32-entry offset space that can be consumed before we begin viewing
the problem as urgent.  We have a setting for that because it needs to
be tunable, and the default value for that setting is 400 million,
which is roughly 10% of the members space.  That is a whole lot lower
than the 50% threshold you are proposing here.  Moreover, it leaves
the user with no meaningful choice: if the 50% threshold consumes too
much disk space, or doesn't leave enough room before we hit the wall,
then the user is simply hosed.  This is why I initially proposed that
the member-space-consumption-percentage at which we start derating
multixact_freeze_table_age should be based on
autovacuum_multixact_freeze_max_age/2^32.  That way,
autovacuum_multixact_freeze_max_age controls not only how aggressively
we try to reclaim offset space but also how aggressively we try to
reclaim member space.  The user can then tune the value, and the
default is the same in both cases.
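
In code terms, the starting point I have in mind is roughly this (a
standalone sketch with a made-up function name, not anything from the
patch):

#include <stdio.h>

/*
 * Sketch only; derate_start_fraction is a made-up name, not patch code.
 * The member-space fraction at which we would start derating simply
 * mirrors the offset-space fraction implied by the GUC.
 */
static double
derate_start_fraction(int freeze_max_age)   /* autovacuum_multixact_freeze_max_age */
{
    return (double) freeze_max_age / 4294967296.0;      /* 2^32 */
}

int
main(void)
{
    printf("%.3f\n", derate_start_fraction(400000000));  /* prints 0.093, i.e. ~10% */
    return 0;
}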

I also think that halving the numbers for manual vacuums is arbitrary
and unprecedented.  The thought process isn't bad, but an autovacuum
currently behaves in most respects like a manual vacuum, and I'm
reluctant to make those more different.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, May 2, 2015 at 7:22 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> 3.  When the autovacuum runs, it advances oldest_mxid by different
>> amounts each time; that's because I'm using the adjusted freeze max
>> age (the max age of a table before it gets a wraparound vacuum) as our
>> freeze min age (the max age for individual tuples before they're
>> frozen) here:
>>
>> @@ -1931,7 +1964,9 @@ do_autovacuum(void)
>>   {
>>   default_freeze_min_age = vacuum_freeze_min_age;
>>   default_freeze_table_age = vacuum_freeze_table_age;
>> - default_multixact_freeze_min_age = vacuum_multixact_freeze_min_age;
>> + default_multixact_freeze_min_age =
>> + Min(vacuum_multixact_freeze_min_age,
>> + autovacuum_multixact_freeze_max_age_adjusted());
>>   default_multixact_freeze_table_age = vacuum_multixact_freeze_table_age;
>>   }
>>
>> Without that change, autovacuum would trigger repeatedly as we got
>> near 75% SLRU usage but not freeze anything, because
>> default_multixact_freeze_min_age was higher than the age of any tuples
>> (which had only made it to an age of around ~12 million; actually it's
>> not exactly the tuple age per se... I don't fully understand the
>> treatment of locker and updater multixact IDs in the vacuum code,
>> HeapTupleSatisfiesVacuum and heap_freeze_tuple etc yet so I'm not sure
>> exactly how that value translates into vacuum work, but I can see
>> experimentally that a low multixact freeze min age is needed to get
>> relminmxid moved forward).
>>
>> It's good that freeze table age ramps down so that the autovacuum
>> launcher trigger point jumps around a bit and we spread the autovacuum
>> launches over time, but it's not great that we finish up truncating
>> different amounts of multixacts and associated SLRU each time.  We
>> could instead use a freeze min age of 0 to force freezing of *all*
>> tuples if this is a member-space-wraparound-prevention vacuum (that
>> is, if autovacuum_multixact_freeze_max_age !=
>> autovacuum_multixact_freeze_max_age_adjusted()).
>
> We already set vacuum_multixact_freeze_min_age to half of
> autovacuum_multixact_freeze_max_age so that autovacuums to
> prevent MultiXact wraparound won't occur too frequently as per below
> code:
>
> vacuum_set_xid_limits()
> {
> ..
> mxid_freezemin = Min(mxid_freezemin,
>
> autovacuum_multixact_freeze_max_age / 2);
> Assert(mxid_freezemin >= 0);
> ..
> }
>
> Now if we set it to zero, then I think it might lead to excessive
> freezing and inturn more I/O without the actual need (more space
> for multixact members)

That point is certainly worthy of some consideration.  Letting the
freeze min age get set to half of the (effective)
autovacuum_multixact_freeze_max_age would certainly be more consistent
with what we do elsewhere.  The policy trade-off is not as
straightforward as you are making it out to be, though:

1. Using a min freeze age of zero will result in half as many
full-table scans, because we'll advance relminmxid twice as far each
time.

2. But each one will freeze more stuff, some of which might have been
updated again before the next freeze pass, so we might do more
freezing in total.

So either policy might win, depending on whether you care more about
reducing reads (in which case you want a very low min freeze age) or
about reducing writes (in which case you want a higher min freeze
age).

All things being equal, I'd rather stick with the existing 50% policy
in the back-branches, rather than going to zero, but I'm not sure all
things are equal.  It matters what difference the higher value makes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, May 3, 2015 at 7:49 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Restricting ourselves to selecting tables to vacuum using their
> relminmxid alone makes this patch small since autovacuum already works
> that way.  We *could* introduce code that would be able to spread out
> the work of vacuuming tables that happen to have identical or very
> close relminmxid (say by introducing some non-determinism or doing
> something weird based on hashing table oids and the time to explicitly
> spread the start of processing over time, or <your idea here>), but I
> didn't want to propose anything too big/complicated/clever/stupid and
> I suspect that the relminmxid values will tend to diverge over time
> (but I could be wrong about that, if they all start at 1 and then move
> forward in lockstep over long periods of time then what I propose is
> not good enough... let's see if we can find out).

So, the problem of everything moving in lockstep is one we already
have.  It's actually a serious operational problem for relfrozenxid,
because you might restore your database from pg_dump or similar and
every table will have a very similar relfrozenxid and so then the
anti-wraparound logic fires for all of them at the same time.  There
might be cases where MXIDs behave the same way, although I would think
it would be less common.

Anyway, solving that problem would be nice (particularly for xmin!),
but we shouldn't get into that with relation to this bug fix.  It's a
problem, but one that will probably take a good deal of work to solve,
and certainly not something we would back-patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Thomas Munro wrote:

> FWIW, in some future release, I think we should consider getting a
> bigger multixact member address space that wraps around at 2^48 or
> 2^64 instead of 2^32, so that we can sidestep the whole business and
> go back to having just xid and mxid wraparounds to worry about.
> pg_multixact/offsets would be 50% or 100% bigger (an extra byte or two
> per multixact), but it's not very big.  pg_multixact/members would be
> no bigger for any workload that currently works without hitting the
> wraparound error, but could grow bigger if needed.

Not sure that enlarging the addressable area to 48/64 bits is feasible,
TBH.  We already have many complaints that multixacts take too much disk
space; we don't want to make that 2^32 times worse, not even 2^16 times
worse.  I don't understand why you say it'd become 1 byte bigger per
multixact; it would have to be 4 more bytes (2^64) or 2 more bytes
(2^48), no?  If you have 150 million multixacts (the default freeze
table age) that would mean about 300 or 600 MB of additional disk space,
which is not insignificant: with the current system, in a database with
normal multixact usage of 4 members per multixact, members/ would use
about 2.8 GB, so 600 additional MB in offsets/ is large enough growth to
raise some more complaints.

(The 2^48 suggestion might be a tad more difficult to implement, note,
because a lot of stuff relies on unsigned integer wraparound addition,
and I'm not sure we can have that with a 2^48 counter.  Maybe we could
figure out how to make it work, but is it worth the bother?)

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Robert Haas wrote:

> Anyway, solving that problem would be nice (particularly for xmin!),
> but we shouldn't get into that with relation to this bug fix.  It's a
> problem, but one that will probably take a good deal of work to solve,
> and certainly not something we would back-patch.

+1

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Robert Haas <robertmhaas@gmail.com> wrote:

> 1. Using a min freeze age of zero will result in half as many
> full-table scans, because we'll advance relminmxid twice as far
> each time.
>
> 2. But each one will freeze more stuff, some of which might have
> been updated again before the next freeze pass, so we might do
> more freezing in total.
>
> So either policy might win, depending on whether you care more
> about reducing reads (in which case you want a very low min
> freeze age) or about reducing writes (in which case you want a
> higher min freeze age).
>
> All things being equal, I'd rather stick with the existing 50%
> policy in the back-branches, rather than going to zero, but I'm
> not sure all things are equal.  It matters what difference the
> higher value makes.

I really don't like the "honor the configured value of
vacuum_multixact_freeze_min_age until the members SLRU gets to 50%
of wraparound and then use zero" approach.  It made a lot more
sense to me to honor the configured value to 25% and decrease it in
a linear fashion until it hit zero at 75%.  It seems like maybe we
weren't aggressive enough in the dynamic adjustment of
autovacuum_multixact_freeze_max_age, but I'm not clear why fixing
that required the less gradual adjustment of the *_min_age setting.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
So I might have understood an earlier description of the proposed
solution all wrong, or this patch was designed without consideration of
that description.  What I thought would happen is that all freeze ages
would get multiplied by some factor <= 1, depending on the space used up
by members.  If member space usage is low enough, the factor would remain
at 1, so things would behave as today.  If member space usage is larger
than X, the factor decreases smoothly, and this makes freeze_min_age and
freeze_max_age decrease smoothly as well, for all vacuums equally.

For instance, we could choose a method to compute X based on considering
that a full 2^32 storage area for members is enough to store one
vacuum_multixact_freeze_table_age cycle of multixacts.  The default
value of this param is 150 million, and 2^32/150000000 = 28; so if your
average multixact size = 38, you would set the multiplier at 0.736 and
your effective freeze_table_age would become 110 million and effective
freeze_min_age would become 3.68 million.


As a secondary point, I find variable-names-as-documentation bad
practice.  Please don't use a long name such as
max_multixact_age_to_avoid_member_wrap; code becomes unwieldy.  A short
name such as safe_mxact_age preceded by a comment /* this variable is
the max that avoids member wrap */ seems more palatable; side-by-side
merges and all that!  I don't think long function names are as
problematic (though the name of your new function is still a bit too
long).

Please note that 9.4 and earlier do not have ExecVacuum; the
determination of freeze ages is done partly in gram.y (yuck).  Not sure
what will the patch look like in those branches.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera wrote:

> For instance, we could choose a method to compute X based on considering
> that a full 2^32 storage area for members is enough to store one
> vacuum_multixact_freeze_table_age cycle of multixacts.  The default
> value of this param is 150 million, and 2^32/150000000 = 28; so if your
> average multixact size = 38, you would set the multiplier at 0.736 and
> your effective freeze_table_age would become 110 million and effective
> freeze_min_age would become 3.68 million.

Actually, apologies --- this is not what I was thinking at all.  I got
distracted while I was writing the previous email.  My thinking was that
the values would be at their normal defaults when the wraparound is
distant, and the multiplier would start to become slightly less than 1
as the counter moves towards wraparound; by the time we're at an
emergency, i.e. we reach max_freeze_age, the values naturally become zero
(or perhaps just before we reach max_freeze_age, the values would be 50% of
their normal values, so the drop to zero is not as dramatic).  Since
this is gradual, the behavior is not as jumpy as in the proposed patch.

Anyway, this is in line with what Kevin is saying elsewhere: we shouldn't
just use the normal values all the way up to the freeze_max_age
point; there should be some gradual ramp-up.

Perhaps we can combine this with the other idea of using a multiplier
connected to average size of multixact, if it doesn't become too
complicated, surprising, or both.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, May 5, 2015 at 8:36 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Thomas Munro wrote:
>
>> FWIW, in some future release, I think we should consider getting a
>> bigger multixact member address space that wraps around at 2^48 or
>> 2^64 instead of 2^32, so that we can sidestep the whole business and
>> go back to having just xid and mxid wraparounds to worry about.
>> pg_multixact/offsets would be 50% or 100% bigger (an extra byte or two
>> per multixact), but it's not very big.  pg_multixact/members would be
>> no bigger for any workload that currently works without hitting the
>> wraparound error, but could grow bigger if needed.
>
> Not sure that enlarging the addressable area to 48/64 bits is feasible,
> TBH.  We already have many complaints that multixacts take too much disk
> space; we don't want to make that 2^32 times worse, not even 2^16 times
> worse.  I don't understand why you say it'd become 1 byte bigger per
> multixact; it would have to be 4 more bytes (2^64) or 2 more bytes
> (2^48), no?  If you have 150 million multixacts (the default freeze
> table age) that would mean about 300 or 600 MB of additional disk space,
> which is not insignificant: with the current system, in a database with
> normal multixact usage of 4 members per multixact, members/ would use
> about 2.8 GB, so 600 additional MB in offsets/ is large enough growth to
> raise some more complaints.

Right, sorry, I must have been thinking of 40 bit or 48 bit indexes
when I said 1 or 2 bytes.

I can't help thinking there must be a different way to do this that
takes advantage of the fact that multixacts are often created by
copying all the members of an existing multixact and adding one new
one, so that there is a lot of duplication and churn (at least when
you have a workload that generates bigger multixacts, due to the
O(n^2) process of building them up xid by xid).

Maybe there is a way to store a pointer to some other multixact + a
new xid in a chain structure, but I don't see how to do the cleanup
when you have active multixacts with backwards references to older
multixacts.

Maybe you could find some way to leave gaps in member space (perhaps
by making the member index point to member groups with space for 4 or 8
member xids), and MultiXactIdExpand could create new multixacts that
point to the same member offset but a different size so they see extra
members, but that would also waste disk space, be hard to synchronize
and you'd need to fall back to copying the members into new member
space when the spare space is filled anyway.
Thomas Munro wrote:

> I can't help thinking there must be a different way to do this that
> takes advantage of the fact that multixacts are often created by
> copying all the members of an existing multixact and adding one new
> one, so that there is a lot of duplication and churn (at least when
> you have a workload that generates bigger multixacts, due to the
> O(n^2) process of building them up xid by xid).

Yeah, Simon expressed the same thought to me some months ago, and I gave
it some think-time (but not a lot of it TBH).  I didn't see any way to
make it workable.

Normally, lockers go away reasonably quickly, so some of the original
members of the multixact are disappearing all the time.  Maybe one way
would be to re-use a multixact you have in your local cache, as long as
the only difference with the multixact you want is some locker
transaction(s) that have already ended.  Not sure how you would manage
the cache, though.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, May 5, 2015 at 2:29 AM, Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:
>
>
> Please note that 9.4 and earlier do not have ExecVacuum; the
> determination of freeze ages is done partly in gram.y (yuck).  Not sure
> what will the patch look like in those branches.
>

One way to make the fix back-patchable is to consider doing the changes
for Vacuum and AutoVacuum in one common path (vacuum_set_xid_limits()).
However, I think we might need to distinguish whether the call is from
the Vacuum or AutoVacuum path.



With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Here's a new patch, with responses to several reviews.

On Tue, May 5, 2015 at 8:41 AM, Kevin Grittner <kgrittn@ymail.com> wrote:
> I really don't like the "honor the configured value of
> vacuum_multixact_freeze_min_age until the members SLRU gets to 50%
> of wraparound and then use zero" approach.  It made a lot more
> sense to me to honor the configured value to 25% and decrease it in
> a linear fashion until it hit zero at 75%.  It seems like maybe we
> weren't aggressive enough in the dynamic adjustment of
> autovacuum_multixact_freeze_max_age, but I'm not clear why fixing
> that required the less gradual adjustment of the *_min_age setting.

Ok, the new patch uses 25% as the safe threshold, and then scales
multixact_freeze_table_age down from the current number of active
multixacts (ie to select the minimum number of tables) progressively
to 0 (to select all tables) when you reach 75% usage.  It also sets
multixact_freeze_min_age to half of the value it used for
multixact_freeze_table_age, so that reduces progressively too.  But see my
response to Robert below for discussion (question) about how to make
the safe threshold depend on autovacuum_multixact_freeze_max_age.
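
For concreteness, the shape of the ramp is roughly this (a standalone
sketch, not the patch code; the names and the -1 convention are only
for illustration):

#include <stdint.h>
#include <stdio.h>

/*
 * Sketch only.  Given the fraction of the 2^32 member address space in
 * use and the current number of active multixacts, return the effective
 * multixact_freeze_table_age, or -1 meaning "leave the configured value
 * alone".  The patch then uses half of the returned value as
 * multixact_freeze_min_age.
 */
static int64_t
ramped_freeze_table_age(double member_fraction, uint32_t active_multixacts)
{
    const double safe = 0.25;        /* below this, no override */
    const double dangerous = 0.75;   /* at or above this, select all tables */
    double       ramp;

    if (member_fraction <= safe)
        return -1;
    if (member_fraction >= dangerous)
        return 0;

    /* scale linearly from "oldest tables only" down to "all tables" */
    ramp = 1.0 - (member_fraction - safe) / (dangerous - safe);
    return (int64_t) (active_multixacts * ramp);
}

int
main(void)
{
    /* e.g. at 50% member usage with 12 million live multixacts the
     * effective table age is about 6 million */
    printf("%lld\n", (long long) ramped_freeze_table_age(0.50, 12000000));
    return 0;
}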

On Tue, May 5, 2015 at 8:37 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Robert Haas wrote:
>> Anyway, solving that problem would be nice (particularly for xmin!),
>> but we shouldn't get into that with relation to this bug fix.  It's a
>> problem, but one that will probably take a good deal of work to solve,
>> and certainly not something we would back-patch.
>
> +1

OK, not attempting to do anything about that.  If your database has a
bunch of tables with the same relminmxid (for example after a recent
restore from pg_dump), you're going to get vacuums starting at the
same time.  If they're more scrambled up, you'll get your vacuum work
spread out better; that is just as it is for normal mxid
wraparound prevention, so it is OK for the purposes of this patch.

On Tue, May 5, 2015 at 6:46 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think the 75% threshold for reducing multxact_freeze_table_age to
> zero is fine, but I don't agree with the 50% cutoff.  The purpose of
> autovacuum_multixact_freeze_max_age is to control the fraction of the
> 2^32-entry offset space that can be consumed before we begin viewing
> the problem as urgent.  We have a setting for that because it needs to
> be tunable, and the default value for that setting is 400 million,
> which is roughly 10% of the members space.  That is a whole lot lower
> than the 50% threshold you are proposing here.  Moreover, it leaves
> the user with no meaningful choice: if the 50% threshold consumes too
> much disk space, or doesn't leave enough room before we hit the wall,
> then the user is simply hosed.  This is why I initially proposed that
> the member-space-consumption-percentage at which we start derating
> multixact_freeze_table_age should be based on
> autovacuum_multixact_freeze_max_age/2^32.  That way,
> autovacuum_multixact_freeze_max_age controls not only how aggressively
> we try to reclaim offset space but also how aggressively we try to
> reclaim member space.  The user can then tune the value, and the
> default is the same in both cases.

Ok, so if you have autovacuum_freeze_max_age = 400 million multixacts
before wraparound vacuum, which is ~10% of 2^32, we would interpret
that to mean 400 million multixacts OR ~10% * some_constant of member
space, in other words autovacuum_freeze_max_age * some_constant
members, whichever comes first.  But what should some_constant be?  If
it's 1, then we are only allowed 400 million members for our 400
million multixacts, which is not enough, if I am right in assuming
that multixacts (nearly?) always have 2 or more members: a setting of
400 million would trigger wraparound at 200 million multixacts (or
less) for *everyone* and trigger wraparound scans for some users as
soon as they upgrade.  If it's 2, then anyone exceeding an average of
2 members would get their effective autovacuum_freeze_max_age lowered.
If it's 8, then people using the default max age of 400 million will
hit the 75% usage threshold that we have decided should trigger
freeze-everything-mode just as they hit the cutoff at which we start
doing anything, and for people who have raised the default value it
would be after (ahh, I might have the tests the wrong way around for
this scenario in MultiXactCheckMemberUsage() in the current patch...).
What do we know about expected multixact sizes in the wild, and what
sort of values people tend to set their
autovacuum_multixact_freeze_max_age to?  Does anyone have a principled
suggestion for some_constant?
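
For reference, the arithmetic behind those cases (a throwaway check,
nothing more):

#include <stdio.h>

int
main(void)
{
    /* where each candidate some_constant would put the member-space
     * trigger, with the default autovacuum_multixact_freeze_max_age
     * of 400 million */
    double guc = 400000000.0;
    double member_space = 4294967296.0;   /* 2^32 */
    int    some_constant;

    for (some_constant = 1; some_constant <= 8; some_constant *= 2)
        printf("some_constant = %d -> trigger at %.1f%% of member space\n",
               some_constant, 100.0 * guc * some_constant / member_space);
    /* prints ~9.3%, ~18.6%, ~37.3%, ~74.5% */
    return 0;
}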

> I also think that halving the numbers for manual vacuums is arbitrary
> and unprecedented.  The thought process isn't bad, but an autovacuum
> currently behaves in most respects like a manual vacuum, and I'm
> reluctant to make those more different.

Ok, removed.

On Tue, May 5, 2015 at 9:30 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> the values would be at their normal defaults when the wraparound is
> distant, and the multiplier would start to become slightly less than 1
> as the counter moves towards wraparound; by the time we're at an
> emergency, i.e. we reach max_freeze_age, the values naturally become zero
> (or perhaps just before we reach max_freeze_age, the values would be 50% of
> their normal values, so the drop to zero is not as dramatic).  Since
> this is gradual, the behavior is not as jumpy as in the proposed patch.
>
> Anyway, this is in line with what Kevin is saying elsewhere: we shouldn't
> just use the normal values all the way up to the freeze_max_age
> point; there should be some gradual ramp-up.

Ok, in the new patch multixact_freeze_min_age is ramped.

> Perhaps we can combine this with the other idea of using a multiplier
> connected to average size of multixact, if it doesn't become too
> complicated, surprising, or both.

What bothers me about this is that the average multixact size may
suddenly change, for example when new application code rolls out or a
nightly job runs, leading to a sudden change in member space
consumption; you might get into trouble by trying to be too clever with
predictions instead of just using something really simple based on
thresholds.  But I might not be understanding what you meant...

On Mon, May 4, 2015 at 6:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think setting multixact_freeze_min_age to zero could be too aggressive
> for I/O.  Yes with this you can get maximum benefit, but at cost of
> increased I/O.  How would you justify setting it to zero as appropriate
> w.r.t increased I/O?

Changed back to ramped, see above.

> Isn't it better to use the minimum of the already computed value of
> multixact_freeze_max_age and max_multixact_age_to_avoid_member_wrap?

Ok, done.

On Tue, May 5, 2015 at 8:59 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> As a secondary point, I find variable-names-as-documentation bad
> practice.  Please don't use a long name such as
> max_multixact_age_to_avoid_member_wrap; code becomes unwieldy.  A short
> name such as safe_mxact_age preceded by a comment /* this variable is
> the max that avoids member wrap */ seems more palatable; side-by-side
> merges and all that!  I don't think long function names are as
> problematic (though the name of your new function is still a bit too
> long).

I changed the function name to MultiXactCheckMemberUsage and moved
into multixact.c, and changed the variable name in various places to
safe_multixact_age.  There were already a few comments explaining its
purpose.

> Please note that 9.4 and earlier do not have ExecVacuum; the
> determination of freeze ages is done partly in gram.y (yuck).  Not sure
> what will the patch look like in those branches.

Autovacuum seems to parameterize vacuum via parser statements (!) so
why not just do the same if we detect that we arrived in vacuum via a
user command, and it wasn't a VACUUM FREEZE?  In that case I think the
values arrive as -1.  Like in the attached patch against 9.3 (not
tested).

--
Thomas Munro
http://www.enterprisedb.com

Attachment
Amit Kapila wrote:
> On Tue, May 5, 2015 at 2:29 AM, Alvaro Herrera <alvherre@2ndquadrant.com>
> wrote:
> >
> >
> > Please note that 9.4 and earlier do not have ExecVacuum; the
> > determination of freeze ages is done partly in gram.y (yuck).  Not sure
> > what will the patch look like in those branches.
>
> One way to make the fix back-patchable is to consider doing the changes
> for Vacuum and AutoVacuum in one common path (vacuum_set_xid_limits()).
> However, I think we might need to distinguish whether the call is from
> the Vacuum or AutoVacuum path.

I think it's easier if we just adjust the patch in older branches to
affect the code that now lives in ExecVacuum.  Trying to make all
branches the same will probably make the whole thing more complicated,
for no real purpose.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, May 5, 2015 at 3:58 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Ok, the new patch uses 25% as the safe threshold, and then scales
> multixact_freeze_table_age down from the current number of active
> multixacts (ie to select the minimum number of tables) progressively
> to 0 (to select all tables) when you reach 75% usage.

I definitely think that 25% is better than 50%.  But see below.

> Ok, so if you have autovacuum_freeze_max_age = 400 million multixacts
> before wraparound vacuum, which is ~10% of 2^32, we would interpret
> that to mean 400 million multixacts OR ~10% * some_constant of member
> space, in other words autovacuum_freeze_max_age * some_constant
> members, whichever comes first.  But what should some_constant be?

some_constant should be all the member space there is.  So we trigger
autovac if we've used more than ~10% of the offsets OR more than ~10%
of the members.  Why is autovacuum_multixact_freeze_max_age
configurable in the first place?  It's configurable so that you can set it
low enough that wraparound scans complete and advance the minmxid
before you hit the wall, but high enough to avoid excessive scanning.
The only problem is that it only lets you configure the amount of
headroom you need for offsets, not members.  If you squint at what I'm
proposing the right way, it's essentially that that GUC should control
both of those things.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, May 6, 2015 at 9:26 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, May 5, 2015 at 3:58 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> Ok, so if you have autovacuum_freeze_max_age = 400 million multixacts
>> before wraparound vacuum, which is ~10% of 2^32, we would interpret
>> that to mean 400 million multixacts OR ~10% * some_constant of member
>> space, in other words autovacuum_freeze_max_age * some_constant
>> members, whichever comes first.  But what should some_constant be?
>
> some_constant should be all the member space there is.  So we trigger
> autovac if we've used more than ~10% of the offsets OR more than ~10%
> of the members.  Why is autovacuum_multixact_freeze_max_age
> configurable in the first place?  It's configurable so that you can set it
> low enough that wraparound scans complete and advance the minmxid
> before you hit the wall, but high enough to avoid excessive scanning.
> The only problem is that it only lets you configure the amount of
> headroom you need for offsets, not members.  If you squint at what I'm
> proposing the right way, it's essentially that that GUC should control
> both of those things.

But member space *always* grows at least twice as fast as offset space
(aka active multixact IDs), because multixacts always have at least 2
members (except in some rare cases IIUC), don't they?  So if we do
what you just said, then we'll trigger wraparound vacuums twice as
soon as we do now for everybody, even people who don't have any
problem with member space management.  We don't want this patch to
change anything for most people, let alone everyone.  So I think that
some_constant should be at least 2, if we try to do it this way, in
other words if you set the GUC for 10% of offset space, we also start
triggering wraparounds at 20% of member space.  The code in
MultiXactCheckMemberSpace would just say safe_member_count =
autovacuum_multixact_freeze_max_age * 2, where 2 is some_constant (this
number is the average number of multixact members below which your
workload will be unaffected by the new autovac behaviour).

--
Thomas Munro
http://www.enterprisedb.com
Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> On Wed, May 6, 2015 at 9:26 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, May 5, 2015 at 3:58 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>>> Ok, so if you have autovacuum_freeze_max_age = 400 million multixacts
>>> before wraparound vacuum, which is ~10% of 2^32, we would interpret
>>> that to mean 400 million multixacts OR ~10% * some_constant of member
>>> space, in other words autovacuum_freeze_max_age * some_constant
>>> members, whichever comes first.  But what should some_constant be?
>>
>> some_constant should be all the member space there is.  So we trigger
>> autovac if we've used more than ~10% of the offsets OR more than ~10%
>> of the members.  Why is autovacuum_multixact_freeze_max_age
>> configurable in the first place?  It's configurable so that you can set it
>> low enough that wraparound scans complete and advance the minmxid
>> before you hit the wall, but high enough to avoid excessive scanning.
>> The only problem is that it only lets you configure the amount of
>> headroom you need for offsets, not members.  If you squint at what I'm
>> proposing the right way, it's essentially that that GUC should control
>> both of those things.
>
> But member space *always* grows at least twice as fast as offset space
> (aka active multixact IDs), because multixacts always have at least 2
> members (except in some rare cases IIUC), don't they?  So if we do
> what you just said, then we'll trigger wraparound vacuums twice as
> soon as we do now for everybody, even people who don't have any
> problem with member space management.  We don't want this patch to
> change anything for most people, let alone everyone.

That, I think, is what has been driving this patch away from just
considering the *_multixact_* settings as applying to both the
members SLRU and the offsets SLRU; that would effectively simply
change the monitored resource from one to the other.  (We would
probably want to actually use the max of the two, just to be safe,
but then offsets might never actually be the trigger.)  As Thomas
says, that would be a big change for everyone, and not everyone
necessarily *wants* their existing settings to have new and
different meanings.

> So I think that
> some_constant should be at least 2, if we try to do it this way, in
> other words if you set the GUC for 10% of offset space, we also start
> triggering wraparounds at 20% of member space.

But what if they configure it to start at 80% (which I *have* seen
people do)?

The early patches were a heuristic to attempt to allow current
behavior for those not getting into trouble, and gradually ramp up
aggressiveness as needed to prevent hitting the hard ERROR that now
prevents wraparound.  Perhaps, rather than reducing the threshold
gradually, as the members SLRU approaches wraparound we could
gradually shift from using offsets to members as the number we
compare the thresholds to.  Up to 25% of maximum members, or if
offset is somehow larger, we just use offsets; else above 75%
maximum members we use members; else we use a weighted average
based on how far we are between 25% and 75%.  It's kinda weird, but
I think it gives us a reasonable way to ramp up vacuum
aggressiveness from what we currently do toward what Robert
proposed based on whether the workload is causing things to head
for trouble.
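
Very roughly, something like this (an illustrative sketch only; the
names and exact arithmetic are mine, not from any patch):

#include <stdio.h>

/*
 * Sketch only.  Blend the count of offsets (multixact IDs) consumed and
 * the count of members consumed into one number that gets compared
 * against the existing age thresholds.
 */
static double
blended_usage(double offsets_used, double members_used)
{
    const double member_space = 4294967296.0;          /* 2^32 */
    double       frac = members_used / member_space;   /* member space in use */
    double       weight;

    if (frac <= 0.25 || offsets_used >= members_used)
        return offsets_used;            /* behave as today */
    if (frac >= 0.75)
        return members_used;            /* members drive the decision */

    /* between 25% and 75%, shift smoothly from offsets to members */
    weight = (frac - 0.25) / 0.50;
    return (1.0 - weight) * offsets_used + weight * members_used;
}

int
main(void)
{
    /* e.g. 100 million multixacts and 1.5 billion members (~35% of 2^32) */
    printf("%.0f\n", blended_usage(100e6, 1.5e9));
    return 0;
}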

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, May 6, 2015 at 9:46 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Wed, May 6, 2015 at 9:26 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, May 5, 2015 at 3:58 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>>> Ok, so if you have autovacuum_freeze_max_age = 400 million multixacts
>>> before wraparound vacuum, which is ~10% of 2^32, we would interpret
>>> that to mean 400 million multixacts OR ~10% * some_constant of member
>>> space, in other words autovacuum_freeze_max_age * some_constant
>>> members, whichever comes first.  But what should some_constant be?
>>
>> some_constant should be all the member space there is.  So we trigger
>> autovac if we've used more than ~10% of the offsets OR more than ~10%
>> of the members.  Why is autovacuum_multixact_freeze_max_age
>> configurable in the first place?  It's configurable so that you can set it
>> low enough that wraparound scans complete and advance the minmxid
>> before you hit the wall, but high enough to avoid excessive scanning.
>> The only problem is that it only lets you configure the amount of
>> headroom you need for offsets, not members.  If you squint at what I'm
>> proposing the right way, it's essentially that that GUC should control
>> both of those things.
>
> But member space *always* grows at least twice as fast as offset space
> (aka active multixact IDs), because multixacts always have at least 2
> members (except in some rare cases IIUC), don't they?  So if we do
> what you just said, then we'll trigger wraparound vacuums twice as
> soon as we do now for everybody, even people who don't have any
> problem with member space management.  We don't want this patch to
> change anything for most people, let alone everyone.  So I think that
> some_constant should be at least 2, if we try to do it this way, in
> other words if you set the GUC for 10% of offset space, we also start
> triggering wraparounds at 20% of member space.  The code in
> MultiXactCheckMemberSpace would just say safe_member_count =
> autovacuum_multixact_freeze_max_age * 2, where 2 is some_constant (this
> number is the average number of multixact members below which your
> workload will be unaffected by the new autovac behaviour).

Here is a version of the patch that uses the GUC to control where
safe_member_count starts as you suggested.  But it multiplies it by a
scaling factor that I'm calling avg_multixact_size_threshold.  The key
point is this: if your average multixact size is below this number,
you will see no change in autovacuum behaviour from this patch,
because normal wraparound vacuums will occur before you ever exceed
safe_member_count, but if your average multixact size is above this
number, you'll see some extra wraparound vacuuming.

As for its value, I start the bidding at 3.  Unless I misunderstood,
your proposal amounted to using a value of 1, but I think that is too
small, because surely *everyone's* average multixact size is above 1.
We don't want to change the autovacuum behaviour for everyone.  I
figure the value should be a smallish number of at least 2, and I
speculate that multixacts might have one of those kinds of
distributions where there are lots of 2s, a few 3s, hardly any 4s etc,
so that the average would be somewhere not far above 2.  3 seems to
fit that description.  I could be completely wrong about that, but
even so, with the default GUC setting of 400 million, 3 gives you
(400000000 * 3) / 2^32 = ~28%, pretty close to the 25% that people
seemed to like when we were talking about a fixed constant.  Higher
numbers could work but would make us less aggressive and increase the
wraparound risk, and 8 would definitely be too high, because then
safe_member_count crashes into dangerous_member_count with the default
GUC setting, so our ramping is effectively disabled giving us a panic
mode/cliff.
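
In code form, the scaling amounts to roughly this (sketch only; the
real patch code may differ in detail):

#include <stdint.h>
#include <stdio.h>

#define AVG_MULTIXACT_SIZE_THRESHOLD 3

/*
 * Sketch only: derive safe_member_count from the freeze max age GUC.
 * Workloads whose average multixact size stays below the threshold
 * should reach a normal wraparound vacuum before ever crossing it.
 */
static uint32_t
safe_member_count(uint32_t multixact_freeze_max_age)
{
    uint64_t count = (uint64_t) multixact_freeze_max_age * AVG_MULTIXACT_SIZE_THRESHOLD;

    /* clamp (my addition): very high GUC settings would otherwise overflow */
    return (count > UINT32_MAX) ? UINT32_MAX : (uint32_t) count;
}

int
main(void)
{
    /* default GUC of 400 million -> 1.2 billion members, ~28% of 2^32 */
    printf("%u\n", safe_member_count(400000000));
    return 0;
}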

The new patch also tests the dangerous level *before* the safe level,
so that our 75% threshold is still triggered even if you set the GUC
insanely high so that safe_member_count finishes up higher than
dangerous_member_count.

BTW I think you can estimate the average number of members per
multixact in a real database like this:

number of members = number of member segment files * 1636 * 32
over
number of multixacts = number of offsets segment files * 2048 * 32

Those numbers are from these macros with a default page size:

MULTIXACT_MEMBERS_PER_PAGE = 1636
MULTIXACT_OFFSETS_PER_PAGE = 2048
SLRU_PAGES_PER_SEGMENT = 32
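
For example, as a toy calculation with made-up file counts:

#include <stdio.h>

int
main(void)
{
    /* illustrative numbers only: count the files yourself */
    long member_segments = 1000;   /* files in pg_multixact/members */
    long offset_segments = 400;    /* files in pg_multixact/offsets */

    double members = member_segments * 1636.0 * 32.0;
    double multixacts = offset_segments * 2048.0 * 32.0;

    printf("average members per multixact: %.2f\n", members / multixacts);
    /* these counts give roughly 2.0 members per multixact */
    return 0;
}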

--
Thomas Munro
http://www.enterprisedb.com

Attachment
On Tue, May 5, 2015 at 5:46 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> But member space *always* grows at least twice as fast as offset space
> (aka active multixact IDs), because multixacts always have at least 2
> members (except in some rare cases IIUC), don't they?

Oh.  *facepalm*

All right, so maybe the way you had it is best after all.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, May 5, 2015 at 6:36 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
>> So I think that
>> some_constant should be at least 2, if we try to do it this way, in
>> other words if you set the GUC for 10% of offset space, we also start
>> triggering wraparounds at 20% of member space.
>
> But what if they configure it to start at 80% (which I *have* seen
> people do)?

I might be confused here, but the upper limit for
autovacuum_multixact_freeze_max_age is 2 billion, so I don't think
this can ever be higher than 50%. Well, 46.5%, really, since 2^32 > 4
billion.  autovacuum_freeze_max_age is similarly limited.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, May 5, 2015 at 3:58 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Here's a new patch, with responses to several reviews.

Going back to this version...

+ * Based on the assumption that there is no reasonable way for an end user to
+ * configure the thresholds for this, we define the safe member count to be
+ * half of the member address space, and the dangerous level to be

but:

+       const MultiXactOffset safe_member_count = MaxMultiXactOffset / 4;

Those don't match.  Also, we usually use #define rather than const for
constants.  I suggest we do that here, too.

+               int safe_multixact_age = MultiXactCheckMemberUsage();
+               if (safe_multixact_age >= 0)

Project style is to leave a blank line between these, I think.

I think you need to update the comments for relation_needs_vacanalyze().

The documentation in section 23.1.5.1, "Multixacts and Wraparound",
also needs updating.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, May 6, 2015 at 2:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> + * Based on the assumption that there is no reasonable way for an end user to
> + * configure the thresholds for this, we define the safe member count to be
> + * half of the member address space, and the dangerous level to be
>
> but:
>
> +       const MultiXactOffset safe_member_count = MaxMultiXactOffset / 4;
>
> Those don't match. [...]

Fixed/obsoleted in the attached patch.  It has a dynamic
safe_member_count based on scaling the GUC as described in my earlier
email with the v7 patch; the behaviour with the default GUC value
works out to a similar safe_member_count value, but this way it can be
changed if needed, and we don't introduce any new GUCs.  Also, since
the GUC used in determining safe_member_count is either
autovacuum_multixact_freeze_max_age or vacuum_multixact_freeze_max_age
depending on which kind of vacuum it is, that is now a parameter
passed into MultiXactCheckMemberUsage, so safe_member_count is no
longer a constant.

> [...] Also, we usually use #define rather than const for
> constants.  I suggest we do that here, too.

Done for DANGEROUS_MEMBER_COUNT and AVG_MULTIXACT_SIZE_THRESHOLD which
are still constants.

> +               int safe_multixact_age = MultiXactCheckMemberUsage();
> +               if (safe_multixact_age >= 0)
>
> Project style is to leave a blank line between these, I think.

Done.

> I think you need to update the comments for relation_needs_vacanalyze().

Done.

> The documentation in section 23.1.5.1, "Multixacts and Wraparound",
> also needs updating.

Done.  I also rewrote some of the commentary in
MultiXactCheckMemberUsage, to try to make the theory behind the ramp
up algorithm clearer.

I will do some more testing.  Does anyone have any feedback on the
choice of value for AVG_MULTIXACT_SIZE_THRESHOLD, or real world data
on average multixact sizes, or see more general problems?  The idea
here is that you should only see autovacuum behaviour change if your
average multixact size is more than AVG_MULTIXACT_SIZE_THRESHOLD.

--
Thomas Munro
http://www.enterprisedb.com

Attachment
On Wed, May 6, 2015 at 6:26 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Wed, May 6, 2015 at 2:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> + * Based on the assumption that there is no reasonable way for an end user to
>> + * configure the thresholds for this, we define the safe member count to be
>> + * half of the member address space, and the dangerous level to be
>>
>> but:
>>
>> +       const MultiXactOffset safe_member_count = MaxMultiXactOffset / 4;
>>
>> Those don't match. [...]
>
> Fixed/obsoleted in the attached patch.  It has a dynamic
> safe_member_count based on scaling the GUC as described in my earlier
> email with the v7 patch; the behaviour with the default GUC value
> works out to a similar safe_member_count value, but this way it can be
> changed if needed, and we don't introduce any new GUCs.  Also, since
> the GUC used in determining safe_member_count is either
> autovacuum_multixact_freeze_max_age or vacuum_multixact_freeze_max_age
> depending on which kind of vacuum it is, that is now a parameter
> passed into MultiXactCheckMemberUsage, so safe_member_count is no
> longer a constant.

To be honest, now that you've pointed out that the fraction of the
multixact members space that is in use will always be larger,
generally much larger, than the fraction of the offset space that is
in use, I've kind of lost all enthusiasm for making the
safe_member_count stuff dependent on
autovacuum_multixact_freeze_max_age.  I'm inclined to go back to 25%,
the way you had it before.

We could think about adding a new GUC in master, but I'm actually
leaning toward the view that we should just hard-code 25% for now and
consider revising it later if that proves inadequate.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, May 6, 2015 at 3:56 PM, Thomas Munro <thomas.munro@enterprisedb.com>
wrote:
>
> On Wed, May 6, 2015 at 2:29 PM, Robert Haas <robertmhaas@gmail.com> wrote:

A few comments:

1.
+ /*
+ * Override the multixact freeze settings if we are running out of
+ * member address space.
+ */
+ if (safe_multixact_age >= 0)
+ {
+ multixact_freeze_table_age = Min(safe_multixact_age,
+ multixact_freeze_table_age);

+ /* Special settings if we are running out of member address space. */
+ if (safe_multixact_age >= 0)
+ multixact_freeze_max_age = Min(multixact_freeze_max_age,
safe_multixact_age);
+


Some places use safe_multixact_age as the first parameter and some
places use it as the second.  I think it is better to use the same
order for the sake of consistency.

2.
in the hope
+ * that different tables will be vacuumed at different times due to their
+ * varying relminmxid values.

Does the above line in the comment on top of MultiXactCheckMemberUsage()
make much sense?



3.
+ * we know the age of the oldest multixact in the system, so that's the
+ * value we want to when members is near safe_member_count.  It should

typo.
so that's the value we want to *use* when ..




With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, May 6, 2015 at 6:45 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> To be honest, now that you've pointed out that the fraction of the
> multixact members space that is in use will always be larger,
> generally much larger, than the fraction of the offset space that is
> in use, I've kind of lost all enthusiasm for making the
> safe_member_count stuff dependent on
> autovacuum_multixact_freeze_max_age.  I'm inclined to go back to 25%,
> the way you had it before.
>
> We could think about adding a new GUC in master, but I'm actually
> leaning toward the view that we should just hard-code 25% for now and
> consider revising it later if that proves inadequate.

So here's a new patch, based on your latest version, which looks
reasonably committable to me.  I made a lot of revisions to the
comments.  I changed some variable and function names, too.  I also
reworked the function that determines the accelerated freeze threshold
because I was afraid that the integer return value might not be large
enough to accommodate the uint32 result of the derating calculation,
which would then turn negative.  The new logic is also, I believe,
less vulnerable to floating-point roundoff dangers.  And I tightened
up the documentation.
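
Just to illustrate the hazard, doing the calculation safely looks
roughly like this (a sketch under my own assumptions, not necessarily
what the new function does):

#include <stdint.h>
#include <stdio.h>

/*
 * Sketch only.  Doing the linear derating in 64-bit integer arithmetic
 * means a uint32 intermediate can never be squeezed into a signed int
 * and turn negative, and no floating point is involved.
 */
static uint32_t
derated_freeze_age(uint32_t configured_age, uint32_t members_used,
                   uint32_t safe, uint32_t dangerous)
{
    uint64_t span;
    uint64_t over;

    if (members_used <= safe)
        return configured_age;
    if (members_used >= dangerous)
        return 0;

    span = (uint64_t) dangerous - safe;
    over = (uint64_t) members_used - safe;
    return (uint32_t) ((uint64_t) configured_age * (span - over) / span);
}

int
main(void)
{
    /* halfway between the thresholds, the configured age is halved */
    printf("%u\n", derated_freeze_age(400000000u, 2147483648u,
                                      1073741824u, 3221225472u));
    return 0;
}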

Some residual concerns:

1. Should we be installing one or more GUCs to control this behavior?
I've gone back to hard-coding things so that at 25% we start
triggering autovacuum and by 75% we zero out the freeze ages, because
the logic you proposed in your last version looks insanely complicated
to me.  (I do realize that I suggested the approach, but that was
before I realized the full complexity of the problem.)  I now think
that if we want to make this tunable, we need to create and expose
GUCs for it.  I'm hoping we can get by without that, but I'm not sure.

2. Doesn't the code that sets MultiXactState->multiVacLimit also need
to use what I'm now calling MultiXactMemberFreezeThreshold() - or some
similar logic?  Otherwise, a user with autovacuum=off won't get
emergency autovacuums for member exhaustion, even though they will get
them for offset exhaustion.

Despite those concerns I think we're headed in the right direction here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment
I haven't read your patch, but I wonder if we should decrease the
default value of multixact_freeze_table_age (currently 150 million).
The freeze_min_age is 5 million; if freeze_table_age were a lot lower,
the problem would be less pronounced.

Additionally, I will backpatch commit 27846f02c176.  The average size of
multixacts decreases with that fix in many common cases, which greatly
reduces the need for any of this in the first place.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Robert Haas wrote:

> So here's a new patch, based on your latest version, which looks
> reasonably committable to me.

I think this code should also reduce the multixact_freeze_min_age value
at the same time as multixact_freeze_table_age.  If the table age is
reduced but freeze_min_age remains high, old multixacts might still
remain in the table.  The default value for freeze min age is 5 million,
but users may change it.  Perhaps freeze min age should be set to
Min(modified freeze table age, freeze min age) so that old multixacts
are effectively frozen whenever a full table scan is requested.

> 1. Should we be installing one or more GUCs to control this behavior?
> I've gone back to hard-coding things so that at 25% we start
> triggering autovacuum and by 75% we zero out the freeze ages, because
> the logic you proposed in your last version looks insanely complicated
> to me.  (I do realize that I suggested the approach, but that was
> before I realized the full complexity of the problem.)  I now think
> that if we want to make this tunable, we need to create and expose
> GUCs for it.  I'm hoping we can get by without that, but I'm not sure.

I think things are complicated enough; I vote for no additional GUCs at
this point.

> 2. Doesn't the code that sets MultiXactState->multiVacLimit also need
> to use what I'm now calling MultiXactMemberFreezeThreshold() - or some
> similar logic?  Otherwise, a user with autovacuum=off won't get
> emergency autovacuums for member exhaustion, even though they will get
> them for offset exhaustion.

Yeah, it looks like it does.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Robert Haas wrote:

>> So here's a new patch, based on your latest version, which looks
>> reasonably committable to me.
>
> I think this code should also reduce the multixact_freeze_min_age value
> at the same time as multixact_freeze_table_age.  If the table age is
> reduced but freeze_min_age remains high, old multixacts might still
> remain in the table.  The default value for freeze min age is 5 million,
> but users may change it.  Perhaps freeze min age should be set to
> Min(modified freeze table age, freeze min age) so that old multixacts
> are effectively frozen whenever a full table scan is requested.

I would rather see min age reduced proportionally to table age, or
at least ensure that min age is some percentage below table age.

>> 1. Should we be installing one or more GUCs to control this behavior?
>> I've gone back to hard-coding things so that at 25% we start
>> triggering autovacuum and by 75% we zero out the freeze ages, because
>> the logic you proposed in your last version looks insanely complicated
>> to me.  (I do realize that I suggested the approach, but that was
>> before I realized the full complexity of the problem.)  I now think
>> that if we want to make this tunable, we need to create and expose
>> GUCs for it.  I'm hoping we can get by without that, but I'm not sure.
>
> I think things are complicated enough; I vote for no additional GUCs at
> this point.

+1

For one thing, we should try to have something we can back-patch,
and new GUCs in a minor release seem like something to avoid, if
possible.  For another thing, we've tended not to put in GUCs if
there is no reasonable way for a user to determine a good value,
and that seems to be the case here.

>> 2. Doesn't the code that sets MultiXactState->multiVacLimit also need
>> to use what I'm now calling MultiXactMemberFreezeThreshold() - or some
>> similar logic?  Otherwise, a user with autovacuum=off won't get
>> emergency autovacuums for member exhaustion, even though they will get
>> them for offset exhaustion.
>
> Yeah, it looks like it does.

+1

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, May 6, 2015 at 10:34 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
>> So here's a new patch, based on your latest version, which looks
>> reasonably committable to me.
>
> I think this code should also reduce the multixact_freeze_min_age value
> at the same time as multixact_freeze_table_age.

I think it does that.  It sets the min age to half the value it sets
for the table age, which I think is consistent with what we do
elsewhere.

>> 1. Should we be installing one or more GUCs to control this behavior?
>> I've gone back to hard-coding things so that at 25% we start
>> triggering autovacuum and by 75% we zero out the freeze ages, because
>> the logic you proposed in your last version looks insanely complicated
>> to me.  (I do realize that I suggested the approach, but that was
>> before I realized the full complexity of the problem.)  I now think
>> that if we want to make this tunable, we need to create and expose
>> GUCs for it.  I'm hoping we can get by without that, but I'm not sure.
>
> I think things are complicated enough; I vote for no additional GUCs at
> this point.

That's fine with me for now.

>> 2. Doesn't the code that sets MultiXactState->multiVacLimit also need
>> to use what I'm now calling MultiXactMemberFreezeThreshold() - or some
>> similar logic?  Otherwise, a user with autovacuum=off won't get
>> emergency autovacuums for member exhaustion, even though they will get
>> them for offset exhaustion.
>
> Yeah, it looks like it does.

OK, I'm not clear how to do that correctly, exactly, but hopefully one
of us can figure that out.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, May 7, 2015 at 4:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, May 6, 2015 at 10:34 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>> 2. Doesn't the code that sets MultiXactState->multiVacLimit also need
>>> to use what I'm now calling MultiXactMemberFreezeThreshold() - or some
>>> similar logic?  Otherwise, a user with autovacuum=off won't get
>>> emergency autovacuums for member exhaustion, even though they will get
>>> them for offset exhaustion.
>>
>> Yeah, it looks like it does.
>
> OK, I'm not clear how to do that correctly, exactly, but hopefully one
> of us can figure that out.

MultiXactState->multiVacLimit holds the multixact ID at which the age
threshold will be crossed, so that GetNewMultiXactId can check it
cheaply.  But we can't predict the future multixact IDs at which our
member usage threshold will be crossed.  We could try to estimate it
based on past multixact sizes, but (as I think we already covered
somewhere else) we shouldn't be trying to do that because it wouldn't
handle the situation where your member space consumption rate suddenly
went up, among other problems.

How about this:  we add oldestOffset to MultiXactState, to be set by
DetermineSafeOldestOffset, and then at the place where
GetNewMultiXactId checks if (!MultiXactIdPrecedes(result,
MultiXactState->multiVacLimit) it could also check whether (nextOffset
- MultiXactState->oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD).
ReadMultiXactCounts should also use the oldestOffset value directly
from shmem instead of calling find_multixact_start.
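
In standalone form, the cheap test would look roughly like this (a mock
only: the threshold constant is an arbitrary stand-in, and the real check
would read both values under MultiXactGenLock inside GetNewMultiXactId):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t MultiXactOffset;

/* Stand-in value; the real threshold would be some fraction of the 2^32
 * member offset space. */
#define MULTIXACT_MEMBER_SAFE_THRESHOLD ((MultiXactOffset) 1 << 30)

static bool
member_space_needs_vacuum(MultiXactOffset nextOffset, MultiXactOffset oldestOffset)
{
    /* Unsigned subtraction counts the member slots in use even when
     * nextOffset has wrapped past oldestOffset in the circular space. */
    return (MultiXactOffset) (nextOffset - oldestOffset) > MULTIXACT_MEMBER_SAFE_THRESHOLD;
}

int
main(void)
{
    /* Wrapped case: nextOffset is numerically smaller than oldestOffset. */
    printf("%d\n", (int) member_space_needs_vacuum(100u, 3000000000u));
    return 0;
}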

--
Thomas Munro
http://www.enterprisedb.com
On May 6, 2015, at 9:48 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> On Thu, May 7, 2015 at 4:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, May 6, 2015 at 10:34 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>>> 2. Doesn't the code that sets MultiXactState->multiVacLimit also need
>>>> to use what I'm now calling MultiXactMemberFreezeThreshold() - or some
>>>> similar logic?  Otherwise, a user with autovacuum=off won't get
>>>> emergency autovacuums for member exhaustion, even though they will get
>>>> them for offset exhaustion.
>>>
>>> Yeah, it looks like it does.
>>
>> OK, I'm not clear how to do that correctly, exactly, but hopefully one
>> of us can figure that out.
>
> MultiXactState->multiVacLimit holds the multixact ID at which the age
> threshold will be crossed, so that GetNewMultiXactId can check it
> cheaply.  But we can't predict the future multixact IDs at which our
> member usage threshold will be crossed.  We could try to estimate it
> based on past multixact sizes, but (as I think we already covered
> somewhere else) we shouldn't be trying to do that because it wouldn't
> handle the situation where your member space consumption rate suddenly
> went up, among other problems.
>
> How about this:  we add oldestOffset to MultiXactState, to be set by
> DetermineSafeOldestOffset, and then at the place where
> GetNewMultiXactId checks if (!MultiXactIdPrecedes(result,
> MultiXactState->multiVacLimit) it could also check whether (nextOffset
> - MultiXactState->oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD).
> ReadMultiXactCounts should also use the oldestOffset value directly
> from shmem instead of calling find_multixact_start.

That sounds pretty good.

...Robert
On Thu, May 7, 2015 at 2:44 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On May 6, 2015, at 9:48 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>>> On Thu, May 7, 2015 at 4:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Wed, May 6, 2015 at 10:34 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>>>> 2. Doesn't the code that sets MultiXactState->multiVacLimit also need
>>>>> to use what I'm now calling MultiXactMemberFreezeThreshold() - or some
>>>>> similar logic?  Otherwise, a user with autovacuum=off won't get
>>>>> emergency autovacuums for member exhaustion, even though they will get
>>>>> them for offset exhaustion.
>>>>
>>>> Yeah, it looks like it does.
>>>
>>> OK, I'm not clear how to do that correctly, exactly, but hopefully one
>>> of us can figure that out.
>>
>> MultiXactState->multiVacLimit holds the multixact ID at which the age
>> threshold will be crossed, so that GetNewMultiXactId can check it
>> cheaply.  But we can't predict the future multixact IDs at which our
>> member usage threshold will be crossed.  We could try to estimate it
>> based on past multixact sizes, but (as I think we already covered
>> somewhere else) we shouldn't be trying to do that because it wouldn't
>> handle the situation where your member space consumption rate suddenly
>> went up, among other problems.
>>
>> How about this:  we add oldestOffset to MultiXactState, to be set by
>> DetermineSafeOldestOffset, and then at the place where
>> GetNewMultiXactId checks if (!MultiXactIdPrecedes(result,
>> MultiXactState->multiVacLimit) it could also check whether (nextOffset
>> - MultiXactState->oldestOffset > MULTIXACT_MEMBER_SAFE_THRESHOLD).
>> ReadMultiXactCounts should also use the oldestOffset value directly
>> from shmem instead of calling find_multixact_start.
>
> That sounds pretty good.

See attached patch, based on your multixact-av.patch.  With autovacuum
set to off, it vacuums as expected.  I wonder if
DetermineSafeOldestOffset is being called in all the right places to
guarantee that the state is initialised.

This patch will change it anyway, but I noticed that oldestOffset's
computation to find the start of the segment seems wrong in master, I
think it should be like this, no?

@@ -2495,7 +2495,7 @@ DetermineSafeOldestOffset(MultiXactId oldestMXact)
         */
        oldestOffset = find_multixact_start(oldestMXact);
        /* move back to start of the corresponding segment */
-       oldestOffset -= oldestOffset / MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT;
+       oldestOffset -= oldestOffset % (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);

        LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
        /* always leave one segment before the wraparound point */

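To see what the corrected arithmetic does, here is a tiny standalone
illustration (not part of the patch; the constants are what I recall from
the 9.3-era multixact.c and slru.h, so treat them as assumptions):

#include <stdint.h>
#include <stdio.h>

#define MULTIXACT_MEMBERS_PER_PAGE 1636u   /* assumed 9.3-era value */
#define SLRU_PAGES_PER_SEGMENT 32u

int
main(void)
{
    uint32_t oldestOffset = 100000;

    /* Original expression: divides by members-per-page, multiplies by
     * pages-per-segment, and lands nowhere meaningful. */
    uint32_t wrong = oldestOffset -
        oldestOffset / MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT;

    /* Corrected expression: subtracts the remainder modulo members-per-
     * segment, i.e. rounds down to the start of the segment. */
    uint32_t right = oldestOffset -
        oldestOffset % (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);

    printf("wrong=%u right=%u\n", wrong, right);   /* 98048 vs 52352 */
    return 0;
}
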
--
Thomas Munro
http://www.enterprisedb.com

Attachment
Thomas Munro <thomas.munro@enterprisedb.com> wrote:

> I noticed that oldestOffset's computation to find the start of
> the segment seems wrong in master, I think it should be like
> this, no?

The change you show looks to me like the right way to take
oldestOffset and reduce to the first on a page.  Without your
suggested change the calculation lands the value at a location that
has no real meaning.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, May 7, 2015 at 12:23 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> This patch will change it anyway, but I noticed that oldestOffset's
> computation to find the start of the segment seems wrong in master, I
> think it should be like this, no?
>
> @@ -2495,7 +2495,7 @@ DetermineSafeOldestOffset(MultiXactId oldestMXact)
>          */
>         oldestOffset = find_multixact_start(oldestMXact);
>         /* move back to start of the corresponding segment */
> -       oldestOffset -= oldestOffset / MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT;
> +       oldestOffset -= oldestOffset % (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT);
>
>         LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
>         /* always leave one segment before the wraparound point */

This should be committed and back-patched separately, I think.  I'll
go do that now.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Kevin Grittner <kgrittn@ymail.com> wrote:

> The change you show looks to me like the right way to take
> oldestOffset and reduce to the first on a page.

Of course, I meant the first offset on the first page of a segment.


--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:
> On Thu, May 7, 2015 at 12:23 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
> > This patch will change it anyway, but I noticed that oldestOffset's
> > computation to find the start of the segment seems wrong in master, I
> > think it should be like this, no?

> This should be committed and back-patched separately, I think.  I'll
> go do that now.

I was going to say the same.  Thanks.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, May 7, 2015 at 11:18 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Robert Haas wrote:
>> On Thu, May 7, 2015 at 12:23 AM, Thomas Munro
>> <thomas.munro@enterprisedb.com> wrote:
>> > This patch will change it anyway, but I noticed that oldestOffset's
>> > computation to find the start of the segment seems wrong in master, I
>> > think it should be like this, no?
>
>> This should be committed and back-patched separately, I think.  I'll
>> go do that now.
>
> I was going to say the same.  Thanks.

OK, so here is what's in Thomas's latest version that's not already
committed and different from my last version.  It should apply on top
of my multixact-av.patch, but I attach it here as a separate extract
for ease of review.  I think what we need to determine is whether this
is the right fix for the problem of starting emergency autovacuums for
member space exhaustion even when autovacuum=off.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment
On Thu, May 7, 2015 at 11:30 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> OK, so here is what's in Thomas's latest version that's not already
> committed and different from my last version.  It should apply on top
> of my multixact-av.patch, but I attach it here as a separate extract
> for ease of review.  I think what we need to determine is whether this
> is the right fix for the problem of starting emergency autovacuums for
> member space exhaustion even when autovacuum=off.

Nope, I extracted that incorrectly.  Second try attached.

I think Thomas's question upthread is a very good one:

> See attached patch, based on your multixact-av.patch.  With autovacuum
> set to off, it vacuums as expected.  I wonder if
> DetermineSafeOldestOffset is being called in all the right places to
> guarantee that the state is initialised.

It seems to me that the most obvious places where
DetermineSafeOldestOffset() should be called are (1) at startup or
after recovery, to initialize the value; and (2) each time we truncate
the SLRU, to update the value.  Other than that, this doesn't change.
The startup calls are there, in apparently reasonable places, but it's
not obvious to me how this gets called in the TruncateMultiXact path.
Instead it seems to get set via the SetMultiXactIdLimit path.  Maybe
that's OK, but it would seem to imply that we're OK with overwriting
old members information if that information was slated for truncation
at the next checkpoint anyway, which seems scary.  Wouldn't it be
better to only advance the threshold once the checkpoint completes and
the truncation is done?

Maybe I'm misreading something here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment
Robert Haas wrote:

> It seems to me that the most obvious places where
> DetermineSafeOldestOffset() should be called are (1) at startup or
> after recovery, to initialize the value; and (2) each time we truncate
> the SLRU, to update the value.  Other than that, this doesn't change.
> The startup calls are there, in apparently reasonable places, but it's
> not obvious to me how this gets called in the TruncateMultiXact path.
> Instead it seems to get set via the SetMultiXactIdLimit path.  Maybe
> that's OK, but it would seem to imply that we're OK with overwriting
> old members information if that information was slated for truncation
> at the next checkpoint anyway, which seems scary.

I considered this question in the previous commit: is it okay to
overwrite a file that is no longer used (per the limits set by vacuum)
but not yet removed (by checkpoint)?  It seems to me that there is no
data-loss issue with doing that -- which is why the advance-oldest code
is called during vacuum and not during checkpoint.

> Wouldn't it be better to only advance the threshold once the
> checkpoint completes and the truncation is done?

*shrug*


There's also a question about this function's "else" branch:

void
MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB)
{
    if (MultiXactIdPrecedes(MultiXactState->oldestMultiXactId, oldestMulti))
        SetMultiXactIdLimit(oldestMulti, oldestMultiDB);
    else
        DetermineSafeOldestOffset(oldestMulti);
}

The reason the "else" is there is that I chickened out about not calling
DetermineSafeOldestOffset() (which SetMultiXactIdLimit calls) when the
oldestMulti does not advance; what if in the previous run we failed to
get to this point for whatever reason?  The calls to SetMultiXactIdLimit
are seldom enough that it seemed better to be protected against previous
runs not finishing.  Then again, I might be worrying over nothing and
this is just dead code.

Also, if we have extra calls in other places after Thomas' patch,
perhaps this one is not necessary.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, May 7, 2015 at 2:00 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Robert Haas wrote:
>> It seems to me that the most obvious places where
>> DetermineSafeOldestOffset() should be called are (1) at startup or
>> after recovery, to initialize the value; and (2) each time we truncate
>> the SLRU, to update the value.  Other than that, this doesn't change.
>> The startup calls are there, in apparently reasonable places, but it's
>> not obvious to me how this gets called in the TruncateMultiXact path.
>> Instead it seems to get set via the SetMultiXactIdLimit path.  Maybe
>> that's OK, but it would seem to imply that we're OK with overwriting
>> old members information if that information was slated for truncation
>> at the next checkpoint anyway, which seems scary.
>
> I considered this question in the previous commit: is it okay to
> overwrite a file that is no longer used (per the limits set by vacuum)
> but not yet removed (by checkpoint)?  It seems to me that there is no
> data-loss issue with doing that -- which is why the advance-oldest code
> is called during vacuum and not during checkpoint.

I think the main question is whether truncation will be smart enough
to zap the non-overwritten part of the old stuff but not the part that
did get overwritten.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:
> On Thu, May 7, 2015 at 2:00 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> > I considered this question in the previous commit: is it okay to
> > overwrite a file that is no longer used (per the limits set by vacuum)
> > but not yet removed (by checkpoint)?  It seems to me that there is no
> > data-loss issue with doing that -- which is why the advance-oldest code
> > is called during vacuum and not during checkpoint.
>
> I think the main question is whether truncation will be smart enough
> to zap the non-overwritten part of the old stuff but not the part that
> did get overwritten.

TruncateMultiXact bases its removal of member files on the point set by
checkpoint, MultiXactState->lastCheckpointedOldest.  I tested this back
then and there was no issue.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, May 6, 2015 at 9:32 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> So here's a new patch, based on your latest version, which looks
> reasonably committable to me.  I made a lot of revisions to the
> comments.  I changed some variable and function names, too.  I also
> reworked the function that determines the accelerated freeze threshold
> because I was afraid that the integer return value might not be large
> enough to accommodate the uint32 result of the derating calculation,
> which would then turn negative.  The new logic is also, I believe,
> less vulnerable to floating-point roundoff dangers.  And I tightened
> up the documentation.

Here's a new revision of this patch.  When I looked into back-patching
this, it conflicted in ExecVacuum().  And the more I thought about
that, the less I liked it, because I think ExecVacuum() is really
intended to translate from the parse tree representation to whatever
internal representation we're going to pass to vacuum, NOT to make
policy decisions.  So in the attached version, what I did instead is
modify vacuum_set_xid_limits() to call
MultiXactMemberFreezeThreshold() itself instead of trying to pass that
value down via the vacuum options.   It seems to me that this greatly
reduces the chances of inconsistency between member-wrap vacuums and
offset-wrap vacuums.  The code seems more elegant, too:
MultiXactMemberFreezeThreshold() now returns
autovacuum_multixact_freeze_max_age rather than -1 when no special
handling is needed to prevent member wraparound, and all of the
callers now find this convention convenient.  As a bonus, it now
back-patches to 9.4 without conflicts.  There are some conflicts on
9.3 but they don't look bad.
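
For reference, the general shape of MultiXactMemberFreezeThreshold() as
discussed in this thread is roughly the following standalone sketch (the
25%/75% figures come from earlier in the thread; the interpolation details
are an approximation, not the committed function):

#include <stdint.h>
#include <stdio.h>

#define MEMBER_SPACE      4294967296.0          /* 2^32 member offsets */
#define SAFE_THRESHOLD    ((uint32_t) (MEMBER_SPACE * 0.25))
#define DANGER_THRESHOLD  ((uint32_t) (MEMBER_SPACE * 0.75))

/* Returns the effective multixact freeze max age given current member and
 * multixact usage.  Below 25% member usage the normal GUC value is used;
 * between 25% and 75% the age is scaled down; at 75% it reaches zero. */
static uint32_t
member_freeze_threshold_sketch(uint32_t members, uint32_t multixacts,
                               uint32_t freeze_max_age)
{
    double   fraction;
    uint32_t victims;
    uint32_t result;

    if (members <= SAFE_THRESHOLD)
        return freeze_max_age;

    fraction = (double) (members - SAFE_THRESHOLD) /
               (double) (DANGER_THRESHOLD - SAFE_THRESHOLD);
    if (fraction >= 1.0)
        return 0;

    /* Aim to freeze a proportional share of the existing multixacts. */
    victims = (uint32_t) (multixacts * fraction);
    result = multixacts - victims;
    return result < freeze_max_age ? result : freeze_max_age;
}

int
main(void)
{
    printf("%u\n", member_freeze_threshold_sketch(3000000000u, 1000000u,
                                                  400000000u));
    return 0;
}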

This will need some testing to make sure it still works; I haven't done that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment
On Fri, May 8, 2015 at 6:00 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Robert Haas wrote:
>
>> It seems to me that the most obvious places where
>> DetermineSafeOldestOffset() should be called are (1) at startup or
>> after recovery, to initialize the value; and (2) each time we truncate
>> the SLRU, to update the value.  Other than that, this doesn't change.
>> The startup calls are there, in apparently reasonable places, but it's
>> not obvious to me how this gets called in the TruncateMultiXact path.
>> Instead it seems to get set via the SetMultiXactIdLimit path.  Maybe
>> that's OK, but it would seem to imply that we're OK with overwriting
>> old members information if that information was slated for truncation
>> at the next checkpoint anyway, which seems scary.
>
> I considered this question in the previous commit: is it okay to
> overwrite a file that is no longer used (per the limits set by vacuum)
> but not yet removed (by checkpoint)?  It seems to me that there is no
> data-loss issue with doing that -- which is why the advance-oldest code
> is called during vacuum and not during checkpoint.

Given three concurrent backends that are checkpointing, vacuuming and
creating multixacts, isn't there some ordering of events that could
cause SlruScanDirCbRemoveMembers to delete a segment file that it
judges to be the tail segment, but which has become the new head
segment just before the file gets unlinked?  The comment in
SlruScanDirCbRemoveMembers says "... and we never increase rangeStart
during any one run", but I think that a concurrent vacuum could
advance the oldest multixact and therefore the oldest offset (which is
what range->rangeStart was derived from), and if you're extra unlucky,
a third backend that is creating multixacts could then advance
nextOffset so that it points to a page *past* range->rangeStart, and
we'd start blowing away the wrong data.

I may have the details wrong there, but in general I can't see how you
can delete just the right segment files while the head and tail
pointers of this circular buffer are both moving, unless you hold the
lock to stop that while you make the decision and unlink the file
(which I assume is out of the question), or say 'well if we keep the
head and tail pointers at least 20 pages apart, that's unlikely to
happen' (which I assume is also out of the question).  Now I
understand the suggestion that the checkpoint code could be in charge
of advancing the oldest multixact + offset.

But if we did that, our new autovacuum code would get confused and
keep trying to vacuum stuff that it's already dealt with, until the
next checkpoint... so maybe we'd need to track both (1) the offset of
the oldest multixact: the one that we use to trigger autovacuums, even
though there might be even older segments still on disk, and (2)
offset_of_start_of_oldest_segment_on_disk (suitably compressed): the
one that we use to block wraparound in GetNewMultiXactId, so that we
never write into a file that checkpoint might decide to delete.

> Also, if we have extra calls in other places after Thomas' patch,
> perhaps this one is not necessary.

No extra calls added.

--
Thomas Munro
http://www.enterprisedb.com
On May 7, 2015, at 6:21 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> I may have the details wrong there, but in general I can't see how you
> can delete just the right segment files while the head and tail
> pointers of this circular buffer are both moving, unless you hold the
> lock to stop that while you make the decision and unlink the file
> (which I assume is out of the question), or say 'well if we keep the
> head and tail pointers at least 20 pages apart, that's unlikely to
> happen' (which I assume is also out of the question).

That is exactly the sort of thing I was worried about, but I couldn't put
my finger on it. I think your analysis is right.

> Now I
> understand the suggestion that the checkpoint code could be in charge
> of advancing the oldest multixact + offset.

Yeah. I think we need to pursue that angle unless somebody has a better
idea.  It would also address another issue I am concerned about: the
current system seems to advance the oldest multixact information in
shared memory differently on the primary and the standby. I'm not sure
the logic actually works on the standby at all at this point, but even if
it does, it seems unlikely to be right to rely on redo to do on the
standby what is being done by a completely different, not-WAL-logged
operation on the master.  Making the checkpoint do it in both cases would
fix that.

> But if we did that, our new autovacuum code would get confused and
> keep trying to vacuum stuff that it's already dealt with, until the
> next checkpoint... so maybe we'd need to track both (1) the offset of
> the oldest multixact: the one that we use to trigger autovacuums, even
> though there might be even older segments still on disk, and (2)
> offset_of_start_of_oldest_segment_on_disk (suitably compressed): the
> one that we use to block wraparound in GetNewMultiXactId, so that we
> never write into a file that checkpoint might decide to delete.

That sounds plausible.


...Robert
On Fri, May 8, 2015 at 8:32 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Here's a new revision of this patch.  When I looked into back-patching
> this, it conflicted in ExecVacuum().  And the more I thought about
> that, the less I liked it, because I think ExecVacuum() is really
> intended to translate from the parse tree representation to whatever
> internal representation we're going to pass to vacuum, NOT to make
> policy decisions.  So in the attached version, what I did instead is
> modify vacuum_set_xid_limits() to call
> MultiXactMemberFreezeThreshold() itself instead of trying to pass that
> value down via the vacuum options.   It seems to me that this greatly
> reduces the chances of inconsistency between member-wrap vacuums and
> offset-wrap vacuums.  The code seems more elegant, too:
> MultiXactMemberFreezeThreshold() now returns
> autovacuum_multixact_freeze_max_age rather than -1 when no special
> handling is needed to prevent member wraparound, and all of the
> callers now find this convention convenient.  As a bonus, it now
> back-patches to 9.4 without conflicts.  There are some conflicts on
> 9.3 but they don't look bad.

This is definitely tidier.

> This will need some testing to make sure it still works; I haven't done that.

I tested it with autovacuum on, and it automatically vacuumed after
25% of member space was used.

I tested it with autovacuum off, and I ran a VACUUM manually every few
minutes and checked that it didn't advance my database's datminmxid,
until after 25% of member space was used, after which a full scan was
triggered and datminmxid moved.  As expected, that didn't advance the
cluster-wide oldest multixact, because for that you'd need to vacuum
ALL databases including non-connectable ones.  (This patch doesn't
update the forced-autovacuum-even-if-autovacuum-is-off feature to
handle member space wraparound in GetNewMultiXactId, which is coming
in a separate patch[1].)

So it works as expected.

[1] http://www.postgresql.org/message-id/CA+TgmobbaQpE6sNqT30+rz4UMH5aPraq20gko5xd2ZGajz1-Jg@mail.gmail.com

--
Thomas Munro
http://www.enterprisedb.com
On Thu, May 7, 2015 at 2:00 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> There's also a question about this function's "else" branch:
>
> void
> MultiXactAdvanceOldest(MultiXactId oldestMulti, Oid oldestMultiDB)
> {
>     if (MultiXactIdPrecedes(MultiXactState->oldestMultiXactId, oldestMulti))
>         SetMultiXactIdLimit(oldestMulti, oldestMultiDB);
>     else
>         DetermineSafeOldestOffset(oldestMulti);
> }
>
> The reason the "else" is there is that I chickened out about not calling
> DetermineSafeOldestOffset() (which SetMultiXactIdLimit calls) when the
> oldestMulti does not advance; what if in the previous run we failed to
> get to this point for whatever reason?  The calls to SetMultiXactIdLimit
> are seldom enough that it seemed better to be protected against previous
> runs not finishing.  Then again, I might be worrying over nothing and
> this is just dead code.

It's dead code for more pedestrian reasons: MultiXactAdvanceOldest()
is called only from xlog_redo(), but does nothing if InRecovery.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, May 8, 2015 at 5:32 AM, Robert Haas wrote:
> because I think ExecVacuum() is really
> intended to translate from the parse tree representation to whatever
> internal representation we're going to pass to vacuum, NOT to make
> policy decisions.

Yep. That's more or less the idea behind it.
--
Michael
On Thu, May 7, 2015 at 7:58 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Now I
>> understand the suggestion that the checkpoint code could be in charge
>> of advancing the oldest multixact + offset.
>
> Yeah. I think we need to pursue that angle unless somebody has a better idea.

So here's a patch that does that.  It turns out to be pretty simple: I
just moved the DetermineSafeOldestOffset() calls around.  As far as I
can see, and that may not be far enough at this hour of the morning,
we just need two: one in StartupMultiXact, so that we initialize the
values correctly after reading the control file; and then another at
the very end of TruncateMultiXact, so we update it after each
checkpoint or restartpoint.  This leaves the residual problem that
autovacuum doesn't directly advance the stop point - the following
checkpoint does.  We could handle that by requesting a checkpoint if
oldestMultiXactId is ahead of lastCheckpointedOldest by enough that a
checkpoint would free up some space, although one might hope that
people won't be living on the edge to quite that degree.
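
In mocked standalone form, the intended ordering is simply the following
(the function names are stand-ins; the real code in multixact.c also has
to worry about locks, WAL, and standby behaviour):

#include <stdio.h>

/* Mocked stand-ins for the real multixact.c routines. */
static void delete_unneeded_segments(void)  { printf("truncate member segments\n"); }
static void advance_offset_stop_limit(void) { printf("advance offsetStopLimit\n"); }

/* Sketch of the ordering the patch aims for at the end of a checkpoint's
 * TruncateMultiXact: physical truncation happens first, and only then is
 * the stop limit advanced, so new member data can never be written into a
 * file that the in-progress checkpoint may still delete. */
static void
mock_truncate_multixact(void)
{
    delete_unneeded_segments();
    advance_offset_stop_limit();
}

int
main(void)
{
    mock_truncate_multixact();
    return 0;
}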

As things are, I believe this sequence is possible:

1. The members SLRU is full all the way up to offsetStopLimit.
2. A checkpoint occurs, reaching MultiXactSetSafeTruncate(), which
sets lastCheckpointedOldest.
3. Vacuum runs, calling SetMultiXactIdLimit(), calling
DetermineSafeOldestOffset(), advancing
MultiXactState->offsetStopLimit.
4. Since offsetStopLimit > lastCheckpointedOffset, it's now possible
for someone to consume an MXID greater than offsetStopLimit, making
MultiXactState->nextOffset > lastCheckpointedOffset
5. The checkpoint from step 1, continuing on its merry way, now calls
TruncateMultiXact(), which sets rangeEnd > rangeStart and blows away
nearly every file in the SLRU.

I haven't confirmed this yet, so I might still be all wet, especially
since it is late here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment
On Fri, May 8, 2015 at 6:25 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> 1. The members SLRU is full all the way up to offsetStopLimit.
> 2. A checkpoint occurs, reaching MultiXactSetSafeTruncate(), which
> sets lastCheckpointedOldest.
> 3. Vacuum runs, calling SetMultiXactIdLimit(), calling
> DetermineSafeOldestOffset(), advancing
> MultiXactState->offsetStopLimit.
> 4. Since offsetStopLimit > lastCheckpointedOffset, it's now possible
> for someone to consume an MXID greater than offsetStopLimit, making
> MultiXactState->nextOffset > lastCheckpointedOffset
> 5. The checkpoint from step 1, continuing on its merry way, now calls
> TruncateMultiXact(), which sets rangeEnd > rangeStart and blows away
> nearly every file in the SLRU.

I am still working on reproducing this race scenario various different
ways including the way you described, but at step 4 I kept getting
stuck, unable to create new multixacts despite having vacuum-frozen
all databases (including template0) and advanced the cluster minimum
mxid.

I think I see why, and I think it's a bug:  if you vacuum freeze all
your databases, MultiXactState->oldestMultiXactId finishes up equal to
MultiXactState->nextMXact.  But that's not actually a multixact that
exists yet, so when DetermineSafeOldestOffset calls
find_multixact_start, it reads a garbage offset (all zeros in practice
since pages start out zeroed) and produces a garbage value for
offsetStopLimit which might incorrectly stop you from creating any
more multixacts even though member space is entirely empty (but it
depends on where your nextOffset happens to be at the time).  I think
the fix is something like "if nextMXact == oldestMultiXactId, then
there are no active multixacts, so the offsetStopLimit should be set
to nextOffset - (a segment's worth)".
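
In standalone form, the shape of that special case might look something
like this (a mock, not a real patch; the constants and the helper are
stand-ins for the multixact.c internals):

#include <stdint.h>
#include <stdio.h>

typedef uint32_t MultiXactId;
typedef uint32_t MultiXactOffset;

#define MULTIXACT_MEMBERS_PER_PAGE 1636u          /* assumed 9.3-era value */
#define SLRU_PAGES_PER_SEGMENT 32u
#define MULTIXACT_MEMBERS_PER_SEGMENT (MULTIXACT_MEMBERS_PER_PAGE * SLRU_PAGES_PER_SEGMENT)

/* Stand-in for find_multixact_start(); the real one reads the offsets SLRU,
 * which is exactly what goes wrong when oldestMXact doesn't exist yet. */
static MultiXactOffset
mock_find_multixact_start(MultiXactId mxid)
{
    (void) mxid;
    return 0;                    /* zeroed page => garbage offset of 0 */
}

static MultiXactOffset
compute_offset_stop_limit(MultiXactId oldestMXact, MultiXactId nextMXact,
                          MultiXactOffset nextOffset)
{
    MultiXactOffset oldestOffset;

    if (oldestMXact == nextMXact)
        oldestOffset = nextOffset;   /* no multixacts exist: member space is empty */
    else
        oldestOffset = mock_find_multixact_start(oldestMXact);

    /* Round down to the segment start, then leave one segment's gap before
     * the wraparound point; unsigned arithmetic handles the wrap. */
    oldestOffset -= oldestOffset % MULTIXACT_MEMBERS_PER_SEGMENT;
    return oldestOffset - MULTIXACT_MEMBERS_PER_SEGMENT;
}

int
main(void)
{
    printf("%u\n", compute_offset_stop_limit(1000, 1000, 5000000u));
    return 0;
}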

--
Thomas Munro
http://www.enterprisedb.com
Thomas Munro wrote:

> I think I see why, and I think it's a bug:  if you vacuum freeze all
> your databases, MultiXactState->oldestMultiXactId finishes up equal to
> MultiXactState->nextMXact.

Uhm, that's rather silly.

> But that's not actually a multixact that exists yet, so when
> DetermineSafeOldestOffset calls find_multixact_start, it reads a
> garbage offset (all zeros in practice since pages start out zeroed)
> and produces a garbage value for offsetStopLimit which might
> incorrectly stop you from creating any more multixacts even though
> member space is entirely empty (but it depends on where your
> nextOffset happens to be at the time).

Right.

> I think the fix is something like "if nextMXact == oldestMultiXactId,
> then there are no active multixacts, so the offsetStopLimit should be
> set to nextOffset - (a segment's worth)".

Makes sense.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, May 8, 2015 at 9:55 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Thomas Munro wrote:
>> I think I see why, and I think it's a bug:  if you vacuum freeze all
>> your databases, MultiXactState->oldestMultiXactId finishes up equal to
>> MultiXactState->nextMXact.
>
> Uhm, that's rather silly.
>
>> But that's not actually a multixact that exists yet, so when
>> DetermineSafeOldestOffset calls find_multixact_start, it reads a
>> garbage offset (all zeros in practice since pages start out zeroed)
>> and produces a garbage value for offsetStopLimit which might
>> incorrectly stop you from creating any more multixacts even though
>> member space is entirely empty (but it depends on where your
>> nextOffset happens to be at the time).
>
> Right.
>
>> I think the fix is something like "if nextMXact == oldestMultiXactId,
>> then there are no active multixacts, so the offsetStopLimit should be
>> set to nextOffset - (a segment's worth)".
>
> Makes sense.

Here's a patch that attempts to implement this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment
On Sat, May 9, 2015 at 2:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, May 8, 2015 at 9:55 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> Thomas Munro wrote:
>>> I think the fix is something like "if nextMXact == oldestMultiXactId,
>>> then there are no active multixacts, so the offsetStopLimit should be
>>> set to nextOffset - (a segment's worth)".
>>
>> Makes sense.
>
> Here's a patch that attempts to implement this.

Thanks.  I think I have managed to reproduce something like the data
loss race that we were speculating about.

0.  initdb, autovacuum = off, set up explode_mxact_members.c as
described elsewhere in the thread.
1.  Fill up the members SLRU completely (ie reach state where you can
no longer create a new multixact of any size).  pg_multixact/members
contains 82040 files and the last one is named 14077.
2.  Issue CHECKPOINT, but use a debugger to stop inside
TruncateMultiXact after it has read
MultiXactState->lastCheckpointedOldest and released the lock, but
before it calls SlruScanDirectory to delete files...
3.  Run VACUUM FREEZE in all databases (including template0).  datminmxid moves.
4.  Create lots of new multixacts.  pg_multixact/members now contains
82041 files and the last one is named 14078 (ie one extra segment,
with the highest possible segment number, which couldn't be created
before vacuuming because of the one segment gap enforced by
DetermineSafeOldestOffset).  Segments 0000-0016 have new modified
times.
5.  ... allow the checkpoint started in step 2 to continue.  It
deletes segments, keeping only 0000-0016.  The segment 14078 which
contained active member data has been incorrectly deleted.

--
Thomas Munro
http://www.enterprisedb.com
On May 9, 2015, at 8:00 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>> On Sat, May 9, 2015 at 2:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Fri, May 8, 2015 at 9:55 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>> Thomas Munro wrote:
>>>> I think the fix is something like "if nextMXact == oldestMultiXactId,
>>>> then there are no active multixacts, so the offsetStopLimit should be
>>>> set to nextOffset - (a segment's worth)".
>>>
>>> Makes sense.
>>
>> Here's a patch that attempts to implement this.
>
> Thanks.  I think I have managed to reproduce something like the data
> loss race that we were speculating about.
>
> 0.  initdb, autovacuum = off, set up explode_mxact_members.c as
> described elsewhere in the thread.
> 1.  Fill up the members SLRU completely (ie reach state where you can
> no longer create a new multixact of any size).  pg_multixact/members
> contains 82040 files and the last one is named 14077.
> 2.  Issue CHECKPOINT, but use a debugger to stop inside
> TruncateMultiXact after it has read
> MultiXactState->lastCheckpointedOldest and released the lock, but
> before it calls SlruScanDirectory to delete files...
> 3.  Run VACUUM FREEZE in all databases (including template0).  datminmxid moves.
> 4.  Create lots of new multixacts.  pg_multixact/members now contains
> 82041 files and the last one is named 14078 (ie one extra segment,
> with the highest possible segment number, which couldn't be created
> before vacuuming because of the one segment gap enforced by
> DetermineSafeOldestOffset).  Segments 0000-0016 have new modified
> times.
> 5.  ... allow the checkpoint started in step 2 to continue.  It
> deletes segments, keeping only 0000-0016.  The segment 14078 which
> contained active member data has been incorrectly deleted.

OK. So the next question is: if you then apply the other patch, does that
prevent step 4 and thereby avoid catastrophe?

...Robert
On Sun, May 10, 2015 at 12:43 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On May 9, 2015, at 8:00 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
>>> On Sat, May 9, 2015 at 2:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> On Fri, May 8, 2015 at 9:55 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>>>> Thomas Munro wrote:
>>>>> I think the fix is something like "if nextMXact == oldestMultiXactId,
>>>>> then there are no active multixacts, so the offsetStopLimit should be
>>>>> set to nextOffset - (a segment's worth)".
>>>>
>>>> Makes sense.
>>>
>>> Here's a patch that attempts to implement this.
>>
>> Thanks.  I think I have managed to reproduce something like the data
>> loss race that we were speculating about.
>>
>> 0.  initdb, autovacuum = off, set up explode_mxact_members.c as
>> described elsewhere in the thread.
>> 1.  Fill up the members SLRU completely (ie reach state where you can
>> no longer create a new multixact of any size).  pg_multixact/members
>> contains 82040 files and the last one is named 14077.
>> 2.  Issue CHECKPOINT, but use a debugger to stop inside
>> TruncateMultiXact after it has read
>> MultiXactState->lastCheckpointedOldest and released the lock, but
>> before it calls SlruScanDirectory to delete files...
>> 3.  Run VACUUM FREEZE in all databases (including template0).  datminmxid moves.
>> 4.  Create lots of new multixacts.  pg_multixact/members now contains
>> 82041 files and the last one is named 14078 (ie one extra segment,
>> with the highest possible segment number, which couldn't be created
>> before vacuuming because of the one segment gap enforced by
>> DetermineSafeOldestOffset).  Segments 0000-0016 have new modified
>> times.
>> 5.  ... allow the checkpoint started in step 2 to continue.  It
>> deletes segments, keeping only 0000-0016.  The segment 14078 which
>> contained active member data has been incorrectly deleted.
>
> OK. So the next question is: if you then apply the other patch, does that prevent step 4 and thereby avoid
catastrophe?

Yes, in a quick test, at step 4 I couldn't proceed.  I need to prod
this some more on Monday, and also see how it interacts with
autovacuum's view of what work needs to be done.

Here is my attempt at a summary.  In master, we have 3 arbitrarily
overlapping processes:

1.  VACUUM advances oldest multixact and member tail.
2.  CHECKPOINT observes member tail and head (separately) and then
deletes storage.
3.  Regular transaction observes tail, checks boundary and advances head.

Information flows from 1 to 2, from 3 to 2, and from 1 to 3.  2
doesn't have a consistent view of head and tail, and doesn't prevent
them moving while deleting storage, so the effect is that we can
delete the wrong range of storage.

With the patch, we have 3 arbitrarily overlapping processes:

1.  VACUUM advances oldest multixact.
2.  CHECKPOINT observes oldest multixact, deletes storage and then
advances member tail.
3.  Regular transaction observes member tail, checks boundary and
advances member head.

Information flows from 1 to 2 and from 2 to 3.  Although 2 works with
a snapshot of the oldest multixact which may move before it deletes
storage, 2 knows that the member tail isn't moving (that is its job),
and that 3 can't move the head past the tail (or rather the stop
limit which is the tail minus a gap), so the effect of using an out of
date oldest multixact is that we err on the side of being too
conservative with our allocation of member space, which is good.

I suppose you could have a workload that eats member space really fast
and checkpoints too infrequently so that you run out of space before a
checkpoint advances the tail.  I think that is why you were suggesting
triggering checkpoints automatically in some cases.  But I think that
would be a pretty insane workload (I can't convince my computer to
consume 2^32 member elements in under a couple of hours using the
pathological explode_mxact_members.c workload, and you can't set
checkpoint time above 1 hour).

--
Thomas Munro
http://www.enterprisedb.com
On Fri, May 8, 2015 at 10:46 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> I think the fix is something like "if nextMXact == oldestMultiXactId,
>>> then there are no active multixacts, so the offsetStopLimit should be
>>> set to nextOffset - (a segment's worth)".
>>
>> Makes sense.
>
> Here's a patch that attempts to implement this.

I've now committed and back-patched this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, May 10, 2015 at 9:41 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Sun, May 10, 2015 at 12:43 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> OK. So the next question is: if you then apply the other patch, does that prevent step 4 and thereby avoid
catastrophe?
>
> Yes, in a quick test, at step 4 I couldn't proceed.  I need to prod
> this some more on Monday, and also see how it interacts with
> autovacuum's view of what work needs to be done.

The code in master which handles regular autovacuums seems correct
with this patch, because it measures member space usage by calling
find_multixact_start itself with the oldest multixact ID (it's not
dependent on anything that is updated at checkpoint time).

The code in the patch at
http://www.postgresql.org/message-id/CA+TgmobbaQpE6sNqT30+rz4UMH5aPraq20gko5xd2ZGajz1-Jg@mail.gmail.com
would become wrong though, because it would use the (new) variable
MultiXactState->oldestOffset (set at checkpoint) to measure the used
member space.  That means it would repeatedly launch autovacuums, even
after clearing away the problem and advancing the oldest multixact ID,
until the next checkpoint updates that value.  In other words, it
can't see its own progress immediately (which is the right approach
for blocking new multixact generation, ie defer until
checkpoint/truncation, but the wrong approach for triggering
autovacuums).

I think vacuum (SetMultiXactIdLimit?) needs to update oldestOffset,
not checkpoint (DetermineSafeOldestOffset).  (The reason for wanting
this new value in shared memory is because GetNewMultiXactId needs to
be able to check it cheaply for every call, so calling
find_multixact_start every time would presumably not fly).

--
Thomas Munro
http://www.enterprisedb.com
On Mon, May 11, 2015 at 2:45 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Sun, May 10, 2015 at 9:41 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> On Sun, May 10, 2015 at 12:43 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> OK. So the next question is: if you then apply the other patch, does that prevent step 4 and thereby avoid
catastrophe?
>>
>> Yes, in a quick test, at step 4 I couldn't proceed.  I need to prod
>> this some more on Monday, and also see how it interacts with
>> autovacuum's view of what work needs to be done.
>
> The code in master which handles regular autovacuums seems correct
> with this patch, because it measures member space usage by calling
> find_multixact_start itself with the oldest multixact ID (it's not
> dependent on anything that is updated at checkpoint time).
>
> The code in the patch at
> http://www.postgresql.org/message-id/CA+TgmobbaQpE6sNqT30+rz4UMH5aPraq20gko5xd2ZGajz1-Jg@mail.gmail.com
> would become wrong though, because it would use the (new) variable
> MultiXactState->oldestOffset (set at checkpoint) to measure the used
> member space.  That means it would repeatedly launch autovacuums, even
> after clearing away the problem and advancing the oldest multixact ID,
> until the next checkpoint updates that value.  In other words, it
> can't see its own progress immediately (which is the right approach
> for blocking new multixact generation, ie defer until
> checkpoint/truncation, but the wrong approach for triggering
> autovacuums).
>
> I think vacuum (SetMultiXactIdLimit?) needs to update oldestOffset,
> not checkpoint (DetermineSafeOldestOffset).  (The reason for wanting
> this new value in shared memory is because GetNewMultiXactId needs to
> be able to check it cheaply for every call, so calling
> find_multixact_start every time would presumably not fly).

Here's a new version of the patch to do that.  As before, it tracks
the oldest offset in shared memory, but now that is updated in
SetMultiXactIdLimit, so it is always updated at the same time as
MultiXactState->oldestMultiXactId (at startup and after full scan
vacuums).

The value is used in the following places:

1.  GetNewMultiXactId uses it to see if it needs to send
PMSIGNAL_START_AUTOVAC_LAUNCHER to request autovacuums even if
autovacuum is set to off.  That is the main purpose of this patch.
(GetNewMultiXactId *doesn't* use it for member wraparound prevention:
that is based on offsetStopLimit, set by checkpoint code after
truncation of physical storage.)

2.  SetMultiXactIdLimit itself also uses it to send a
PMSIGNAL_START_AUTOVAC_LAUNCHER signal to the postmaster (according to
comments this allows immediately doing some more vacuuming upon
completion if necessary).

3.  ReadMultiXactCounts, called by regular vacuum and autovacuum,
rather than doing its own call to find_multixact_start, now also reads
it from shared memory.  (Incidentally the code this replaces suffers
from the problem fixed elsewhere: it can call find_multixact_start for
a multixact that doesn't exist yet).

Vacuum runs as expected with autovacuum off.  Do you think we
should be using MULTIXACT_MEMBER_DANGER_THRESHOLD as the trigger level
for forced vacuums instead of MULTIXACT_MEMBER_SAFE_THRESHOLD, or
something else?

--
Thomas Munro
http://www.enterprisedb.com

Attachment
On Mon, May 11, 2015 at 7:56 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> On Mon, May 11, 2015 at 2:45 PM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> On Sun, May 10, 2015 at 9:41 AM, Thomas Munro
>> <thomas.munro@enterprisedb.com> wrote:
>>> On Sun, May 10, 2015 at 12:43 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>>> OK. So the next question is: if you then apply the other patch, does that prevent step 4 and thereby avoid
catastrophe?
>>>
>>> Yes, in a quick test, at step 4 I couldn't proceed.  I need to prod
>>> this some more on Monday, and also see how it interacts with
>>> autovacuum's view of what work needs to be done.
>>
>> The code in master which handles regular autovacuums seems correct
>> with this patch, because it measures member space usage by calling
>> find_multixact_start itself with the oldest multixact ID (it's not
>> dependent on anything that is updated at checkpoint time).
>>
>> The code in the patch at
>> http://www.postgresql.org/message-id/CA+TgmobbaQpE6sNqT30+rz4UMH5aPraq20gko5xd2ZGajz1-Jg@mail.gmail.com
>> would become wrong though, because it would use the (new) variable
>> MultiXactState->oldestOffset (set at checkpoint) to measure the used
>> member space.  That means it would repeatedly launch autovacuums, even
>> after clearing away the problem and advancing the oldest multixact ID,
>> until the next checkpoint updates that value.  In other words, it
>> can't see its own progress immediately (which is the right approach
>> for blocking new multixact generation, ie defer until
>> checkpoint/truncation, but the wrong approach for triggering
>> autovacuums).
>>
>> I think vacuum (SetMultiXactIdLimit?) needs to update oldestOffset,
>> not checkpoint (DetermineSafeOldestOffset).  (The reason for wanting
> this new value in shared memory is because GetNewMultiXactId needs to
>> be able to check it cheaply for every call, so calling
>> find_multixact_start every time would presumably not fly).
>
> Here's a new version of the patch to do that.  As before, it tracks
> the oldest offset in shared memory, but now that is updated in
> SetMultiXactIdLimit, so it is always updated at the same time as
> MultiXactState->oldestMultiXactId (at startup and after full scan
> vacuums).
>
> The value is used in the following places:
>
> 1.  GetNewMultiXactId uses it to see if it needs to send
> PMSIGNAL_START_AUTOVAC_LAUNCHER to request autovacuums even if
> autovacuum is set to off.  That is the main purpose of this patch.
> (GetNewMultiXactId *doesn't* use it for member wraparound prevention:
> that is based on offsetStopLimit, set by checkpoint code after
> truncation of physical storage.)
>
> 2.  SetMultiXactIdLimit itself also uses it to send a
> PMSIGNAL_START_AUTOVAC_LAUNCHER signal to the postmaster (according to
> comments this allows immediately doing some more vacuuming upon
> completion if necessary).
>
> 3.  ReadMultiXactCounts, called by regular vacuum and autovacuum,
> rather than doing its own call to find_multixact_start, now also reads
> it from shared memory.  (Incidentally the code this replaces suffers
> from the problem fixed elsewhere: it can call find_multixact_start for
> a multixact that doesn't exist yet).
>
> Vacuum runs as expected with autovacuum off.

Great.  I've committed this and back-patched it to 9.3, after making
your code look a little more like what I already committed for the
same task, and whacking the comments around.

> Do you think we
> should be using MULTIXACT_MEMBER_DANGER_THRESHOLD as the trigger level
> for forced vacuums instead of MULTIXACT_MEMBER_SAFE_THRESHOLD, or
> something else?

No, I think you have it right.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company