
From Thomas Munro
Subject Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)
Date
Msg-id CAEepm=2AUwgy0dZMAXsQZPiRYAqW7x1k0kUbd5nZYUjCbthzQw@mail.gmail.com
In response to Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)  (Thomas Munro <thomas.munro@enterprisedb.com>)
Responses Re: Re: BUG #12990: Missing pg_multixact/members files (appears to have wrapped, then truncated)  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-bugs
On Tue, Apr 21, 2015 at 12:25 PM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Hi Alvaro
>
> On Tue, Apr 21, 2015 at 7:04 AM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
>> Here's a patch.  I have tested locally and it closes the issue for me.
>> If those affected can confirm that it stops the file removal from
>> happening, I'd appreciate it.
>
> I was also starting to look at this problem.  For what it's worth,
> here's a client program that I used to generate a lot of multixact
> members.  The patch seems to work correctly so far: as the offset
> approached wraparound, I saw the warnings first with appropriate OID
> and members remaining, and then I was blocked from creating new
> multixacts.

One thing I noticed about your patch is that it effectively halves the
number of multixact members you can have on disk.  Sure, I'd rather
hit an error at 2^31 members than a corrupt database at 2^32 members,
but I wondered if we should try to allow the full range to be used.
I'm not sure whether there is a valid use case for such massive
amounts of pg_multixact/members data (or at least one that won't go
away if autovacuum heuristics are changed in a later patch; I also
understand that there are other recent patches that reduce member
traffic), but if the plan is to backpatch this patch then it should
ideally not halve the amount of an important resource available in an
existing system when people do a point upgrade.
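
To spell out where that 2^31 limit comes from: the xid-style
comparison interprets the unsigned difference between two offsets as
a signed 32-bit number, so it only gives a meaningful answer while
the offsets in use span less than half of the 2^32 space.  Here's a
toy standalone program (just my reading of how a
MultiXactOffsetPrecedes-style comparison behaves, not code lifted
from the tree) showing the answer flipping once two offsets are more
than 2^31 apart:

#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

typedef uint32_t MultiXactOffset;

/* xid-style comparison: treat the unsigned difference as signed */
static bool
toy_offset_precedes(MultiXactOffset a, MultiXactOffset b)
{
    int32_t diff = (int32_t) (a - b);

    return diff < 0;
}

int
main(void)
{
    /* fine while the two offsets are less than 2^31 apart ... */
    printf("%d\n", toy_offset_precedes(100, 200));                /* 1 */
    /* ... but the answer flips once they are further apart than that */
    printf("%d\n", toy_offset_precedes(100, 100 + 0x80000001u));  /* 0 */

    return 0;
}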

Here's a small patch (that applies on top of your patch) to show how
this could be done, using three-way comparisons with an explicit
boundary to detect wraparound.  There may be other technical problems
(for example, MultiXactAdvanceNextMXact still uses
MultiXactOffsetPrecedes), or this may be a bad idea simply because it
breaks with the well-established convention for wraparound detection
used for xids.
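
In case a concrete illustration helps, here's a rough standalone
sketch of the explicit-boundary idea (the idea only, not the attached
patch): an offset counts as live if it falls in the half-open range
from the oldest offset still needed up to nextOffset, with the range
allowed to wrap past 0xffffffff, so that unlike the signed-difference
comparison above the full 2^32 member space can be used:

#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

typedef uint32_t MultiXactOffset;

/* true if "off" lies in [oldest, next), a range that may wrap past
 * 0xffffffff */
static bool
toy_offset_in_live_range(MultiXactOffset off,
                         MultiXactOffset oldest, /* oldest still needed */
                         MultiXactOffset next)   /* next to be assigned */
{
    if (oldest <= next)
        return off >= oldest && off < next;

    /* the live range wraps around the 2^32 boundary */
    return off >= oldest || off < next;
}

int
main(void)
{
    /* live range wraps: oldest near the top, next just past zero */
    MultiXactOffset oldest = 0xFFFFFF00u;
    MultiXactOffset next = 0x00000100u;

    printf("%d\n", toy_offset_in_live_range(0xFFFFFFFFu, oldest, next)); /* 1 */
    printf("%d\n", toy_offset_in_live_range(0x00000050u, oldest, next)); /* 1 */
    printf("%d\n", toy_offset_in_live_range(0x80000000u, oldest, next)); /* 0 */

    return 0;
}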

Also, I wanted to make sure I could reproduce the original
bug/corruption in unpatched master with the client program I posted.
Here are my notes on doing that (sorry if they belabour the obvious;
partly this is just me learning how SLRUs and multixacts work...):

========

Member wraparound happens after segment file "14078": assuming the
default page size, you get 32 pages per segment and 1636 members per
page (409 groups of 4 members plus some extra flag data), the maximum
member offset wraps after 0xffffffff, and 0xffffffff / 1636 / 32 =
82040 = 0x14078 (incidentally, that final segment is a shorter one).
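
As a quick sanity check of that arithmetic, here's a tiny standalone
program (the constants are assumed to match the default 8kB page
size: 1636 member entries per page, 32 SLRU pages per segment):

#include <stdint.h>
#include <stdio.h>

#define MULTIXACT_MEMBERS_PER_PAGE 1636   /* assumes default 8kB pages */
#define SLRU_PAGES_PER_SEGMENT     32

int
main(void)
{
    uint32_t last_offset = 0xFFFFFFFFu;   /* last possible member offset */
    uint32_t page = last_offset / MULTIXACT_MEMBERS_PER_PAGE;
    uint32_t segment = page / SLRU_PAGES_PER_SEGMENT;

    /* prints: last page 2625285, last segment file "14078" */
    printf("last page %u, last segment file \"%04X\"\n",
           (unsigned) page, (unsigned) segment);

    return 0;
}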

Using my test client with 500 sessions and 35k loops, I observed
exactly that: it wrapped back around to writing member file "0000"
after creating "14078", which is obviously broken, because the start
of member segment "0000" holds the members of multixact ID 1, which
was still in play (it was datminmxid for template0).

Looking at the members of multixact ID 1 I see recent xids:

postgres=# select pg_get_multixact_members('1'::xid);
 pg_get_multixact_members
--------------------------
 (34238661,sh)
 (34238662,sh)
(2 rows)

Note that pg_get_multixact_members knows the correct *number* of
members for multixact ID 1; it's just that it's looking at the
members of some much later multixact.  By a tedious binary search I
found it:

postgres=# select pg_get_multixact_members('17094780'::xid);
 pg_get_multixact_members
--------------------------
 ... snip ...
 (34238660,sh)
 (34238661,sh) <-- here they are!
 (34238662,sh) <--
 (34238663,sh)
 ... snip ...

After a checkpoint, I saw that all the files got deleted except for a
few consecutively named files starting at "0000", which would be
correct behavior in general, if we hadn't allowed the member offset to
wrap.  It had correctly kept the segments starting with the one
holding the members of multixact ID 1 (the cluster-wide oldest) up
until the one that corresponds to MultiXactState->nextOffset.  My test
program had blown right past member offset 0xffffffff and back to 0
and then kept going.  The truncation code isn't the problem per se.
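
For what it's worth, here's how I picture that keep-or-delete
decision at segment granularity (an illustrative sketch of my
understanding, not the real SLRU truncation code): a members segment
survives if it lies in the range from the segment holding the
cluster-wide oldest multixact's members up to the segment that
nextOffset points into, with that range allowed to wrap from the last
segment back around to "0000"; that is exactly why keeping only the
files starting at "0000" looked "correct" here, even though the data
in them had been overwritten:

#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

/* last members segment number, assuming default 8kB pages (see above) */
#define LAST_MEMBERS_SEGMENT 0x14078u

static bool
toy_keep_members_segment(uint32_t seg, uint32_t oldest_seg,
                         uint32_t next_seg)
{
    if (oldest_seg <= next_seg)
        return seg >= oldest_seg && seg <= next_seg;

    /* the live span wraps past the last segment back around to "0000" */
    return seg >= oldest_seg || seg <= next_seg;
}

int
main(void)
{
    /* e.g. multixact ID 1's members live in "0000" and nextOffset
     * currently falls in segment "001F": keep 0000..001F, drop the rest */
    printf("%d\n", toy_keep_members_segment(0x0000, 0x0000, 0x001F)); /* 1 */
    printf("%d\n", toy_keep_members_segment(0x0020, 0x0000, 0x001F)); /* 0 */
    printf("%d\n", toy_keep_members_segment(LAST_MEMBERS_SEGMENT,
                                            0x0000, 0x001F));         /* 0 */

    return 0;
}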

To produce the specific error message seen by the bug reporter via
normal interactions from a test program, I think we need some magic
that I haven't figured out how to do yet: we need to run a query that
accesses a multixact whose member offset is from before the offset
wraparound (e.g. near 0xffffffff), but whose members are not on a
page that is still in memory, after a checkpoint has unlinked the
segment file, so that it tries to load the segment and discovers that
the file is missing!  So, a pretty complex interaction of concurrent
processes, timing and caches.

We can more artificially stimulate the error by explicitly asking for
multixact members like this though:

postgres=# select pg_get_multixact_members('10000000'::xid);
ERROR:  could not access status of transaction 10000000
DETAIL:  Could not open file "pg_multixact/members/BB55": No such file
or directory.

That's a totally valid multixact ID: obviously so, since it was able
to figure out which segment to look in for its members.

Here's one that tries to open the segment that comes immediately
before "0000" in modulo numbering:

postgres=# select pg_get_multixact_members('17094770'::xid);
ERROR:  could not access status of transaction 17094770
DETAIL:  Could not open file "pg_multixact/members/14078": No such
file or directory.

If I try it with 17094779, the multixact ID immediately before the
one that has overwritten "0000", it does actually work, presumably
because its pages happen to be buffered for me, so it doesn't try to
open the file (guessing here).

I don't currently believe it's necessary to reproduce that step via a
test program anyway; the root problem is clear enough just from
watching the thing wrap.

--
Thomas Munro
http://www.enterprisedb.com

