Thread: 9.3: more problems with "Could not open file "pg_multixact/members/xxxx"

9.3: more problems with "Could not open file "pg_multixact/members/xxxx"

From: Jeff Janes
Date:
On Fri, Jun 27, 2014 at 11:51 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Jeff Janes wrote:

> This problem was initially fairly easy to reproduce, but since I
> started adding instrumentation specifically to catch it, it has become
> devilishly hard to reproduce.
>
> I think my next step will be to also log each of the values which goes
> into the complex if (...) expression that decides on the deletion.

Could you please try to reproduce it after updating to the latest?  I pushed
fixes that should close these issues.  Maybe you want to remove the
instrumentation you added, to make failures more likely.

There are still some problems in 9.4, but I haven't been able to diagnose them and wanted to do more research on them.  The announcement of upcoming back-branch releases for 9.3 spurred me to try it there, and I see problems with 9.3 (12c5bbdcbaa292b2a4b09d298786) as well.  The move of truncation to the checkpoint seems to have made the problem easier to reproduce.  On an 8-core machine, this test fell over after about 20 minutes, which is much sooner than the problem usually takes to reproduce.

This is the error I get:

2084 UPDATE 2014-07-15 15:26:20.608 PDT:ERROR:  could not access status of transaction 85837221
2084 UPDATE 2014-07-15 15:26:20.608 PDT:DETAIL:  Could not open file "pg_multixact/members/14031": No such file or directory.
2084 UPDATE 2014-07-15 15:26:20.608 PDT:CONTEXT:  SQL statement "SELECT 1 FROM ONLY "public"."foo_parent" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"

The testing harness is attached as 3 patches that must be applied to the test server, and 2 scripts.  The script do.sh sets up the database (using fixed paths, so be careful) and then invokes count.pl in a loop to do the actual work.


Cheers,

Jeff

Re: 9.3: more problems with "Could not open file "pg_multixact/members/xxxx"

From: Alvaro Herrera
Date:
I'm not saying there is no multixact bug here, but I wonder if this part
of your crasher patch might be the cause:

--- 754,771 ----
                   errmsg("could not seek to block %u in file \"%s\": %m",
                          blocknum, FilePathName(v->mdfd_vfd))));
  
!     if (JJ_torn_page > 0 && counter++ > JJ_torn_page && !RecoveryInProgress()) {
!         nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ/3);
!         ereport(FATAL,
!                 (errcode(ERRCODE_DISK_FULL),
!                  errmsg("could not write block %u of relation %s: wrote only %d of %d bytes",
!                         blocknum,
!                         relpath(reln->smgr_rnode, forknum),
!                         nbytes, BLCKSZ),
!                  errhint("JJ is screwing with the database.")));
!     } else {
!         nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ);
!     }

Wouldn't this BLCKSZ/3 business update the page's LSN but not the full
contents, meaning that the block wouldn't be rewritten the next time
the xlog replays?  That could leave the upper two-thirds of the block
containing multixacts in xmax that had been removed by a vacuuming
round prior to the crash.

(Maybe I'm just too tired and I'm failing to fully understand the torn
page protection.  I thought I understood how it worked, but now I'm not
sure -- I mean I don't see how it can possibly have any value at all.
Surely if the disk writes the first 512-byte sector of the page and then
forgets the updates to the next 15 sectors, the page will appear as not
needing the full page image to be restored ...)

Is the page containing the borked multixact value the one that was
half-written by this code?

Is the problem reproducible if you cause this path to ereport(FATAL)
without writing 1/3rd of the page?
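
To make the torn-write scenario concrete, here is a minimal standalone sketch (the page struct and constants are illustrative only, not PostgreSQL's actual page layout): if only the first sector of a write reaches disk, the LSN in that sector looks current while the remaining sectors keep the old contents.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define SECTOR 512
#define PAGE   8192   /* BLCKSZ */

typedef struct
{
    uint64_t lsn;                          /* lives in the first sector */
    char     payload[PAGE - sizeof(uint64_t)];
} Page;

int main(void)
{
    Page on_disk, in_memory;

    memset(&on_disk, 0, sizeof on_disk);
    on_disk.lsn = 100;                     /* old page version on disk */

    in_memory = on_disk;
    in_memory.lsn = 200;                   /* new version being written out */
    memset(in_memory.payload, 'x', sizeof in_memory.payload);

    /* Torn write: only the first 512-byte sector reaches the platter. */
    memcpy(&on_disk, &in_memory, SECTOR);

    /* An LSN-based check now sees a current page... */
    printf("LSN on disk: %llu\n", (unsigned long long) on_disk.lsn);
    /* ...but the tail sectors were never rewritten. */
    printf("tail byte %s updated\n",
           on_disk.payload[8000] == 'x' ? "was" : "was NOT");
    return 0;
}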

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: 9.3: more problems with "Could not open file "pg_multixact/members/xxxx"

From: Robert Haas
Date:

On Wed, Jul 16, 2014 at 12:46 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> (Maybe I'm just too tired and I'm failing to fully understand the torn
> page protection.  I thought I understood how it worked, but now I'm not
> sure -- I mean I don't see how it can possibly have any value at all.
> Surely if the disk writes the first 512-byte sector of the page and then
> forgets the updates to the next 15 sectors, the page will appear as not
> needing the full page image to be restored ...)

We always restore full page images, regardless of the page LSN.
Otherwise, we'd have exactly the problem you describe.
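
A minimal sketch of that rule, with invented names (this is not PostgreSQL's actual redo code): a record carrying a full-page image overwrites the page unconditionally, and the page-LSN comparison gates only incremental records.

#include <stdio.h>
#include <string.h>

#define BLCKSZ 8192

typedef struct { unsigned long lsn; char data[BLCKSZ]; } Page;

/* fpi is NULL for incremental records. */
static void
redo(Page *page, unsigned long rec_lsn, const char *fpi)
{
    if (fpi != NULL)
    {
        /* A torn page can carry a "new" LSN over stale contents, so the
         * image is restored without consulting page->lsn at all. */
        memcpy(page->data, fpi, BLCKSZ);
        page->lsn = rec_lsn;
    }
    else if (page->lsn < rec_lsn)
    {
        page->lsn = rec_lsn;   /* an incremental change would be applied here */
    }
}

int main(void)
{
    Page p = { .lsn = 200 };   /* page LSN looks newer than the record */
    char image[BLCKSZ];

    memset(image, 'x', sizeof image);
    redo(&p, 150, image);      /* the FPI is applied despite the LSN */
    printf("restored byte: %c\n", p.data[0]);
    return 0;
}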

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: 9.3: more problems with "Could not open file "pg_multixact/members/xxxx"

From: Jeff Janes
Date:

On Tue, Jul 15, 2014 at 3:58 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
[...]

Sorry, after a long time when I couldn't do much testing on this, I've now been able to get back to it.

It looks like what is happening is that checkPoint.nextMultiOffset wraps around from 2^32 to 0, even if offset 0 is still being used.  At that point it starts deleting member files that are still needed.

Is there some interlock that is supposed to prevent checkPoint.nextMultiOffset from lapping itself?  I haven't been able to find one.  It seems like the interlock applies only to the MultiXactId, not the offset.
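
To illustrate the arithmetic (the segment size below is an assumption for the example, not the real constant derived in multixact.c): once the 32-bit offset counter wraps, nothing in the offset itself distinguishes "far ahead" from "lapped", so a truncation keyed on the wrapped value can treat segments holding still-needed offsets as dead.

#include <stdint.h>
#include <stdio.h>

/* Assumed members-per-segment value, for illustration only. */
#define MEMBERS_PER_SEGMENT 50000u

int main(void)
{
    uint32_t oldest_offset = 1000;        /* still referenced by a live multixact */
    uint32_t next_offset   = UINT32_MAX;  /* nextMultiOffset about to wrap */

    next_offset += 2001;                  /* unsigned wraparound: now 2000 */

    /* next_offset has lapped oldest_offset; both now map to segment 0,
     * and a truncation pass keyed on next_offset can unlink segments
     * that still hold needed members. */
    printf("next_offset wrapped to %u (segment %u)\n",
           next_offset, next_offset / MEMBERS_PER_SEGMENT);
    printf("oldest needed offset %u (segment %u)\n",
           oldest_offset, oldest_offset / MEMBERS_PER_SEGMENT);
    return 0;
}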

Thanks,

Jeff

Re: 9.3: more problems with "Could not open file "pg_multixact/members/xxxx"

From: Andres Freund
Date:
On August 19, 2014 10:24:20 PM CEST, Jeff Janes <jeff.janes@gmail.com> wrote:
> [...]
> Sorry, after a long time when I couldn't do much testing on this, I've
> now been able to get back to it.
>
> It looks like what is happening is that checkPoint.nextMultiOffset
> wraps around from 2^32 to 0, even if offset 0 is still being used.  At
> that point it starts deleting member files that are still needed.
>
> Is there some interlock that is supposed to prevent
> checkPoint.nextMultiOffset from lapping itself?  I haven't been able to
> find one.  It seems like the interlock applies only to the MultiXactId,
> not the offset.

There is none (and there never has been one either).  I've complained
about it a couple of times, but nobody, me included, has had the time
and energy to fix it :(
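
For what it's worth, a hypothetical sketch of the kind of guard being discussed (the function, names, and placement are invented for illustration, not actual PostgreSQL code) might bound allocations by the modulo-2^32 headroom between the next offset and the oldest offset still needed:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical check: would allocating nmembers member slots starting
 * at next_offset lap the oldest offset still referenced by a live
 * multixact?  All arithmetic is modulo 2^32, matching the unsigned
 * wraparound of the offset counter. */
static bool
would_lap(uint32_t next_offset, uint32_t oldest_needed, uint32_t nmembers)
{
    /* slots free from next_offset up to (but excluding) oldest_needed */
    uint32_t headroom = oldest_needed - next_offset;

    return nmembers >= headroom;
}

int main(void)
{
    uint32_t oldest = 1000;               /* still needed by a live multixact */
    uint32_t next   = UINT32_MAX - 500;   /* counter close to wrapping */

    /* headroom here is 1501 slots (501 before the wrap, 1000 after) */
    printf("1500 members lap? %d\n", would_lap(next, oldest, 1500)); /* 0 */
    printf("2000 members lap? %d\n", would_lap(next, oldest, 2000)); /* 1 */
    return 0;
}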
 

Andres


-- 
Please excuse brevity and formatting - I am writing this on my mobile phone.