Re: [BUGS] BUG #8673: Could not open file "pg_multixact/members/xxxx" on slave during hot_standby - Mailing list pgsql-hackers

From Jeff Janes
Subject Re: [BUGS] BUG #8673: Could not open file "pg_multixact/members/xxxx" on slave during hot_standby
Date
Msg-id CAMkU=1wKqmmvCYWWDb1Pip8G1YZY7BOiSSNasqvUMEN5N_rpLQ@mail.gmail.com
Whole thread Raw
Responses Re: [BUGS] BUG #8673: Could not open file "pg_multixact/members/xxxx" on slave during hot_standby
Re: [BUGS] BUG #8673: Could not open file "pg_multixact/members/xxxx" on slave during hot_standby
List pgsql-hackers
(Redirected from Bugs to Hackers, as it seems not to be the same bug
as the original one.)

On Mon, Jun 9, 2014 at 3:50 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Jeff Janes wrote:
>
>> This patch seems to solve a problem I've also been having with non-existent
>> "pg_multixact/members/13D35" files in my testing of torn page write and fk
>> locks again recent 9.4.  However, sometimes I've found the problem even
>> before multi wrap around occured, and my reading of this thread suggests
>> that the current issue would not show up in that case, so I don't know how
>> much comfort to take in the knowledge that the problem seems to have gone
>> away--as they say, things that go away by themselves without explanation
>> also tend to come back by themselves.
>
> I think this problem is rather related to the code touched by commit
> 722acf51a0d074d19782ad7e97ebe3fdb63fbb87, which is about removing
> unneeded files in pg_multixact/members.  I checked your files, and it
> seems that pg_multixact/members files that should still exist are gone;
> for instance, pg_controldata says that the oldest multi is 2586314407,
> but if I run
> select * from pg_get_multixact_members('that value')
> I get an error saying that pg_multixact/members/218B does not exist.
>
> It seems that there are a lot of files that were improperly removed.
> I expect that existing files should be sequential, but there are a lot
> of holes in a listing of pg_multixact/members.

I think the holes are because a truncation was in progress and
interrupted by a 'pg_ctl stop -mi' before it completed. The files are
considered in an apparently random order, so if it is interrupted
midway through it will leave a collection of holes that would have
been closed up had it been allowed to continue.

I instrumented the code to log file deletion (attached patch) and I
see that the it starts complaining about the absence of 5183 just a
fraction of a second after it was deleted, so the deletion must have
been inappropriate.

12538  2014-06-17 12:10:02.925 PDT:LOG:  JJ deleting 0xb66b20 5183
12498 UPDATE 2014-06-17 12:10:03.188 PDT:DETAIL:  Could not open file
"pg_multixact/members/5183": No such file or directory.
12561 UPDATE 2014-06-17 12:10:04.473 PDT:DETAIL:  Could not open file
"pg_multixact/members/5183": No such file or directory.
12572 UPDATE 2014-06-17 12:10:04.475 PDT:DETAIL:  Could not open file
"pg_multixact/members/5183": No such file or directory.

This problem was initially fairly easy to reproduce, but since I
started adding instrumentation specifically to catch it, it has become
devilishly hard to reproduce.

I think my next step will be to also log each of the values which goes
into the complex if (...) expression that decides on the deletion.

Cheers,

Jeff

Attachment

pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: comparison operators
Next
From: Tom Lane
Date:
Subject: Re: comparison operators