Thread: Re: [BUGS] BUG #8673: Could not open file "pg_multixact/members/xxxx" on slave during hot_standby

(Redirected from Bugs to Hackers, as it seems not to be the same bug
as the original one.)

On Mon, Jun 9, 2014 at 3:50 PM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Jeff Janes wrote:
>
>> This patch seems to solve a problem I've also been having with non-existent
>> "pg_multixact/members/13D35" files in my testing of torn page write and fk
>> locks against recent 9.4.  However, sometimes I've found the problem even
>> before multi wraparound occurred, and my reading of this thread suggests
>> that the current issue would not show up in that case, so I don't know how
>> much comfort to take in the knowledge that the problem seems to have gone
>> away--as they say, things that go away by themselves without explanation
>> also tend to come back by themselves.
>
> I think this problem is rather related to the code touched by commit
> 722acf51a0d074d19782ad7e97ebe3fdb63fbb87, which is about removing
> unneeded files in pg_multixact/members.  I checked your files, and it
> seems that pg_multixact/members files that should still exist are gone;
> for instance, pg_controldata says that the oldest multi is 2586314407,
> but if I run
> select * from pg_get_multixact_members('that value')
> I get an error saying that pg_multixact/members/218B does not exist.
>
> It seems that there are a lot of files that were improperly removed.
> I expect that existing files should be sequential, but there are a lot
> of holes in a listing of pg_multixact/members.

I think the holes are because a truncation was in progress and
interrupted by a 'pg_ctl stop -mi' before it completed. The files are
considered in an apparently random order, so if it is interrupted
midway through it will leave a collection of holes that would have
been closed up had it been allowed to continue.
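
As far as I can tell the segments are just visited in whatever order
readdir() returns them, so an interruption leaves an arbitrary subset
deleted.  A simplified sketch of the shape of that loop (illustrative
only, not the actual slru.c code):

/*
 * Simplified sketch, not the actual slru.c code: segments are visited
 * in whatever order readdir() returns them, so a 'pg_ctl stop -mi'
 * partway through leaves an arbitrary subset of them deleted.
 */
static void
DeleteOldSegments(const char *dir, int cutoffSegno)
{
    DIR        *dirdesc = opendir(dir);
    struct dirent *de;

    while ((de = readdir(dirdesc)) != NULL)
    {
        int         segno;
        char        path[MAXPGPATH];

        if (sscanf(de->d_name, "%X", &segno) != 1)
            continue;           /* not a segment file */

        /* the real code handles wraparound comparisons here */
        if (segno < cutoffSegno)
        {
            snprintf(path, MAXPGPATH, "%s/%s", dir, de->d_name);
            unlink(path);       /* a crash here leaves holes behind */
        }
    }
    closedir(dirdesc);
}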

I instrumented the code to log file deletion (attached patch) and I
see that it starts complaining about the absence of 5183 just a
fraction of a second after it was deleted, so the deletion must have
been inappropriate.

12538  2014-06-17 12:10:02.925 PDT:LOG:  JJ deleting 0xb66b20 5183
12498 UPDATE 2014-06-17 12:10:03.188 PDT:DETAIL:  Could not open file
"pg_multixact/members/5183": No such file or directory.
12561 UPDATE 2014-06-17 12:10:04.473 PDT:DETAIL:  Could not open file
"pg_multixact/members/5183": No such file or directory.
12572 UPDATE 2014-06-17 12:10:04.475 PDT:DETAIL:  Could not open file
"pg_multixact/members/5183": No such file or directory.

This problem was initially fairly easy to reproduce, but since I
started adding instrumentation specifically to catch it, it has become
devilishly hard to reproduce.

I think my next step will be to also log each of the values that go
into the complex if (...) expression that decides on the deletion.

Cheers,

Jeff

Jeff Janes wrote:

> 12538  2014-06-17 12:10:02.925 PDT:LOG:  JJ deleting 0xb66b20 5183
> 12498 UPDATE 2014-06-17 12:10:03.188 PDT:DETAIL:  Could not open file
> "pg_multixact/members/5183": No such file or directory.
> 12561 UPDATE 2014-06-17 12:10:04.473 PDT:DETAIL:  Could not open file
> "pg_multixact/members/5183": No such file or directory.
> 12572 UPDATE 2014-06-17 12:10:04.475 PDT:DETAIL:  Could not open file
> "pg_multixact/members/5183": No such file or directory.
> 
> This problem was initially fairly easy to reproduce, but since I
> started adding instrumentation specifically to catch it, it has become
> devilishly hard to reproduce.

I think I see the problem here now, after letting this test rig run for
a while.

First, the fact that there are holes in members/ files because of the
random order in deletion, in itself, seems harmless, because the files
remaining in between will be deleted by a future vacuum.

Now, the real problem is that we delete files during vacuum, but the
state that marks those files as safely deletable is written as part of a
checkpoint record, not by vacuum itself (vacuum writes its state in
pg_database, but a checkpoint derives its info from a shared memory
variable.)  Taken together, this means that if there's a crash between
the vacuum that does a deletion and the next checkpoint, we might
attempt to read an offset file that is not supposed to be part of the
live range -- but we forgot that because we didn't reach the point where
we save the shmem state to disk.

It seems to me that we need to keep the offsets files around until a
checkpoint has written the "oldest" number to WAL.  In other words we
need additional state in shared memory: (a) what we currently store
which is the oldest number as computed by vacuum (not safe to delete,
but it's the number that the next checkpoint must write), and (b) the
oldest number that the last checkpoint wrote (the safe deletion point).
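
In rough terms, something like this in MultiXactStateData (a sketch
only; the new field name is a straw-man):

/* Sketch only; the second field name is a straw-man. */
typedef struct MultiXactStateData
{
    /* ... existing fields ... */

    /*
     * (a) oldest multi as computed by vacuum.  The next checkpoint will
     * write this value, but it is not yet safe to truncate up to it.
     */
    MultiXactId oldestMultiXactId;

    /*
     * (b) oldest multi written by the last completed checkpoint; this is
     * the safe truncation point.
     */
    MultiXactId lastCheckpointedOldest;
} MultiXactStateData;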

Another thing I noticed is that more than one vacuum process can try to
run deletion simultaneously, at least if they crash frequently while
they were doing deletion.  I don't see that this is troublesome, even
though they might attempt to delete the same files.

Finally, I noticed that we first read the oldest offset file, then
determine the member files to delete; then delete offset files, then
delete member files.  Which means that we would delete offset files that
correspond to member files that we keep (assuming there is a crash in
the middle of deletion).  It seems to me we should first delete members,
then delete offsets, if we wanted to be very safe about it.  I don't
think this really would matter much, if we were to do things safely as
described above.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



On 2014-06-20 17:38:16 -0400, Alvaro Herrera wrote:
> Jeff Janes wrote:
> 
> > 12538  2014-06-17 12:10:02.925 PDT:LOG:  JJ deleting 0xb66b20 5183
> > 12498 UPDATE 2014-06-17 12:10:03.188 PDT:DETAIL:  Could not open file
> > "pg_multixact/members/5183": No such file or directory.
> > 12561 UPDATE 2014-06-17 12:10:04.473 PDT:DETAIL:  Could not open file
> > "pg_multixact/members/5183": No such file or directory.
> > 12572 UPDATE 2014-06-17 12:10:04.475 PDT:DETAIL:  Could not open file
> > "pg_multixact/members/5183": No such file or directory.
> > 
> > This problem was initially fairly easy to reproduce, but since I
> > started adding instrumentation specifically to catch it, it has become
> > devilishly hard to reproduce.
> 
> I think I see the problem here now, after letting this test rig run for
> a while.
> 
> First, the fact that there are holes in members/ files because of the
> random order in deletion, in itself, seems harmless, because the files
> remaining in between will be deleted by a future vacuum.
> 
> Now, the real problem is that we delete files during vacuum, but the
> state that marks those files as safely deletable is written as part of a
> checkpoint record, not by vacuum itself (vacuum writes its state in
> pg_database, but a checkpoint derives its info from a shared memory
> variable.)  Taken together, this means that if there's a crash between
> the vacuum that does a deletion and the next checkpoint, we might
> attempt to read an offset file that is not supposed to be part of the
> live range -- but we forgot that because we didn't reach the point where
> we save the shmem state to disk.

> It seems to me that we need to keep the offsets files around until a
> checkpoint has written the "oldest" number to WAL.  In other words we
> need additional state in shared memory: (a) what we currently store
> which is the oldest number as computed by vacuum (not safe to delete,
> but it's the number that the next checkpoint must write), and (b) the
> oldest number that the last checkpoint wrote (the safe deletion point).

Why not just WAL log truncations? If we'd emit the WAL record after
determining the offsets page we should be safe I think? That seems like
an easier and more robust fix? And it's what e.g. the clog does.
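
Something along the lines of clog's WriteTruncateXlogRec(), i.e.
roughly (a sketch only; the multixact record type and helper name are
invented here):

/*
 * Sketch only, modeled on clog's WriteTruncateXlogRec()/CLOG_TRUNCATE;
 * the multixact record type and helper name are invented here.
 */
static void
WriteMXactTruncateXlogRec(int cutoffPage)
{
    XLogRecData rdata;

    rdata.data = (char *) (&cutoffPage);
    rdata.len = sizeof(int);
    rdata.buffer = InvalidBuffer;
    rdata.next = NULL;
    (void) XLogInsert(RM_MULTIXACT_ID, XLOG_MULTIXACT_TRUNCATE, &rdata);
}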

> Another thing I noticed is that more than one vacuum process can try to
> run deletion simultaneously, at least if they crash frequently while
> they were doing deletion.  I don't see that this is troublesome, even
> though they might attempt to delete the same files.

That actually seems bad to me. It might happen not to fail, but still.

> Finally, I noticed that we first read the oldest offset file, then
> determine the member files to delete; then delete offset files, then
> delete member files.  Which means that we would delete offset files that
> correspond to member files that we keep (assuming there is a crash in
> the middle of deletion).  It seems to me we should first delete members,
> then delete offsets, if we wanted to be very safe about it.  I don't
> think this really would matter much, if we were to do things safely as
> described above.

Part of that seems to be solvable by WAL logging truncations. But I
also think that we need a more robust tracking of the oldest member
offset - we still aren't safe against member wraparounds. And I don't
see how we can be without explicitly tracking the oldest member instead
of the ugly 'search for oldest offset segment and map that to members'
thing we're doing now.

Greetings,

Andres Freund

-- 
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Andres Freund wrote:
> On 2014-06-20 17:38:16 -0400, Alvaro Herrera wrote:

> > It seems to me that we need to keep the offsets files around until a
> > checkpoint has written the "oldest" number to WAL.  In other words we
> > need additional state in shared memory: (a) what we currently store
> > which is the oldest number as computed by vacuum (not safe to delete,
> > but it's the number that the next checkpoint must write), and (b) the
> > oldest number that the last checkpoint wrote (the safe deletion point).
> 
> Why not just WAL log truncations? If we'd emit the WAL record after
> determining the offsets page we should be safe I think? That seems like
> an easier and more robust fix? And it's what e.g. the clog does.

Yes, I think this whole thing would be simpler if we just wal-logged the
truncations, like pg_clog does.  But I would like to avoid doing that
for now, and do it only in 9.5 as future work.  As a backpatchable (to
9.4/9.3) fix, I propose we do the following:

1. Have vacuum update MultiXactState->oldestMultiXactId based on the
minimum value of pg_database->datminmxid.  Since this value is saved in
pg_control, it is restored from checkpoint replay during recovery.

2. Keep track of a new value, MultiXactState->lastCheckpointedOldest.
This value is updated by CreateCheckPoint in a primary server after the
checkpoint record has been flushed, and by xlog_redo in a hot standby, to
be the MultiXactState->oldestMultiXactId value that was last flushed.

3. TruncateMultiXact() no longer receives a parameter.  Files are
removed based on MultiXactState->lastCheckpointedOldest instead.  

4. Call TruncateMultiXact at checkpoint time, after the checkpoint WAL
record has been flushed, and at restartpoint time (just like today).
This means we only remove files that a prior checkpoint has already
registered as being no longer necessary.  Also, if a recovery is
interrupted before end of WAL (recovery target), the files are still
present.  So we no longer truncate during vacuum.

Another consideration for (4) is that right now we're only invoking
multixact truncation in a primary when we're able to advance
pg_database.datminmxid (see vac_update_datfrozenxid).  The problem is
that after a crash and subsequent recovery, pg_database might be updated
without removing pg_multixact files; this would mean that the next
opportunity to remove files would be far in the future, when the minimum
datminmxid is advanced again.  One way to fix that would be to have
every single call to vac_update_datfrozenxid() attempt multixact
truncation, but that seems wasteful since I expect vacuuming is more
frequent than checkpointing.
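
To make (2)-(4) a bit more concrete, a rough sketch (helper names and
locking details are illustrative, not final code):

/*
 * Rough sketch of (2)-(4); helper names and locking details are
 * illustrative, not final code.
 */

/* (2) after the checkpoint record has been flushed, remember which
 * oldest-multi value it carried */
void
MultiXactSetSafeTruncate(MultiXactId flushedOldest)
{
    LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
    MultiXactState->lastCheckpointedOldest = flushedOldest;
    LWLockRelease(MultiXactGenLock);
}

/*
 * (3) and (4): called from CreateCheckPoint (after the flush) and from
 * restartpoints; takes no parameter, truncation is driven solely by
 * what the last checkpoint wrote.
 */
void
TruncateMultiXact(void)
{
    MultiXactId oldestMXact;

    LWLockAcquire(MultiXactGenLock, LW_SHARED);
    oldestMXact = MultiXactState->lastCheckpointedOldest;
    LWLockRelease(MultiXactGenLock);

    /*
     * Existing logic: find the members range for oldestMXact, remove
     * members segments first, then offsets segments, up to that point.
     */
}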

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Jeff Janes wrote:

> This problem was initially fairly easy to reproduce, but since I
> started adding instrumentation specifically to catch it, it has become
> devilishly hard to reproduce.
> 
> I think my next step will be to also log each of the values that go
> into the complex if (...) expression that decides on the deletion.

Could you please try to reproduce it after updating to latest?  I pushed
fixes that should close these issues.  Maybe you want to remove the
instrumentation you added, to make failures more likely.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services