Re: Re: [BUGS] BUG #8673: Could not open file "pg_multixact/members/xxxx" on slave during hot_standby - Mailing list pgsql-hackers

From Alvaro Herrera
Subject Re: Re: [BUGS] BUG #8673: Could not open file "pg_multixact/members/xxxx" on slave during hot_standby
Date
Msg-id 20140626044519.GJ7340@eldon.alvh.no-ip.org
In response to Re: Re: [BUGS] BUG #8673: Could not open file "pg_multixact/members/xxxx" on slave during hot_standby  (Andres Freund <andres@2ndquadrant.com>)
List pgsql-hackers
Andres Freund wrote:
> On 2014-06-20 17:38:16 -0400, Alvaro Herrera wrote:

> > It seems to me that we need to keep the offsets files around until a
> > checkpoint has written the "oldest" number to WAL.  In other words we
> > need additional state in shared memory: (a) what we currently store
> > which is the oldest number as computed by vacuum (not safe to delete,
> > but it's the number that the next checkpoint must write), and (b) the
> > oldest number that the last checkpoint wrote (the safe deletion point).
> 
> Why not just WAL log truncations? If we'd emit the WAL record after
> determining the offsets page we should be safe I think? That seems like
> easier and more robust fix? And it's what e.g. the clog does.

Yes, I think this whole thing would be simpler if we just WAL-logged the
truncations, like pg_clog does.  But I would like to avoid doing that
for now, and make that change only in 9.5.  As a backpatchable (to
9.4/9.3) fix, I propose we do the following:

1. Have vacuum update MultiXactState->oldestMultiXactId based on the
minimum value of pg_database->datminmxid.  Since this value is saved in
pg_control, it is restored from checkpoint replay during recovery.

2. Keep track of a new value, MultiXactState->lastCheckpointedOldest.
This value is updated by CreateCheckPoint in a primary server after the
checkpoint record has been flushed, and by xlog_redo in a hot standby, to
be the MultiXactState->oldestMultiXactId value that was last flushed.

3. TruncateMultiXact() no longer receives a parameter.  Files are
removed based on MultiXactState->lastCheckpointedOldest instead.  

4. Call TruncateMultiXact() at checkpoint time, after the checkpoint WAL
record has been flushed, and at restartpoint time (just like today).
This means we only remove files that a prior checkpoint has already
registered as being no longer necessary.  Also, if recovery is
interrupted before the end of WAL (recovery target), the files are still
present.  So we no longer truncate during vacuum.
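
The four steps above can be pictured with a small toy model.  This is
only a sketch in Python, not PostgreSQL code: the class, its method
names, the SEGMENT_SIZE constant and the idea of identifying each file
by the first multixact id it covers are all assumptions made for
illustration.  It shows only the ordering: vacuum advances the oldest
value, and files go away only once a checkpoint has recorded that value.

```python
from dataclasses import dataclass, field

SEGMENT_SIZE = 100  # hypothetical number of multixact ids covered per file


@dataclass
class MultiXactModel:
    """Toy stand-in for the shared state in the proposal; not PostgreSQL code."""
    oldest_multixact_id: int = 1        # step 1: advanced by vacuum
    last_checkpointed_oldest: int = 1   # step 2: advanced only by a checkpoint
    # Hypothetical representation: each file is named by its first multixact id.
    segments: set = field(default_factory=set)

    def vacuum(self, datminmxids):
        # Step 1: use the minimum datminmxid across all databases.
        # Nothing is removed here: vacuum no longer truncates.
        self.oldest_multixact_id = max(self.oldest_multixact_id,
                                       min(datminmxids))

    def checkpoint(self):
        # Step 2: after the checkpoint record is flushed, remember the
        # oldest value that the record carried.
        self.last_checkpointed_oldest = self.oldest_multixact_id
        # Step 4: truncation happens at checkpoint time.
        self.truncate()

    def truncate(self):
        # Step 3: no parameter; the cutoff is the checkpointed value.
        cutoff = self.last_checkpointed_oldest
        # Keep any file that still covers an id at or above the cutoff.
        self.segments = {s for s in self.segments
                         if s + SEGMENT_SIZE > cutoff}
```

In this model, a vacuum that raises the oldest value to 250 leaves all
files in place; the files wholly below 250 disappear only at the next
checkpoint() call.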

Another consideration for (4) is that right now we're only invoking
multixact truncation in a primary when we're able to advance
pg_database.datminmxid (see vac_update_datfrozenxid).  The problem is
that after a crash and subsequent recovery, pg_database might be updated
without removing pg_multixact files; this would mean that the next
opportunity to remove files would be far in the future, when the minimum
datminmxid is advanced again.  One way to fix that would be to have
every single call to vac_update_datfrozenxid() attempt multixact
truncation, but that seems wasteful since I expect vacuuming is more
frequent than checkpointing.
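
The crash scenario in the previous paragraph can be replayed with a
similar toy simulation.  Again this is a hedged Python sketch, not
PostgreSQL code: the function name, the event tuples and SEGMENT_SIZE
are made up for illustration, and the starting state assumes (as
described above) that recovery left pg_database already updated while
the old files are still on disk.

```python
SEGMENT_SIZE = 100  # hypothetical number of multixact ids covered per file


def surviving_segments(segments, oldest, events, truncate_on):
    """Replay events and return the files left at the end.

    truncate_on="vacuum": truncate only when vacuum advances the oldest
    value (the behavior the email describes as current).
    truncate_on="checkpoint": truncate at every checkpoint (the proposal).
    """
    segs = set(segments)

    def truncate(cutoff):
        # Drop files that lie wholly below the cutoff.
        return {s for s in segs if s + SEGMENT_SIZE > cutoff}

    for kind, value in events:
        if kind == "vacuum":
            advanced = value > oldest
            oldest = max(oldest, value)
            if truncate_on == "vacuum" and advanced:
                segs = truncate(oldest)
        elif kind == "checkpoint":
            if truncate_on == "checkpoint":
                segs = truncate(oldest)
    return segs


# State after the crash: the oldest value is already 250, but files that
# start at 0 and 100 were never removed.
files = {0, 100, 200, 300}
events = [("vacuum", 250), ("checkpoint", None), ("vacuum", 250)]

leftover_old = surviving_segments(files, 250, events, "vacuum")
leftover_new = surviving_segments(files, 250, events, "checkpoint")
```

Under these assumptions the vacuum-driven variant keeps every file,
because no vacuum advances the minimum, while the checkpoint-driven
variant removes the stale files at the first checkpoint.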

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


