Re: [HACKERS] Re: 9.4.1 -> 9.4.2 problem: could not access status of transaction 1 - Mailing list pgsql-general

On 2015-05-29 15:08:11 -0400, Robert Haas wrote:
> It seems pretty clear that we can't effectively determine anything
> about member wraparound until the cluster is consistent.

I wonder if this doesn't actually hint at a bigger problem.  Currently,
SlruScanDirectory() is used to determine where we need to truncate.
That, afaics, could actually be a problem during recovery when
we're not consistent.

Consider the scenario where a very large database is copied while
running. At the start of the backup we'll determine at which checkpoint
recovery will start and store it in the label. After that the copy will
start, copying everything slowly. That works because we expect recovery
to fix things up.  The problem I see WRT multixacts is that the copied
state of pg_multixact could be wildly different from the one at the
label's checkpoint. During recovery, before reaching the first
checkpoint, we'll create multixact files as used at the time of the
checkpoint. But the rest of pg_multixact may be ahead by 2**31 xacts.
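
To make the "ahead by 2**31" concern concrete, here is a minimal sketch
(Python, not PostgreSQL code) of the usual modulo-2**32 ordering for
32-bit ids, where "a precedes b" means the signed 32-bit difference
a - b is negative. The function and variable names are made up for
illustration. Once the copied state is more than 2**31 ahead of the
checkpoint's state, the comparison flips and the newer state looks older.

```python
MASK32 = 0xFFFFFFFF

def precedes(a: int, b: int) -> bool:
    """True if 32-bit id `a` is logically older than `b` (modulo 2**32)."""
    diff = (a - b) & MASK32
    # Equivalent to "signed 32-bit (a - b) < 0".
    return diff >= 0x80000000

# Hypothetical values: the id recorded at the label's checkpoint, and ids
# reflecting pg_multixact files that were copied much later in the backup.
checkpoint_next = 1000
copied_slightly_ahead = (checkpoint_next + 5_000_000) & MASK32
copied_far_ahead = (checkpoint_next + 2**31 + 1) & MASK32

# A modest lead is ordered the way we expect.
print(precedes(checkpoint_next, copied_slightly_ahead))
# Past the halfway point the "future" state compares as older.
print(precedes(copied_far_ahead, checkpoint_next))
```

So anything that orders on-disk state against the checkpoint's values
can be fooled once the two are that far apart.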

For clog and such that's not a problem because the truncation
etc. points are all stored in WAL - during recovery we just replay the
truncations that happened on the master, there's no need to look at the
data directory. And we won't access the clog before being consistent.

But for multixacts it's different. To avoid ending up with
pg_multixact/*/* directories we have to do truncations during
recovery. As there are currently no truncation records we have to do that
by scanning the data directory. But that state could be "from the future".
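
A rough sketch of why scanning is fragile (Python, purely illustrative,
not what SlruScanDirectory() does internally). It assumes segment files
are named with four uppercase hex digits and that segment numbers live
in a circular 2**16 space, which should match the offsets SLRU with
default build settings; the helper names are hypothetical. The scan
picks the "earliest" segment by circular comparison, and a single file
copied late in the backup changes the answer.

```python
import os
import re
import tempfile

SEGMENT_NAME = re.compile(r"[0-9A-F]{4}")
SEGMENT_SPACE = 1 << 16  # assumed size of the circular segment-number space

def segment_precedes(a: int, b: int) -> bool:
    """Circular 'a is older than b' over the assumed 16-bit segment space."""
    return (a - b) % SEGMENT_SPACE >= SEGMENT_SPACE // 2

def scan_segments(directory: str) -> list[int]:
    """Return the segment numbers found in `directory`, ignoring other files."""
    return sorted(
        int(name, 16)
        for name in os.listdir(directory)
        if SEGMENT_NAME.fullmatch(name)
    )

def earliest_segment(segments: list[int]):
    """Pick the segment that circularly precedes the others seen so far."""
    earliest = None
    for seg in segments:
        if earliest is None or segment_precedes(seg, earliest):
            earliest = seg
    return earliest

# Three segments as of the checkpoint, plus one copied "from the future".
with tempfile.TemporaryDirectory() as offsets_dir:
    for name in ("0010", "0011", "0012", "8020", "README"):
        open(os.path.join(offsets_dir, name), "w").close()
    found = scan_segments(offsets_dir)

print([f"{seg:04X}" for seg in found])
print(f"earliest according to the scan: {earliest_segment(found):04X}")
```

Without the 8020 file the scan would report 0010. With it, the scan
reports 8020, so any truncation decision based on the directory contents
starts from the wrong place.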

I considered for a second whether the solution for that could be to not
truncate while inconsistent - but I think that doesn't solve anything as
then we can end up with directories where every single offsets/member
file exists.  We could possibly try to fix that by always truncating
away slru segments in offsets that we know to be too old to exist in a
valid database. But achieving the same for members fries my brain.  It
also seems awfully risky.

I think at least for 9.5+ we should a) invent proper truncation records
for pg_multixact b) start storing oldestValidMultiOffset in pg_control.
The current hack of scanning the directories to get knowledge we should
have is a pretty bad hack, and we should not continue using it forever.
I think we might end up needing to do a) even in the backbranches.
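
For contrast, a very rough sketch of the shape I have in mind (Python,
everything here is hypothetical: the record layout, the field names and
the 2**16 segment space are assumptions for illustration, not a proposed
format). The truncation range comes from the record and the oldest valid
position is kept in a pg_control-like structure, so replay never has to
infer anything from what happens to be in the directory.

```python
from dataclasses import dataclass

SEGMENT_SPACE = 1 << 16  # assumed size of the circular segment-number space

@dataclass(frozen=True)
class TruncateRecord:
    """Hypothetical WAL record: remove segments in [start_segment, end_segment)."""
    start_segment: int
    end_segment: int

@dataclass
class ControlState:
    """Hypothetical stand-in for a value persisted in pg_control."""
    oldest_valid_segment: int

def replay_truncate(record: TruncateRecord, control: ControlState,
                    on_disk: set[int]) -> None:
    """Apply one truncation record without consulting the directory contents."""
    seg = record.start_segment
    while seg != record.end_segment:
        on_disk.discard(seg)            # missing segments are simply skipped
        seg = (seg + 1) % SEGMENT_SPACE
    control.oldest_valid_segment = record.end_segment

# Same mixed directory as a base backup might produce.
on_disk = {0x0010, 0x0011, 0x0012, 0x8020}
control = ControlState(oldest_valid_segment=0x0010)

replay_truncate(TruncateRecord(start_segment=0x0010, end_segment=0x0012),
                control, on_disk)

print(sorted(f"{seg:04X}" for seg in on_disk))
print(f"oldest valid segment: {control.oldest_valid_segment:04X}")
```

The stray 8020 segment is left alone here; it gets dealt with when (and
if) a later record covers it, rather than steering the decision now.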


Am I missing something?


This problem isn't conflicting with most of the fixes you describe, so
I'll continue with reviewing those.


Greetings,

Andres Freund

