From Andres Freund
Subject Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
Msg-id 20150602125600.GJ30287@alap3.anarazel.de
In response to Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: Re: [GENERAL] 9.4.1 -> 9.4.2 problem: could not access status of transaction 1
List pgsql-hackers
On 2015-06-01 14:22:32 -0400, Robert Haas wrote:
> On Mon, Jun 1, 2015 at 4:58 AM, Andres Freund <andres@anarazel.de> wrote:
> > The lack of WAL logging actually has caused problems in the 9.3.3 (?)
> > era, where we didn't do any truncation during recovery...
>
> Right, but now we're piggybacking on the checkpoint records, and I
> don't have any evidence that this approach can't be made robust.  It's
> possible that it can't be made robust, but that's not currently clear.

Well, it's possible that it can be made to work without problems. But I
think robust is something different. Having to look at SLRUs during
recovery to find out what to truncate puts more intelligence/complexity
in the recovery path than I'm comfortable with.

> >> By the time we've reached the minimum recovery point, they will have
> >> been recreated by the same WAL records that created them in the first
> >> place.
> >
> > I'm not sure that's true. I think we could end up erroneously removing
> > files that were included in the base backup. Anyway, let's focus on your
> > patch for now.
>
> OK, but I am interested in discussing the other thing too.  I just
> can't piece together the scenario myself - there may well be one.  The
> base backup will begin replay from the checkpoint caused by
> pg_start_backup() and remove anything that wasn't there at the start
> of the backup.  But all of that stuff should get recreated by the time
> we reach the minimum recovery point (end of backup).

I'm not sure whether it's reproducibly borked. What about this scenario:
1) pg_start_backup() is called, creates a checkpoint.
2) 2**31 multixacts are created, possibly with several checkpoints
   in between
3) pg_multixact is copied
4) basebackup finishes

Unless I'm missing something, this will lead to a crash-recovery startup
where the first call to TruncateMultiXact() sees
MultiXactState->lastCheckpointedOldest wildly inconsistent with the
GetOldestMultiXactOnDisk() return value, possibly leading to truncation
being skipped erroneously.  Whether that's a problem I'm not yet
entirely sure.
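
To make the numbers concrete: multixact IDs compare using modulo-2^31
arithmetic, so once the on-disk state is ~2**31 multixacts ahead of what
the checkpoint recorded, any wraparound-aware comparison of the two flips
sign. A toy illustration (this is *not* the TruncateMultiXact() logic,
just the shape of MultiXactIdPrecedes() as I remember it):

    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t MultiXactId;

    /* circular comparison, same shape as MultiXactIdPrecedes() */
    static int
    mxid_precedes(MultiXactId a, MultiXactId b)
    {
        return (int32_t) (a - b) < 0;
    }

    int
    main(void)
    {
        MultiXactId checkpointed_oldest = 100;              /* from pg_start_backup()'s checkpoint */
        MultiXactId oldest_on_disk = 100 + (1U << 31) + 5;  /* after ~2**31 mxacts were created */

        /* prints 1: the on-disk oldest now appears to *precede* the
         * checkpointed value, so any sanity check comparing the two
         * can be fooled */
        printf("%d\n", mxid_precedes(oldest_on_disk, checkpointed_oldest));
        return 0;
    }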

But what *definitely* looks wrong to me is that a TruncateMultiXact() in
this scenario now (since a couple of weeks ago) does a
SimpleLruReadPage_ReadOnly() in the members slru via
find_multixact_start(). That just won't work acceptably when we're not
yet consistent: there may very well not be a valid members segment at
that point.  Am I missing something?
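
(For reference, find_multixact_start() boils down to roughly the
following -- paraphrased from memory rather than copied from multixact.c.
The point is the SimpleLruReadPage_ReadOnly() call: if the slru segment
it needs isn't on disk yet, the read errors out with the familiar "could
not access status of transaction".)

    /* rough paraphrase of find_multixact_start(), not the exact source */
    static MultiXactOffset
    find_multixact_start(MultiXactId multi)
    {
        int              pageno = MultiXactIdToOffsetPage(multi);
        int              entryno = MultiXactIdToOffsetEntry(multi);
        int              slotno;
        MultiXactOffset *offptr;
        MultiXactOffset  offset;

        /* reads the slru page from disk if it isn't cached; if the segment
         * file is missing, SlruReportIOError() raises ERROR */
        slotno = SimpleLruReadPage_ReadOnly(MultiXactOffsetCtl, pageno, multi);
        offptr = (MultiXactOffset *) MultiXactOffsetCtl->shared->page_buffer[slotno];
        offset = offptr[entryno];
        LWLockRelease(MultiXactOffsetControlLock);

        return offset;
    }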

> > I'm more worried about the cases where we didn't ever actually "badly
> > wrap around" (i.e. overwrite needed data); but where that's not clear on
> > the standby because the base backup isn't in a consistent state.
>
> I agree. The current patch tries to make it so that we never call
> find_multixact_start() while in recovery, but it doesn't quite
> succeed: the call in TruncateMultiXact still happens during recovery,
> but only once we're sure that the mxact we plan to call it on actually
> exists on disk.  That won't be called until we replay the first
> checkpoint, but that might still be prior to consistency.

It'll pretty much *always* be before we reach consistency, right? It'll
be called on the checkpoint created by pg_start_backup()?

I don't think the presence check (that's GetOldestMultiXactOnDisk() in
TruncateMultiXact(), right?) helps very much. There's no guarantee at all
that offsets and members are in any way consistent with each other, or
internally consistent for that matter: the copy could very well have been
taken in the middle of a write of an slru page.

I think at the very least we'll have to skip this step while not yet
consistent. That really sucks, because we'll possibly end up with
multixacts that are completely filled by the time we've reached
consistency.
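
In code terms I'm thinking of something as simple as a guard at the top
of the truncation path; hypothetical sketch only, the exact condition and
placement obviously need more thought:

    /* hypothetical guard in TruncateMultiXact(): don't inspect the on-disk
     * slrus before we've reached a consistent state, because the base backup
     * may have copied them half-written and/or wrapped around */
    if (RecoveryInProgress() && !reachedConsistency)
        return;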

Greetings,

Andres Freund

