Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WAL files - Mailing list pgsql-hackers

From Stephen Frost
Subject Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WAL files
Date
Msg-id 20180125123837.GH2416@tamriel.snowman.net
In response to Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WAL files  (Michael Paquier <michael.paquier@gmail.com>)
Responses Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WAL files
List pgsql-hackers

Michael, all,

* Michael Paquier (michael.paquier@gmail.com) wrote:
> On Wed, Jan 24, 2018 at 12:43:51PM -0500, Stephen Frost wrote:
> > * chenhj (chjischj@163.com) wrote:
> >> At 2018-01-23 09:56:48, "Stephen Frost" <sfrost@snowman.net> wrote:
> >>> I've only read through the thread to try and understand what's going on
> >>> and the first thing that comes to mind is that you're changing
> >>> pg_rewind to not remove the WAL from before the divergence (split)
> >>> point, but I'm not sure why.  As noted, that WAL isn't needed for
> >>> anything (it's from before the split, after all), so why keep it?  Is
> >>> there something in this optimization that depends on the old WAL being
> >>> there and, if so, what and why?
> >>
> >> After running pg_rewind, the first startup of postgres will do crash recovery.
> >> And crash recovery will begin from the previous redo point preceding the divergence.
> >> So, the WAL after the redo point and before the divergence is needed.
> >
> > Right.
>
> Most of the time, and particularly since v11 has removed the need to
> retain more past segments than one completed checkpoint, those segments
> have less chances to be on the source server, limiting more the impact
> of the patch discussed on this thread.

Good point.

> >>> That's also different from how pg_basebackup works, which I don't think
> >>> is good (seems like pg_rewind should operate in a pretty similar manner
> >>> to pg_basebackup).
> >>
> >> Thanks for your comments!
> >> I also considered copying WAL just like pg_basebackup does, but an
> >> implementation similar to pg_basebackup's may not be so simple.
> >
> > Using the replication protocol to fetch WAL would be a good thing to do
> > (actually, making pg_rewind entirely work through a connection to the
> > current primary would be great) but that's independent of what I'm
> > asking for here.  Here I'm just suggesting that we not change what
> > pg_rewind is doing today when it comes to the existing WAL on the
> > old-primary.
>
> Yes, superuser is necessary now, if we could get to a point where only a
> replication permission is needed that would be nice. Now we could do
> things differently. We could have a system role dedicated to pg_rewind
> which works only on the functions from genfile.c that pg_rewind needs,
> in order to leverage the need of a superuser.

Actually, the other work I'm doing nearby wrt removing the explicit
superuser() checks in those functions would allow a non-superuser role
to be created which could be used by pg_rewind.  We could even add an
explicit 'pg_rewind' default role if we wanted to, but is that the route
we want to go with this or should we be thinking about changing
pg_rewind to use the replication protocol and a replication user
instead..?  The only reason that it doesn't today is that there isn't an
easy way for it to do so with the existing replication protocol, as I
understand it.

Then again, there's this whole question of whether we should even keep
the replication protocol or whether we should be getting rid of it in
favor of having regular connections that can support replication.  There's
been discussion of that recently, as I recall, though I can't remember where.

> >> And the WAL which contains the previous redo point preceding the
> >> divergence may exist only on the target server and have been recycled
> >> on the source. That's a difference between pg_rewind and pg_basebackup.
> >
> > Hm, pg_rewind was removing that and expecting it to be on the new
> > primary?  If that's the case then I could see an argument for keeping
> > WAL that's from the divergence point onward, but I still don't think
> > we should have pg_rewind just leave all of the prior WAL in place.
>
> Another thing that we could as well do is simply not fetching any WAL
> files at all during a rewind, then let the startup process of the
> rewound server decide by itself what it needs. This would leverage the
> data transferred in all cases. It is easy to define the start point of
> WAL segments needed for a rewound server because the last checkpoint
> record before WAL forked is calculated before transferring any data. The
> finish point cannot be exact though because you don't know up to which
> point you should transfer it. In some ways, this is close to a base
> backup. We could as well define an end point to minimize the amount of
> WAL as the last completed segment before data transfer begins, but then
> you need to worry about WAL segment holes and such. At the end of the
> day, just not transferring any data from pg_wal looks more solid to me
> as a first step if we need to worry about data that is transferred but
> finishes by being useless.

Having the rewound server decide by itself what it needs actually sounds
like it would be better and would allow whatever the restore command is
to fetch any missing WAL necessary (which it might be able to do more
efficiently, from an archive server and possibly even in parallel as
compared to pg_rewind doing it serially from the primary...).  We don't
have any particularly easy way to prevent needed WAL from disappearing
off of the primary anyway, so it seems like it would be necessary to
have a working restore command for pg_rewind to be reliable.  While we
could create a physical replication slot for pg_rewind when it starts,
that may already be too late.
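
To make the range under discussion concrete (from the redo point of the
last checkpoint before the divergence up to the divergence point), here is
a rough sketch in Python.  It assumes the default 16MB WAL segment size and
the usual 24-hex-digit WAL file naming; the function names and the LSNs in
the usage note are made up for illustration, and this is not what pg_rewind
itself does.

```python
# Sketch only: list the WAL segment files covering [redo_lsn, divergence_lsn].
# Assumption: default 16MB segments; pass seg_size for clusters built otherwise.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_lsn(text):
    """Convert an LSN written as 'hi/lo' (both hex) into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def wal_file_name(timeline, segno, seg_size=WAL_SEGMENT_SIZE):
    """Build the segment file name: timeline, then segment number split in two."""
    segs_per_xlogid = 0x100000000 // seg_size
    return "%08X%08X%08X" % (
        timeline,
        segno // segs_per_xlogid,
        segno % segs_per_xlogid,
    )


def segments_needed(timeline, redo_lsn, divergence_lsn, seg_size=WAL_SEGMENT_SIZE):
    """Hypothetical helper: segment files from the redo point to the divergence."""
    first = parse_lsn(redo_lsn) // seg_size
    last = parse_lsn(divergence_lsn) // seg_size
    return [wal_file_name(timeline, s, seg_size) for s in range(first, last + 1)]
```

For instance, segments_needed(1, "0/3000028", "0/5000060") would name the
three segments on timeline 1 that hold that range.  Whether each of them
still exists on the source, the target, or only in the archive is exactly
the open question above.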

Thanks!

Stephen
