Thread: Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WALfiles

Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WALfiles

From
chenhj
Date:
On 2017-09-30 02:17:54,"Alexander Korotkov" <a.korotkov@postgrespro.ru> wrote:

Great.  Now code of this patch looks good for me.
However, we forgot about documentation.

  <para>
   The result is equivalent to replacing the target data directory with the
   source one. Only changed blocks from relation files are copied;
   all other files are copied in full, including configuration files. The
   advantage of <application>pg_rewind</> over taking a new base backup, or
   tools like <application>rsync</>, is that <application>pg_rewind</> does
   not require reading through unchanged blocks in the cluster. This makes
   it a lot faster when the database is large and only a small
   fraction of blocks differ between the clusters.
  </para>

At least, this paragraph need to be adjusted, because it states whose files are copied.  And probably latter paragraphs whose state about WAL files.



Your are rigth.
I wrote a draft as following, but i'm afraid whether the english statement is accurate.

----------------------
diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index d5430d4..bcd094b 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -50,9 +50,12 @@ PostgreSQL documentation
   <para>
    The result is equivalent to replacing the target data directory with the
    source one. Only changed blocks from relation files are copied;
-   all other files are copied in full, including configuration files. The
-   advantage of <application>pg_rewind</> over taking a new base backup, or
-   tools like <application>rsync</>, is that <application>pg_rewind</> does
+   all other files except WAL are copied in full, including configuration
+   files. Only the WAL files between the point of divergence and the current
+   WAL insert location of the source server are copied, for other WAL files
+   are useless for the target server. The advantage of
+   <application>pg_rewind</> over taking a new base backup, or tools
+   like <application>rsync</>, is that <application>pg_rewind</> does
    not require reading through unchanged blocks in the cluster. This makes
    it a lot faster when the database is large and only a small
    fraction of blocks differ between the clusters.
@@ -231,7 +234,7 @@ PostgreSQL documentation
      <para>
       Copy all other files such as <filename>pg_xact</filename> and
       configuration files from the source cluster to the target cluster
-      (everything except the relation files).
+      (everything except the relation files and some WAL files).
      </para>
     </step>
     <step>

------
Best Regars,
Chen Huajun


Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WAL files

From
Alexander Korotkov
Date:
On Sat, Sep 30, 2017 at 8:18 PM, chenhj <chjischj@163.com> wrote:
On 2017-09-30 02:17:54,"Alexander Korotkov" <a.korotkov@postgrespro.ruwrote:

Great.  Now code of this patch looks good for me.
However, we forgot about documentation.

  <para>
   The result is equivalent to replacing the target data directory with the
   source one. Only changed blocks from relation files are copied;
   all other files are copied in full, including configuration files. The
   advantage of <application>pg_rewind</> over taking a new base backup, or
   tools like <application>rsync</>, is that <application>pg_rewind</> does
   not require reading through unchanged blocks in the cluster. This makes
   it a lot faster when the database is large and only a small
   fraction of blocks differ between the clusters.
  </para>

At least, this paragraph need to be adjusted, because it states whose files are copied.  And probably latter paragraphs whose state about WAL files.



Your are rigth.
I wrote a draft as following, but i'm afraid whether the english statement is accurate.

I'm not native english speaker too :(

Only the WAL files between the point of divergence and the current WAL insert location of the source server are copied, *for* other WAL files are useless for the target server. 

I'm not sure about this usage of word *for*.  For me, it probably should be just removed.  Rest of changes looks good for me.  Please, integrate them into the patch.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WALfiles

From
chenhj
Date:
On  2017-10-01 04:09:19,"Alexander Korotkov" <a.korotkov@postgrespro.ru> wrote:
On Sat, Sep 30, 2017 at 8:18 PM, chenhj <chjischj@163.com> wrote:
On 2017-09-30 02:17:54,"Alexander Korotkov" <a.korotkov@postgrespro.ruwrote:

Great.  Now code of this patch looks good for me.
However, we forgot about documentation.

  <para>
   The result is equivalent to replacing the target data directory with the
   source one. Only changed blocks from relation files are copied;
   all other files are copied in full, including configuration files. The
   advantage of <application>pg_rewind</> over taking a new base backup, or
   tools like <application>rsync</>, is that <application>pg_rewind</> does
   not require reading through unchanged blocks in the cluster. This makes
   it a lot faster when the database is large and only a small
   fraction of blocks differ between the clusters.
  </para>

At least, this paragraph need to be adjusted, because it states whose files are copied.  And probably latter paragraphs whose state about WAL files.



Your are rigth.
I wrote a draft as following, but i'm afraid whether the english statement is accurate.

I'm not native english speaker too :(

Only the WAL files between the point of divergence and the current WAL insert location of the source server are copied, *for* other WAL files are useless for the target server. 

I'm not sure about this usage of word *for*.  For me, it probably should be just removed.  Rest of changes looks good for me.  Please, integrate them into the patch.


I had removed the *for* , Pleae check the new patch again.

---
Best Regards,
Chen Huajun
Attachment

Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WAL files

From
Alexander Korotkov
Date:
On Sun, Oct 1, 2017 at 8:27 PM, chenhj <chjischj@163.com> wrote:
On  2017-10-01 04:09:19,"Alexander Korotkov" <a.korotkov@postgrespro.ru> wrote:
On Sat, Sep 30, 2017 at 8:18 PM, chenhj <chjischj@163.com> wrote:
On 2017-09-30 02:17:54,"Alexander Korotkov" <a.korotkov@postgrespro.ru> wrote:

Great.  Now code of this patch looks good for me.
However, we forgot about documentation.

  <para>
   The result is equivalent to replacing the target data directory with the
   source one. Only changed blocks from relation files are copied;
   all other files are copied in full, including configuration files. The
   advantage of <application>pg_rewind</> over taking a new base backup, or
   tools like <application>rsync</>, is that <application>pg_rewind</> does
   not require reading through unchanged blocks in the cluster. This makes
   it a lot faster when the database is large and only a small
   fraction of blocks differ between the clusters.
  </para>

At least, this paragraph need to be adjusted, because it states whose files are copied.  And probably latter paragraphs whose state about WAL files.



Your are rigth.
I wrote a draft as following, but i'm afraid whether the english statement is accurate.

I'm not native english speaker too :(

Only the WAL files between the point of divergence and the current WAL insert location of the source server are copied, *for* other WAL files are useless for the target server. 

I'm not sure about this usage of word *for*.  For me, it probably should be just removed.  Rest of changes looks good for me.  Please, integrate them into the patch.


I had removed the *for* , Pleae check the new patch again.

Now, this patch looks good for me.  It applies cleanly, builds cleanly, passes regression tests, new functionality is covered by regression tests.  Code is OK for me and docs too.

I'm marking this patch as "Ready for committer".  BTW, authors field in the commitfest app is empty (https://commitfest.postgresql.org/15/1302/).  Please, put your name there.

I hope this patch will be committed during 2017-11 commitfest.  Be ready to rebase this patch if needed.  Thank you for your work.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WALfiles

From
chenhj
Date:
On 2017-10-02 23:24:30,"Alexander Korotkov" <a.korotkov@postgrespro.ru> wrote:
On Sun, Oct 1, 2017 at 8:27 PM, chenhj <chjischj@163.com> wrote:


Now, this patch looks good for me.  It applies cleanly, builds cleanly, passes regression tests, new functionality is covered by regression tests.  Code is OK for me and docs too.

I'm marking this patch as "Ready for committer".  BTW, authors field in the commitfest app is empty (https://commitfest.postgresql.org/15/1302/).  Please, put your name there.

I hope this patch will be committed during 2017-11 commitfest.  Be ready to rebase this patch if needed.  Thank you for your work.

I had filled the authors field of this patch in commitfest, and will rebase this patch if needed. Thank you for your help!

------
Best Regards,
Chen Huajun



Re:Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WALfiles

From
chenhj
Date:

At 2017-12-01 12:27:09, "Michael Paquier" <michael.paquier@gmail.com> wrote:
>On Tue, Oct 3, 2017 at 1:20 AM, chenhj <chjischj@163.com> wrote:
>> I had filled the authors field of this patch in commitfest, and will rebase
>> this patch if needed. Thank you for your help!
>
>The documentation of the patch needs a rebase, so I am moving it to
>next CF with "waiting on author" as status.
>
>$ git diff master --check
>src/bin/pg_rewind/pg_rewind.c:292: trailing whitespace.
>+    *
>There are whitespace complains.
>-- 
>Michael

Rebased and removed the whitespace.

----
regards
Chen Huajun
Attachment

Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WAL files

From
Stephen Frost
Date:
Greetings!

* chenhj (chjischj@163.com) wrote:
> Rebased and removed the  whitespace.

Thanks for working on this, I agree that it seems like a pretty cool
optimization for pg_rewind.

I've only read through the thread to try and understand what's going on
and the first thing that comes to mind is that you're changing
pg_rewind to not remove the WAL from before the divergence (split)
point, but I'm not sure why.  As noted, that WAL isn't needed for
anything (it's from before the split, after all), so why keep it?  Is
there something in this optimization that depends on the old WAL being
there and, if so, what and why?

That's also different from how pg_basebackup works, which I don't think
is good (seems like pg_rewind should operate in a pretty similar manner
to pg_basebackup).

Setting this back to Waiting for Author but hopefully you can reply soon
and clarify that, possibly adjusting the patch accordingly.

Thanks!

Stephen

Attachment

Re:Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WALfiles

From
chenhj
Date:

At 2018-01-23 09:56:48, "Stephen Frost" <sfrost@snowman.net> wrote: > >I've only read through the thread to try and understand what's going on >and the first thing that comes to mind is that you're changing >pg_rewind to not remove the WAL from before the divergence (split) >point, but I'm not sure why. As noted, that WAL isn't needed for >anything (it's from before the split, after all), so why keep it? Is >there something in this optimization that depends on the old WAL being >there and, if so, what and why?

After run pg_rewind, the first startup of postgres will do crash recovery. And crash recovery will begin from the previous redo point preceding the divergence. So, the WAL after the redo point and before the divergence is needed.
Of course, the WAL before the redo point is not needed, but in my point of view, recycling unwanted WAL does not have to be done by pg_rewind. reference code: pg_rewind.c: main(int argc, char **argv) { ... findLastCheckpoint(datadir_target, divergerec, lastcommontliIndex, &chkptrec, &chkpttli, &chkptredo); ... createBackupLabel(chkptredo, chkpttli, chkptrec); ... } ... createBackupLabel(XLogRecPtr startpoint, TimeLineID starttli, XLogRecPtr checkpointloc) { ...len = snprintf(buf, sizeof(buf), "START WAL LOCATION: %X/%X (file %s)\n" "CHECKPOINT LOCATION: %X/%X\n" "BACKUP METHOD: pg_rewind\n" "BACKUP FROM: standby\n" "START TIME: %s\n",/* omit LABEL: line */ (uint32) (startpoint >> 32), (uint32) startpoint, xlogfilename, (uint32) (checkpointloc >> 32), (uint32) checkpointloc, strfbuf); ... }


>That's also different from how pg_basebackup works, which I don't think >is good (seems like pg_rewind should operate in a pretty similar manner >to pg_basebackup). >
Thanks for your comments! I also considered copy WAL just like how pg_basebackup does,but a implement similar to pg_basebackup's manner may be not so simple.
Twice transport of files from source to target may be needed,first for data files, and another for WAL. And the WAL which contains the previous redo point preceding the divergence may be only exists in target server and had been recycled in source. That's different between pg_rewind and pg_basebackup.

Regards,
Chen Huajun



Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WAL files

From
Stephen Frost
Date:
Greetings,

* chenhj (chjischj@163.com) wrote:
>
> At 2018-01-23 09:56:48, "Stephen Frost" <sfrost@snowman.net> wrote:
> >I've only read through the thread to try and understand what's going on
> >and the first thing that comes to mind is that you're changing
> >pg_rewind to not remove the WAL from before the divergence (split)
> >point, but I'm not sure why.  As noted, that WAL isn't needed for
> >anything (it's from before the split, after all), so why keep it?  Is
> >there something in this optimization that depends on the old WAL being
> >there and, if so, what and why?
>
> After run pg_rewind, the first startup of postgres will do crash recovery.
> And crash recovery will begin from the previous redo point preceding the divergence.
> So, the WAL after the redo point and before the divergence is needed.

Right.

> Of course, the WAL before the redo point is not needed, but in my point of view,
> recycling unwanted WAL does not have to be done by pg_rewind.

That's what pg_rewind has been doing though, isn't it?  And it's not
like that WAL is useful for anything, is it?  That's also how
pg_basebackup works.

> >That's also different from how pg_basebackup works, which I don't think
> >is good (seems like pg_rewind should operate in a pretty similar manner
> >to pg_basebackup).
>
> Thanks for your comments!
> I also considered copy WAL just like how pg_basebackup does,but a implement similar to pg_basebackup's manner may be
notso simple. 

Using the replication protocol to fetch WAL would be a good thing to do
(actually, making pg_rewind entirely work through a connection to the
current primary would be great) but that's independent of what I'm
asking for here.  Here I'm just suggesting that we not change what
pg_rewind is doing today when it comes to the existing WAL on the
old-primary.

> And the WAL which contains the previous redo point preceding the divergence may be only exists in target server and
hadbeen recycled in source. That's different between pg_rewind and pg_basebackup. 

Hm, pg_rewind was removing that and expecting it to be on the new
primary?  If that's the case then I could see an argument for keeping
WAL that's from the divergence point onward, but I still don't think we
should have pg_rewind just leave all of the prior WAL in place.

Thanks!

Stephen

Attachment

Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WAL files

From
Michael Paquier
Date:
On Wed, Jan 24, 2018 at 12:43:51PM -0500, Stephen Frost wrote:
> * chenhj (chjischj@163.com) wrote:
>> At 2018-01-23 09:56:48, "Stephen Frost" <sfrost@snowman.net> wrote:
>>> I've only read through the thread to try and understand what's going on
>>> and the first thing that comes to mind is that you're changing
>>> pg_rewind to not remove the WAL from before the divergence (split)
>>> point, but I'm not sure why.  As noted, that WAL isn't needed for
>>> anything (it's from before the split, after all), so why keep it?  Is
>>> there something in this optimization that depends on the old WAL being
>>> there and, if so, what and why?
>>
>> After run pg_rewind, the first startup of postgres will do crash recovery.
>> And crash recovery will begin from the previous redo point preceding the divergence.
>> So, the WAL after the redo point and before the divergence is needed.
>
> Right.

Most of the time, and particularly since v11 has removed the need to
retain more past segments than one completed checkpoint, those segments
have less chances to be on the source server, limiting more the impact
of the patch discussed on this thread.

>> Of course, the WAL before the redo point is not needed, but in my point of view,
>> recycling unwanted WAL does not have to be done by pg_rewind.
>
> That's what pg_rewind has been doing though, isn't it?  And it's not
> like that WAL is useful for anything, is it?  That's also how
> pg_basebackup works.

As of HEAD, pg_rewind handles data in pg_wal similarly to other
paths which are not relation files: the files from the source are just
blindly copied to the target. After the rewind and once recovery begins,
we just let the startup process do the cleanup instead of
pg_rewind. Regarding that pg_basebackup is different, you get the choice
to do what you want using --wal-method, and you actually get the
segments that you only need to get a self-contained base backup.

>>> That's also different from how pg_basebackup works, which I don't think
>>> is good (seems like pg_rewind should operate in a pretty similar manner
>>> to pg_basebackup).
>>
>> Thanks for your comments!
>> I also considered copy WAL just like how pg_basebackup does,but a
>> implement similar to pg_basebackup's manner may be not so simple.
>
> Using the replication protocol to fetch WAL would be a good thing to do
> (actually, making pg_rewind entirely work through a connection to the
> current primary would be great) but that's independent of what I'm
> asking for here.  Here I'm just suggesting that we not change what
> pg_rewind is doing today when it comes to the existing WAL on the
> old-primary.

Yes, superuser is necessary now, if we could get to a point where only a
replication permission is needed that would be nice. Now we could do
things differently. We could have a system role dedicated to pg_rewind
which works only on the functions from genfile.c that pg_rewind needs,
in order to leverage the need of a superuser.

>> And the WAL which contains the previous redo point preceding the
>> divergence may be only exists in target server and had been recycled
>> in source. That's different between pg_rewind and pg_basebackup.
>
> Hm, pg_rewind was removing that and expecting it to be on the new
> primary?  If that's the case then I could see an argument for keeping
> WAL that's from the divergence point onward, but I still don't think
> we should have pg_rewind just leave all of the prior WAL in place.

Another thing that we could as well do is simply not fetching any WAL
files at all during a rewind, then let the startup process of the
rewound server decide by itself what it needs. This would leverage the
data transfered in all cases. It is easy to define the start point of
WAL segments needed for a rewound server because the last checkpoint
record before WAL forked is calculated before transferring any data. The
finish point cannot be exact though because you don't know up to which
point you should transfer it. In some ways, this is close to a base
backup. We could as well define an end point to minimize the amount of
WAL as the last completed segment before data transfer begins, but then
you need to worry about WAL segment holes and such. At the end of the
day, just not transferring any data from pg_wal looks more solid to me
as a first step if we need to worry about data that is transferred but
finishes by being useless.
--
Michael

Attachment

Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WAL files

From
Stephen Frost
Date:
Michael, all,

* Michael Paquier (michael.paquier@gmail.com) wrote:
> On Wed, Jan 24, 2018 at 12:43:51PM -0500, Stephen Frost wrote:
> > * chenhj (chjischj@163.com) wrote:
> >> At 2018-01-23 09:56:48, "Stephen Frost" <sfrost@snowman.net> wrote:
> >>> I've only read through the thread to try and understand what's going on
> >>> and the first thing that comes to mind is that you're changing
> >>> pg_rewind to not remove the WAL from before the divergence (split)
> >>> point, but I'm not sure why.  As noted, that WAL isn't needed for
> >>> anything (it's from before the split, after all), so why keep it?  Is
> >>> there something in this optimization that depends on the old WAL being
> >>> there and, if so, what and why?
> >>
> >> After run pg_rewind, the first startup of postgres will do crash recovery.
> >> And crash recovery will begin from the previous redo point preceding the divergence.
> >> So, the WAL after the redo point and before the divergence is needed.
> >
> > Right.
>
> Most of the time, and particularly since v11 has removed the need to
> retain more past segments than one completed checkpoint, those segments
> have less chances to be on the source server, limiting more the impact
> of the patch discussed on this thread.

Good point.

> >>> That's also different from how pg_basebackup works, which I don't think
> >>> is good (seems like pg_rewind should operate in a pretty similar manner
> >>> to pg_basebackup).
> >>
> >> Thanks for your comments!
> >> I also considered copy WAL just like how pg_basebackup does,but a
> >> implement similar to pg_basebackup's manner may be not so simple.
> >
> > Using the replication protocol to fetch WAL would be a good thing to do
> > (actually, making pg_rewind entirely work through a connection to the
> > current primary would be great) but that's independent of what I'm
> > asking for here.  Here I'm just suggesting that we not change what
> > pg_rewind is doing today when it comes to the existing WAL on the
> > old-primary.
>
> Yes, superuser is necessary now, if we could get to a point where only a
> replication permission is needed that would be nice. Now we could do
> things differently. We could have a system role dedicated to pg_rewind
> which works only on the functions from genfile.c that pg_rewind needs,
> in order to leverage the need of a superuser.

Actually, the other work I'm doing nearby wrt removing the explicit
superuser() checks in those functions would allow a non-superuser role
to be created which could be used by pg_rewind.  We could even add an
explicit 'pg_rewind' default role if we wanted to, but is that the route
we want to go with this or should we be thinking about changing
pg_rewind to use the replication protocol and a replication user
instead..?  The only reason that it doesn't today is that there isn't an
easy way for it to do so with the existing replication protocol, as I
understand it.

Then again, there's this whole question of if we should even keep the
replication protocol or if we should be getting rid of it in favor of
have regular connections that can support replication.  There's been
discussion of that recently, as I recall, though I can't remember where.

> >> And the WAL which contains the previous redo point preceding the
> >> divergence may be only exists in target server and had been recycled
> >> in source. That's different between pg_rewind and pg_basebackup.
> >
> > Hm, pg_rewind was removing that and expecting it to be on the new
> > primary?  If that's the case then I could see an argument for keeping
> > WAL that's from the divergence point onward, but I still don't think
> > we should have pg_rewind just leave all of the prior WAL in place.
>
> Another thing that we could as well do is simply not fetching any WAL
> files at all during a rewind, then let the startup process of the
> rewound server decide by itself what it needs. This would leverage the
> data transfered in all cases. It is easy to define the start point of
> WAL segments needed for a rewound server because the last checkpoint
> record before WAL forked is calculated before transferring any data. The
> finish point cannot be exact though because you don't know up to which
> point you should transfer it. In some ways, this is close to a base
> backup. We could as well define an end point to minimize the amount of
> WAL as the last completed segment before data transfer begins, but then
> you need to worry about WAL segment holes and such. At the end of the
> day, just not transferring any data from pg_wal looks more solid to me
> as a first step if we need to worry about data that is transferred but
> finishes by being useless.

Having the rewound server decide by itself what it needs actually sounds
like it would be better and would allow whatever the restore command is
to fetch any missing WAL necessary (which it might be able to do more
efficiently, from an archive server and possibly even in parallel as
compared to pg_rewind doing it serially from the primary...).  We don't
have any particularly easy way to prevent needed WAL from disappearing
off of the primary anyway, so it seems like it would be necessary to
have a working restore command for pg_rewind to be reliable.  While we
could create a physical replication slot for pg_rewind when it starts,
that may already be too late.

Thanks!

Stephen

Attachment

Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WAL files

From
Michael Paquier
Date:
On Thu, Jan 25, 2018 at 07:38:37AM -0500, Stephen Frost wrote:
> Michael, all,
>
> * Michael Paquier (michael.paquier@gmail.com) wrote:
>> On Wed, Jan 24, 2018 at 12:43:51PM -0500, Stephen Frost wrote:
>>> * chenhj (chjischj@163.com) wrote:
>>>> At 2018-01-23 09:56:48, "Stephen Frost" <sfrost@snowman.net> wrote:
>>>>> I've only read through the thread to try and understand what's going on
>>>>> and the first thing that comes to mind is that you're changing
>>>>> pg_rewind to not remove the WAL from before the divergence (split)
>>>>> point, but I'm not sure why.  As noted, that WAL isn't needed for
>>>>> anything (it's from before the split, after all), so why keep it?  Is
>>>>> there something in this optimization that depends on the old WAL being
>>>>> there and, if so, what and why?
>>>>
>>>> After run pg_rewind, the first startup of postgres will do crash recovery.
>>>> And crash recovery will begin from the previous redo point preceding the divergence.
>>>> So, the WAL after the redo point and before the divergence is needed.
>>>
>>> Right.
>>
>> Most of the time, and particularly since v11 has removed the need to
>> retain more past segments than one completed checkpoint, those segments
>> have less chances to be on the source server, limiting more the impact
>> of the patch discussed on this thread.
>
> Good point.

Actually, that is worse that than, because after doing a promotion one
has to issue a checkpoint on the promoted server so as to update its
control file (pg_rewind fetches the new TLI number from the on-disk
file), so we can never ensure that needed WAL segments will be on the
primary. Hence the proposed patch loses most of its value, because there
will always be a WAL segment hole anyway.

It can also help in reducing the load in creating new segments if
min_wal_size is large enough on the rewound server but that is a narrow
benefit.

In short, it seems really to me that we should reject the approach as
proposed, and replace it with something that prevents the fetching of
any WAL segments from the source server. I think that we could consider
as well removing all WAL segments on the primary from the point WAL
forked, as those created between the last checkpoint before WAL forked
up to the point where WAL forked are useful anyway. But those are bonus
points to keep a minimalistic amount of space for the rewound node as
they finish by being recycled anyway. For those reasons I think that the
patch should be marked as returned with feedback.

>> Yes, superuser is necessary now, if we could get to a point where only a
>> replication permission is needed that would be nice. Now we could do
>> things differently. We could have a system role dedicated to pg_rewind
>> which works only on the functions from genfile.c that pg_rewind needs,
>> in order to leverage the need of a superuser.
>
> Actually, the other work I'm doing nearby wrt removing the explicit
> superuser() checks in those functions would allow a non-superuser role
> to be created which could be used by pg_rewind.  We could even add an
> explicit 'pg_rewind' default role if we wanted to, but is that the route
> we want to go with this or should we be thinking about changing
> pg_rewind to use the replication protocol and a replication user
> instead..?  The only reason that it doesn't today is that there isn't an
> easy way for it to do so with the existing replication protocol, as I
> understand it.

Yes, actually with the piece of work we are doing, we don't need to work
more on the replication protocol, and pg_rewind does not require heavy
refactoring either, so from a maintenance point of view we would gain
much more in the long term by using a dedicated role. If your patch to
have the file-read catalog user gets in, we should add in the docs of
pg_rewind a small set of ACL commands to allow pg_rewind to work with a
non-superuser role as long is it gains access to the minimal set of
functions necessary.

> Then again, there's this whole question of if we should even keep the
> replication protocol or if we should be getting rid of it in favor of
> have regular connections that can support replication.  There's been
> discussion of that recently, as I recall, though I can't remember
> where.

Hm. I don't recall this one..

> Having the rewound server decide by itself what it needs actually sounds
> like it would be better and would allow whatever the restore command is
> to fetch any missing WAL necessary (which it might be able to do more
> efficiently, from an archive server and possibly even in parallel as
> compared to pg_rewind doing it serially from the primary...).  We don't
> have any particularly easy way to prevent needed WAL from disappearing
> off of the primary anyway, so it seems like it would be necessary to
> have a working restore command for pg_rewind to be reliable.  While we
> could create a physical replication slot for pg_rewind when it starts,
> that may already be too late.

The only approach I can think of that makes unnecessary an external
archive is to have a slot that guarantees that WAL segments down to the
previous checkpoint before promotion are kept around to allow a rewound
node to use them. For planned failovers, things can be controlled even
today as you know when to create a slot. Unplanned failovers are where
the problems come. One way I think can address any problems is to have
what I would call a failover slot. The concept is simple: when a node is
in recovery have it create a new physical replication slot at each
restart point which has a user-defined name, controlled by a GUC, and do
the creation automatically. If the parameter is empty, no slots are
created. At the beginning of a restart point, the slot is removed, so
once the node is promoted the slot will not get removed, and it would
keep around WAL segments down to the last redo point before promotion
automatically.

When you don't want to set up a WAL archive, but still want to ensure
that a rewound node is able to catch up with its promoted primary, just
enable this failover slot and then set up primary_slot_name in
recovery.conf to use it.
--
Michael

Attachment

Re: [HACKERS] [PATCH]make pg_rewind to not copy useless WAL files

From
Michael Paquier
Date:
On Fri, Jan 26, 2018 at 11:36:09AM +0900, Michael Paquier wrote:
> In short, it seems really to me that we should reject the approach as
> proposed, and replace it with something that prevents the fetching of
> any WAL segments from the source server. I think that we could consider
> as well removing all WAL segments on the primary from the point WAL
> forked, as those created between the last checkpoint before WAL forked
> up to the point where WAL forked are useful anyway. But those are bonus
> points to keep a minimalistic amount of space for the rewound node as
> they finish by being recycled anyway. For those reasons I think that the
> patch should be marked as returned with feedback.

Hearing nothing and as the commit fest is coming to a close, I am
marking this patch as returned with feedback.  Feel free to correct me
if you think this is not adapted.
--
Michael

Attachment