Thread: .ready files appearing on slaves

.ready files appearing on slaves

From

Jehan-Guillaume de Rorthais

Date:

04 September 2014, 15:50:28

Hi hackers,

Since few months, we occasionally see .ready files appearing on some slaveinstances from various context. The two I
havein mind are under 9.2.x.

I tried to investigate a bit. These .ready files are created when a WAL file
from pg_xlog has no corresponding file in pg_xlog/archive_status. I could
easily experience this by deleting such a file: it is created again at the next
restartpoint or checkpoint received from the master.

Looking at the WAL in pg_xlog folder corresponding to these .ready files, they
are all much older than the current WAL "cycle" in both mtime and name logic
sequence. As instance on one of these box we have currently 6 of those "ghost"
WALs:
0000000200001E53000000FF 0000000200001F18000000FF 0000000200002047000000FF 00000002000020BF000000FF
0000000200002140000000FF0000000200002370000000FF 000000020000255D000000A8 000000020000255D000000A9 [...normal WAL
sequence...]000000020000255E0000009D

And on another box:
000000010000040E000000FF 0000000100000414000000DA 000000010000046E000000FF 0000000100000470000000FF
00000001000004850000000F000000010000048500000010 [...normal WAL sequence...] 000000010000048500000052

So it seems for some reasons, these old WALs were "forgotten" by the
restartpoint mechanism when they should have been recylced/deleted.

For one of these servers, I could correlate this with some brutal disconnection
of the streaming replication appearing in its logs. But there was no known SR
disconnection on the second one.

Any idea about this weird behaviour? What can we do to help you investigate
further?

Regards,
--
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com

Re: .ready files appearing on slaves

From

Jehan-Guillaume de Rorthais

Date:

15 September 2014, 15:37:54

Hi hackers,

An issue that seems related to this has been posted on pgsql-admin. See:
 http://www.postgresql.org/message-id/CAAS3tyLnXaYDZ0+zhXLPdVtOvHQOvR+jSPhp30o8kvWqQs0Tqw@mail.gmail.com

How can we help on this issue?

Cheers,

On Thu, 4 Sep 2014 17:50:36 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

> Hi hackers,
> 
> Since few months, we occasionally see .ready files appearing on some slave
>  instances from various context. The two I have in mind are under 9.2.x.
> 
> I tried to investigate a bit. These .ready files are created when a WAL file
> from pg_xlog has no corresponding file in pg_xlog/archive_status. I could
> easily experience this by deleting such a file: it is created again at the
> next restartpoint or checkpoint received from the master.
> 
> Looking at the WAL in pg_xlog folder corresponding to these .ready files, they
> are all much older than the current WAL "cycle" in both mtime and name logic
> sequence. As instance on one of these box we have currently 6 of those "ghost"
> WALs:
> 
>   0000000200001E53000000FF
>   0000000200001F18000000FF
>   0000000200002047000000FF
>   00000002000020BF000000FF
>   0000000200002140000000FF
>   0000000200002370000000FF
>   000000020000255D000000A8
>   000000020000255D000000A9
>   [...normal WAL sequence...]
>   000000020000255E0000009D
> 
> And on another box:
> 
>   000000010000040E000000FF
>   0000000100000414000000DA
>   000000010000046E000000FF
>   0000000100000470000000FF
>   00000001000004850000000F
>   000000010000048500000010
>   [...normal WAL sequence...]
>   000000010000048500000052
> 
> So it seems for some reasons, these old WALs were "forgotten" by the
> restartpoint mechanism when they should have been recylced/deleted. 
> 
> For one of these servers, I could correlate this with some brutal
> disconnection of the streaming replication appearing in its logs. But there
> was no known SR disconnection on the second one.
> 
> Any idea about this weird behaviour? What can we do to help you investigate
> further?
> 
> Regards,



-- 
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com

BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Jehan-Guillaume de Rorthais

Date:

18 September 2014, 16:08:01

Hi hackers,

We spend some time with Guillaume Lelarge studying this issue.

CreateRestartPoint() calls RemoveOldXlogFiles() to drop/recycle old WALs. This
one is calling XLogArchiveCheckDone() to check if the given WAL can be dropped.
As our slave has "archive_mode" & "archive_command" set, XLogArchivingActive()
returns true, forcing XLogArchiveCheckDone() to look for a ".done" file. If the
corresponding ".done" file doesn't exist, XLogArchiveCheckDone() calls
XLogArchiveNotify() to actually create the ".ready" file.

Now, on a standby, we are supposed to create the ".done" files by calling
XLogArchiveForceDone() either after a successful restore_command or when the
streaming replication finishes and flushes a WAL. It seems like the root cause
of this trouble: maybe a race condition where XLogArchiveForceDone() would not
be called for a WAL. As the ".done" file is never created, we end up calling
XLogArchiveNotify() to create a ".ready" and this WAL will never be archived by
the slave, sitting in the pg_xlog dir.

But worst, we experience this under two 9.2.8 clusters and realized
this version is not supposed to have 256 segments per WAL, finishing a WAL to
the *FE one. I'm quoting back the history of my previous mail as we clearly
have some *FF files on the slave side!

> >   000000010000040E000000FF
> >   0000000100000414000000DA
> >   000000010000046E000000FF
> >   0000000100000470000000FF
> >   00000001000004850000000F
> >   000000010000048500000010
> >   [...normal WAL sequence...]
> >   000000010000048500000052

Note that it seems to happen only during streaming replication, not in pure Log
Shipping replication. We added the debug message on the master side to make
sure it never archive *FF file

We kept the WAL files and log files for further analysis. How can we help
regarding this issue?

Regards,

On Mon, 15 Sep 2014 17:37:24 +0200
Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

> Hi hackers,
> 
> An issue that seems related to this has been posted on pgsql-admin. See:
> 
>   http://www.postgresql.org/message-id/CAAS3tyLnXaYDZ0+zhXLPdVtOvHQOvR+jSPhp30o8kvWqQs0Tqw@mail.gmail.com
> 
> How can we help on this issue?
> 
> Cheers,
> 
> On Thu, 4 Sep 2014 17:50:36 +0200
> Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:
> 
> > Hi hackers,
> > 
> > Since few months, we occasionally see .ready files appearing on some slave
> >  instances from various context. The two I have in mind are under 9.2.x.
> > 
> > I tried to investigate a bit. These .ready files are created when a WAL file
> > from pg_xlog has no corresponding file in pg_xlog/archive_status. I could
> > easily experience this by deleting such a file: it is created again at the
> > next restartpoint or checkpoint received from the master.
> > 
> > Looking at the WAL in pg_xlog folder corresponding to these .ready files,
> > they are all much older than the current WAL "cycle" in both mtime and name
> > logic sequence. As instance on one of these box we have currently 6 of
> > those "ghost" WALs:
> > 
> >   0000000200001E53000000FF
> >   0000000200001F18000000FF
> >   0000000200002047000000FF
> >   00000002000020BF000000FF
> >   0000000200002140000000FF
> >   0000000200002370000000FF
> >   000000020000255D000000A8
> >   000000020000255D000000A9
> >   [...normal WAL sequence...]
> >   000000020000255E0000009D
> > 
> > And on another box:
> > 
> >   000000010000040E000000FF
> >   0000000100000414000000DA
> >   000000010000046E000000FF
> >   0000000100000470000000FF
> >   00000001000004850000000F
> >   000000010000048500000010
> >   [...normal WAL sequence...]
> >   000000010000048500000052
> > 
> > So it seems for some reasons, these old WALs were "forgotten" by the
> > restartpoint mechanism when they should have been recylced/deleted. 
> > 
> > For one of these servers, I could correlate this with some brutal
> > disconnection of the streaming replication appearing in its logs. But there
> > was no known SR disconnection on the second one.
> > 
> > Any idea about this weird behaviour? What can we do to help you investigate
> > further?
> > 
> > Regards,
> 
> 
> 

-- 
Jehan-Guillaume de Rorthais
Dalibo
http://www.dalibo.com

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Michael Paquier

Date:

08 October 2014, 07:45:04

On Fri, Sep 19, 2014 at 1:07 AM, Jehan-Guillaume de Rorthais <jgdr@dalibo.com> wrote:

We kept the WAL files and log files for further analysis. How can we help
regarding this issue?

Commit c2f79ba has added as assumption that the WAL receiver should always enforce the create of .done files when WAL files are done being streamed (XLogWalRcvWrite and WalReceiverMain) or archived (KeepFileRestoredFromArchive). Then using this assumption 1bd42cd has changed a bit RemoveOldXlogFiles, removing a check looking if the node is in recovery. Now, based on the information given here yes it happens that there are still cases where .done file creation is not correctly done, leading to those extra files. Even by looking at the code, I am not directly seeing any code paths where an extra call to XLogArchiveForceDone would be needed on the WAL receiver side but... Something like the patch attached (which is clearly a band-aid) may help though as it would make files to be removed even if they are not marked as .done for a node in recovery. And this is consistent with the pre-1bd42cd.

Comments welcome.

Michael

Attachment

20141008_xlog_file_enforce_deletion.patch

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Heikki Linnakangas

Date:

08 October 2014, 09:54:24

On 10/08/2014 10:44 AM, Michael Paquier wrote:
> On Fri, Sep 19, 2014 at 1:07 AM, Jehan-Guillaume de Rorthais <
> jgdr@dalibo.com> wrote:
>
>> We kept the WAL files and log files for further analysis. How can we help
>> regarding this issue?
>>
>
> Commit c2f79ba has added as assumption that the WAL receiver should always
> enforce the create of .done files when WAL files are done being streamed
> (XLogWalRcvWrite and WalReceiverMain) or archived
> (KeepFileRestoredFromArchive). Then using this assumption 1bd42cd has
> changed a bit RemoveOldXlogFiles, removing a check looking if the node is
> in recovery. Now, based on the information given here yes it happens that
> there are still cases where .done file creation is not correctly done,
> leading to those extra files. Even by looking at the code, I am not
> directly seeing any code paths where an extra call to XLogArchiveForceDone
> would be needed on the WAL receiver side but... Something like the patch
> attached (which is clearly a band-aid) may help though as it would make
> files to be removed even if they are not marked as .done for a node in
> recovery. And this is consistent with the pre-1bd42cd.

There are two mysteries here:

1. Where do the FF files come from? In 9.2, FF-segments are not supposed 
to created, ever.

Since this only happens with streaming replication, the FF segments are 
probably being created by walreceiver. XLogWalRcvWrite is the function 
that opens the file. I don't see anything obviously wrong there. 
XLogWalRcvWrite opens the file corresponding the start position in the 
message received from the master. There is no check that the start 
position is valid, though; if the master sends a start position in the 
FF segment, walreceiver will merrily write it. So the problem could be 
in the walsender side. However, I don't see anything wrong there either.

I think we should add a check in walreceiver, to throw an error if the 
master sends an invalid WAL pointer, pointing to an FF segment.

2. Why are the .done files sometimes not being created?

I may have an explanation for that. Walreceiver creates a .done file 
when it closes an old segment and opens a new one. However, it does this 
only when it's about to start writing to the new segment, and still has 
the old segment open. If you stream the FE segment fully, but drop 
replication connection at exactly that point, the .done file is not 
created. That might sound unlikely, but it's actually pretty easy to 
trigger. Just do "select pg_switch_xlog()" in the master, followed by 
"pg_ctl stop -m i" and a restart.

The creation of the .done files seems quite unreliable anyway. If only a 
portion of a segment is streamed, we don't write a .done file for it, so 
we still have the original problem that we will try to archive the 
segment after failover, even though the master might already have 
archived it.

I looked again at the thread where this was discussed: 
http://www.postgresql.org/message-id/flat/CAHGQGwHVYqbX=A+zo+AvFbVHLGoypO9G_QDKbabeXgXBVGd05g@mail.gmail.com. 
I believe the idea was that the server that generates a WAL segment is 
always responsible for archiving it. A standby should never attempt to 
archive a WAL segment that was restored from the archive, or streamed 
from the master.

In that thread, it was not discussed what should happen to WAL files 
that an admin manually copies into pg_xlog of the standby. Should the 
standby archive them? I don't think so - the admin should copy them 
manually to the archive too, if he wants them archived. It's a good and 
simple rule that the server that generates the WAL, archives the WAL.

Instead of creating any .done files during recovery, we could scan 
pg_xlog at promotion, and create a .done file for every WAL segment 
that's present at that point. That would be more robust. And then apply 
your patch, to recycle old segments during archive recovery, ignoring 
.done files.

- Heikki

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Michael Paquier

Date:

08 October 2014, 13:00:48

On Wed, Oct 8, 2014 at 6:54 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

1. Where do the FF files come from? In 9.2, FF-segments are not supposed to created, ever.

Since this only happens with streaming replication, the FF segments are probably being created by walreceiver. XLogWalRcvWrite is the function that opens the file. I don't see anything obviously wrong there. XLogWalRcvWrite opens the file corresponding the start position in the message received from the master. There is no check that the start position is valid, though; if the master sends a start position in the FF segment, walreceiver will merrily write it. So the problem could be in the walsender side. However, I don't see anything wrong there either.

Good to hear that. By looking at the wal receiver and sender code paths, I found nothing really weird.

I think we should add a check in walreceiver, to throw an error if the master sends an invalid WAL pointer, pointing to an FF segment.

Then we're good for a check in ProcessWalSndrMessage for walEnd I guess. Seems like a straight-forward patch.

2. Why are the .done files sometimes not being created?

I may have an explanation for that. Walreceiver creates a .done file when it closes an old segment and opens a new one. However, it does this only when it's about to start writing to the new segment, and still has the old segment open. If you stream the FE segment fully, but drop replication connection at exactly that point, the .done file is not created. That might sound unlikely, but it's actually pretty easy to trigger. Just do "select pg_switch_xlog()" in the master, followed by "pg_ctl stop -m i" and a restart.

That's exactly the test I have been doing a couple of times to trigger this behavior before sending my previous email, but without success on the standby with master: all the WAL files were marked as .done. Now, I have just retried it, with far more tries on REL9_3_STABLE and on HEAD and I have been able to actually trigger the problem a couple of times. Simply run a long transaction generating a lot of WAL like that:
=# create table aa as select generate_series(1,1000000000);
And then run that:
$ psql -c 'select pg_switch_xlog()'; pg_ctl stop -m immediate; pg_ctl start

And with enough "luck", .ready files may appear. It may take a dozen of tries before seeing at least ones. And I noticed that generally multiple .ready files appear at the same time.

The creation of the .done files seems quite unreliable anyway. If only a portion of a segment is streamed, we don't write a .done file for it, so we still have the original problem that we will try to archive the segment after failover, even though the master might already have archived it.

Yep. Agreed.

I looked again at the thread where this was discussed: http://www.postgresql.org/message-id/flat/CAHGQGwHVYqbX=A+zo+AvFbVHLGoypO9G_QDKbabeXgXBVGd05g@mail.gmail.com. I believe the idea was that the server that generates a WAL segment is always responsible for archiving it. A standby should never attempt to archive a WAL segment that was restored from the archive, or streamed from the master.

In that thread, it was not discussed what should happen to WAL files that an admin manually copies into pg_xlog of the standby. Should the standby archive them? I don't think so - the admin should copy them manually to the archive too, if he wants them archived. It's a good and simple rule that the server that generates the WAL, archives the WAL.

Question time: why has the check based on recovery state of the node been removed in 1bd42cd? Just assuming, but did you have in mind that relying on XLogArchiveForceDone and XLogArchiveCheckDone was enough and more robust at this point?

Instead of creating any .done files during recovery, we could scan pg_xlog at promotion, and create a .done file for every WAL segment that's present at that point. That would be more robust. And then apply your patch, to recycle old segments during archive recovery, ignoring .done files.

The additional process at promotion sounds like a good idea, I'll try to get a patch done tomorrow. This would result as well in removing the XLogArchiveForceDone stuff. Either way, not that I have been able to reproduce the problem manually, things can be clearly solved.

Regards,

--
Michael

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Fujii Masao

Date:

08 October 2014, 13:59:31

On Wed, Oct 8, 2014 at 6:54 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 10/08/2014 10:44 AM, Michael Paquier wrote:
>>
>> On Fri, Sep 19, 2014 at 1:07 AM, Jehan-Guillaume de Rorthais <
>> jgdr@dalibo.com> wrote:
>>
>>> We kept the WAL files and log files for further analysis. How can we help
>>> regarding this issue?
>>>
>>
>> Commit c2f79ba has added as assumption that the WAL receiver should always
>> enforce the create of .done files when WAL files are done being streamed
>> (XLogWalRcvWrite and WalReceiverMain) or archived
>> (KeepFileRestoredFromArchive). Then using this assumption 1bd42cd has
>> changed a bit RemoveOldXlogFiles, removing a check looking if the node is
>> in recovery. Now, based on the information given here yes it happens that
>> there are still cases where .done file creation is not correctly done,
>> leading to those extra files. Even by looking at the code, I am not
>> directly seeing any code paths where an extra call to XLogArchiveForceDone
>> would be needed on the WAL receiver side but... Something like the patch
>> attached (which is clearly a band-aid) may help though as it would make
>> files to be removed even if they are not marked as .done for a node in
>> recovery. And this is consistent with the pre-1bd42cd.
>
>
>
> There are two mysteries here:
>
> 1. Where do the FF files come from? In 9.2, FF-segments are not supposed to
> created, ever.
>
> Since this only happens with streaming replication, the FF segments are
> probably being created by walreceiver. XLogWalRcvWrite is the function that
> opens the file. I don't see anything obviously wrong there. XLogWalRcvWrite
> opens the file corresponding the start position in the message received from
> the master. There is no check that the start position is valid, though; if
> the master sends a start position in the FF segment, walreceiver will
> merrily write it. So the problem could be in the walsender side. However, I
> don't see anything wrong there either.
>
> I think we should add a check in walreceiver, to throw an error if the
> master sends an invalid WAL pointer, pointing to an FF segment.
>
>
> 2. Why are the .done files sometimes not being created?
>
> I may have an explanation for that. Walreceiver creates a .done file when it
> closes an old segment and opens a new one. However, it does this only when
> it's about to start writing to the new segment, and still has the old
> segment open. If you stream the FE segment fully, but drop replication
> connection at exactly that point, the .done file is not created. That might
> sound unlikely, but it's actually pretty easy to trigger. Just do "select
> pg_switch_xlog()" in the master, followed by "pg_ctl stop -m i" and a
> restart.
>
> The creation of the .done files seems quite unreliable anyway. If only a
> portion of a segment is streamed, we don't write a .done file for it, so we
> still have the original problem that we will try to archive the segment
> after failover, even though the master might already have archived it.
>
> I looked again at the thread where this was discussed:
> http://www.postgresql.org/message-id/flat/CAHGQGwHVYqbX=A+zo+AvFbVHLGoypO9G_QDKbabeXgXBVGd05g@mail.gmail.com.
> I believe the idea was that the server that generates a WAL segment is
> always responsible for archiving it. A standby should never attempt to
> archive a WAL segment that was restored from the archive, or streamed from
> the master.
>
> In that thread, it was not discussed what should happen to WAL files that an
> admin manually copies into pg_xlog of the standby. Should the standby
> archive them? I don't think so - the admin should copy them manually to the
> archive too, if he wants them archived. It's a good and simple rule that the
> server that generates the WAL, archives the WAL.
>
> Instead of creating any .done files during recovery, we could scan pg_xlog
> at promotion, and create a .done file for every WAL segment that's present
> at that point. That would be more robust. And then apply your patch, to
> recycle old segments during archive recovery, ignoring .done files.

What happens if a user shutdowns the standby, removes recovery.conf and
starts the server as the master? In this case, no WAL files have .done status
files, so the server will create .ready and archive all of them. Probably this
is problematic. So even if we adopt your idea, ISTM that it's better to create
.done file whenever WAL file is fullly streamed and restored.

Regards,

-- 
Fujii Masao

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Heikki Linnakangas

Date:

08 October 2014, 14:39:38

On 10/08/2014 04:59 PM, Fujii Masao wrote:
> On Wed, Oct 8, 2014 at 6:54 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> Instead of creating any .done files during recovery, we could scan pg_xlog
>> at promotion, and create a .done file for every WAL segment that's present
>> at that point. That would be more robust. And then apply your patch, to
>> recycle old segments during archive recovery, ignoring .done files.
>
> What happens if a user shutdowns the standby, removes recovery.conf and
> starts the server as the master?

Um, that's not a safe thing to do anyway, is it?

- Heikki

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Michael Paquier

Date:

08 October 2014, 23:24:05

<div dir="ltr"><br /><div class="gmail_extra"><br /><div class="gmail_quote">On Wed, Oct 8, 2014 at 11:38 PM, Heikki
Linnakangas<span dir="ltr"><<a href="mailto:hlinnakangas@vmware.com"
target="_blank">hlinnakangas@vmware.com</a>></span>wrote:<br /><blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px#ccc solid;padding-left:1ex"><span class="">On 10/08/2014 04:59 PM, Fujii Masao wrote:<br
/></span><blockquoteclass="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><span
class="">On Wed, Oct 8, 2014 at 6:54 PM, Heikki Linnakangas<br /> <<a href="mailto:hlinnakangas@vmware.com"
target="_blank">hlinnakangas@vmware.com</a>>wrote:<br /></span><span class=""><blockquote class="gmail_quote"
style="margin:00 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> Instead of creating any .done files during
recovery,we could scan pg_xlog<br /> at promotion, and create a .done file for every WAL segment that's present<br />
atthat point. That would be more robust. And then apply your patch, to<br /> recycle old segments during archive
recovery,ignoring .done files.<br /></blockquote><br /> What happens if a user shutdowns the standby, removes
recovery.confand<br /> starts the server as the master?<br /></span></blockquote><br /> Um, that's not a safe thing to
doanyway, is it?<span class="HOEnZb"><font color="#888888"><br /></font></span></blockquote><div
class="gmail_extra">That'snot safe as it bypasses all the consistency checks of promotion. Now, it is also something
thatrepmgr for example does as far as I recall to do a node "promotion". What if we simply document the problem
properlythen? The apparition of those phantom WAL files is more scary than a user or a utility that does a promotion
witha server restart. Not to mention as well that users as free to add themselves files to pg_xlog.<br />-- <br
/>Michael<br/></div></div></div></div>

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Michael Paquier

Date:

09 October 2014, 06:26:55

On Wed, Oct 8, 2014 at 10:00 PM, Michael Paquier <michael.paquier@gmail.com> wrote:

The additional process at promotion sounds like a good idea, I'll try to get a patch done tomorrow. This would result as well in removing the XLogArchiveForceDone stuff. Either way, not that I have been able to reproduce the problem manually, things can be clearly solved.

Please find attached two patches aimed to fix this issue and to improve the situation:

- 0001 prevents the apparition of those phantom WAL segment file by ensuring that when a node is in recovery it will remove it whatever its status in archive_status. This patch is the real fix, and should be applied down to 9.2.

- 0002 is a patch implementing Heikki's idea of enforcing all the segment files present in pg_xlog to have their status to .done, marking them for removal. When looking at the code, I finally concluded that Fujii-san's point, about marking in all cases as .done segment files that have been fully streamed, actually makes more sense to not be backward. This patch would actually not be mandatory for back-patching, but it makes the process more robust IMO.

I imagine that it would be nice to get those things fixed before the next minor release.

Regards,

--
Michael

Attachment

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Michael Paquier

Date:

09 October 2014, 08:09:36

On Wed, Oct 8, 2014 at 6:54 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> 1. Where do the FF files come from? In 9.2, FF-segments are not supposed to
> created, ever.
>
> I think we should add a check in walreceiver, to throw an error if the
> master sends an invalid WAL pointer, pointing to an FF segment.
Attached is a patch for that. This would be needed for PG < 9.3 if applied.
Regards,
--
Michael

Attachment

20141009_walreceiver_check_invalid.patch

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Fujii Masao

Date:

17 October 2014, 12:23:55

On Thu, Oct 9, 2014 at 3:26 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
>
>
> On Wed, Oct 8, 2014 at 10:00 PM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>>
>> The additional process at promotion sounds like a good idea, I'll try to
>> get a patch done tomorrow. This would result as well in removing the
>> XLogArchiveForceDone stuff. Either way, not that I have been able to
>> reproduce the problem manually, things can be clearly solved.
>
> Please find attached two patches aimed to fix this issue and to improve the
> situation:
> - 0001 prevents the apparition of those phantom WAL segment file by ensuring
> that when a node is in recovery it will remove it whatever its status in
> archive_status. This patch is the real fix, and should be applied down to
> 9.2.
> - 0002 is a patch implementing Heikki's idea of enforcing all the segment
> files present in pg_xlog to have their status to .done, marking them for
> removal. When looking at the code, I finally concluded that Fujii-san's
> point, about marking in all cases as .done segment files that have been
> fully streamed, actually makes more sense to not be backward. This patch
> would actually not be mandatory for back-patching, but it makes the process
> more robust IMO.

Thanks for the patches!

I found one problem in the 0002 patch. The patch changes the recovery so that
it creates .done files for every WAL files which exist in pg_xlog directory at
the end of recovery. But even WAL files which will have to be archived later
can exist in pg_xlog at that moment. For example, the latest, recycled and
fully-written-but-not-archived-yet (i.e., maybe having .ready files) WAL files.
The patch wrongly prevents them from being archvied at all.

ISTM that the 0001 patch has the similar problem. Please imagine the following
scenario.

1. There are many unarchived WAL files in pg_xlog because of the continuous   failure of archive_command, for example,
andthen the server unfortunately   crashes because of the corruption of database itself.

2. DBA restores the backup onto the server and copies all the WAL files    from old pg_xlog to new one. Then he or she
preparesfor archive recovery.

3. DBA starts the server and the archive recovery starts.

4. After all the archived WAL files are replayed, the server tries to replay    the WAL files in pg_xlog. Since there
aremany WAL files in pg_xlog,    more than one restartpoints happen while they are being replayed.

In this case, the patch seems to make the restartpoint recycle even WAL files
which have .ready files and will have to be archived later. Thought?

Regards,

-- 
Fujii Masao

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Fujii Masao

Date:

17 October 2014, 13:12:14

On Fri, Oct 17, 2014 at 9:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Thu, Oct 9, 2014 at 3:26 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>>
>>
>> On Wed, Oct 8, 2014 at 10:00 PM, Michael Paquier <michael.paquier@gmail.com>
>> wrote:
>>>
>>> The additional process at promotion sounds like a good idea, I'll try to
>>> get a patch done tomorrow. This would result as well in removing the
>>> XLogArchiveForceDone stuff. Either way, not that I have been able to
>>> reproduce the problem manually, things can be clearly solved.
>>
>> Please find attached two patches aimed to fix this issue and to improve the
>> situation:
>> - 0001 prevents the apparition of those phantom WAL segment file by ensuring
>> that when a node is in recovery it will remove it whatever its status in
>> archive_status. This patch is the real fix, and should be applied down to
>> 9.2.
>> - 0002 is a patch implementing Heikki's idea of enforcing all the segment
>> files present in pg_xlog to have their status to .done, marking them for
>> removal. When looking at the code, I finally concluded that Fujii-san's
>> point, about marking in all cases as .done segment files that have been
>> fully streamed, actually makes more sense to not be backward. This patch
>> would actually not be mandatory for back-patching, but it makes the process
>> more robust IMO.
>
> Thanks for the patches!

While reviewing the patch, I found another bug related to WAL file in recovery
mode. The problem is that exitArchiveRecovery() always creates .ready file for
the last WAL file of the old timeline even when it's restored from the archive
and has .done file. So this causes the already-archived WAL file to be archived
again.... Attached patch fixes this bug.

Regards,

--
Fujii Masao

Attachment

bugfix-last-restored-wal.patch

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Michael Paquier

Date:

17 October 2014, 13:37:19

<div dir="ltr">On Fri, Oct 17, 2014 at 9:23 PM, Fujii Masao <span dir="ltr"><<a href="mailto:masao.fujii@gmail.com"
target="_blank">masao.fujii@gmail.com</a>></span>wrote:<br /><div class="gmail_extra"><div
class="gmail_quote"><blockquoteclass="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc
solid;padding-left:1ex">In this case, the patch seems to make the restartpoint recycle even WAL files<br /> which have
.readyfiles and will have to be archived later. Thought?<span><font color="#888888"><br
/></font></span></blockquote></div>Thereal problem currently is that it is possible to have a segment file not marked
as.done during recovery when stream connection is abruptly cut when this segment is switched, marking it as .ready in
archive_statusand simply letting this segment in pg_xlog because it will neither be recycled nor removed. I have not
beenable to look much at this code these days, so I am not sure how invasive it would be in back-branches, but perhaps
weshould try to improve code such as when a segment file is switched and connection to the is cut, we guarantee that
thisfile is completed and marked as .done.<br />-- <br />Michael<br /></div></div>

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Michael Paquier

Date:

20 October 2014, 06:26:38

On Fri, Oct 17, 2014 at 10:37 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
>
> On Fri, Oct 17, 2014 at 9:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>
>> In this case, the patch seems to make the restartpoint recycle even WAL files
>> which have .ready files and will have to be archived later. Thought?
>
> The real problem currently is that it is possible to have a segment file not marked as .done during recovery when
streamconnection is abruptly cut when this segment is switched, marking it as .ready in archive_status and simply
lettingthis segment in pg_xlog because it will neither be recycled nor removed. I have not been able to look much at
thiscode these days, so I am not sure how invasive it would be in back-branches, but perhaps we should try to improve
codesuch as when a segment file is switched and connection to the is cut, we guarantee that this file is completed and
markedas .done. 

I have spent more time on that, with a bit more of underground work...
First, the problem can be reproduced most of the time by running this
simple command:
psql -c 'select pg_switch_xlog()'; pg_ctl restart -m immediate

This will enforce a segment file switch and restart the master in
crash recovery. This has as effect to immediately cut the WAL stream
on slave, symbolized by a FATAL in libpqrcv_receive where rawlen == 0.
For example, let's imagine that stream fails when switching from
000000010000000000000003 to the next segment, then the
the last XLogRecPtr in XLogWalRcvProcessMsg for dataStart is for
example 0/3100000, and walrcv->latestWalEnd is 0/4000000. When stream
restarts it will begin once again from 0/4000000, ignoring that
000000010000000000000003 should be marked as .done, ultimately marking
it in .ready state when old segment files are recycled or removed.
There is nothing that can really be done to enforce the creation of a
.done file before the FATAL of libpqrcv_receive because we cannot
predict the stream failure..

Now, we can do better than what we have now by looking at WAL start
position used when starting streaming in WAL receiver and enforce
.done if the start position is the last one of previous segment.
Hence, in the case of start position 0/4000000 that was found
previously, the file that will be enforced to .done is
000000010000000000000003. I have written the patch attached that
implements this idea and fixes the problem. Now let's see if you guys
see any flaws in this simple logic which uses a sniper gun instead of
a bazooka as in the previous patches sent.

Regards,
--
Michael

Attachment

20141020_wal_phantom_fix.patch

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Michael Paquier

Date:

20 October 2014, 08:15:31

On Fri, Oct 17, 2014 at 10:12 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Fri, Oct 17, 2014 at 9:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Thu, Oct 9, 2014 at 3:26 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>>
>>
>> On Wed, Oct 8, 2014 at 10:00 PM, Michael Paquier <michael.paquier@gmail.com>
>> wrote:
>>>
>>> The additional process at promotion sounds like a good idea, I'll try to
>>> get a patch done tomorrow. This would result as well in removing the
>>> XLogArchiveForceDone stuff. Either way, not that I have been able to
>>> reproduce the problem manually, things can be clearly solved.
>>
>> Please find attached two patches aimed to fix this issue and to improve the
>> situation:
>> - 0001 prevents the apparition of those phantom WAL segment file by ensuring
>> that when a node is in recovery it will remove it whatever its status in
>> archive_status. This patch is the real fix, and should be applied down to
>> 9.2.
>> - 0002 is a patch implementing Heikki's idea of enforcing all the segment
>> files present in pg_xlog to have their status to .done, marking them for
>> removal. When looking at the code, I finally concluded that Fujii-san's
>> point, about marking in all cases as .done segment files that have been
>> fully streamed, actually makes more sense to not be backward. This patch
>> would actually not be mandatory for back-patching, but it makes the process
>> more robust IMO.
>
> Thanks for the patches!

While reviewing the patch, I found another bug related to WAL file in recovery
mode. The problem is that exitArchiveRecovery() always creates .ready file for
the last WAL file of the old timeline even when it's restored from the archive
and has .done file. So this causes the already-archived WAL file to be archived
again.... Attached patch fixes this bug.

That's a good catch! Patch looks good. I think that you should change xlogpath to use MAXFNAMELEN instead of MAXPGPATH in exitArchiveRecovery. This is only for correctness, so that's a master-only remark, because this variable is used only to calculate a segment file name, and not a path. Renaming the variable from xlogpath to xlogname would make sense as well.

Regards,

--
Michael

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Heikki Linnakangas

Date:

22 October 2014, 11:54:38

On 10/20/2014 09:26 AM, Michael Paquier wrote:
> On Fri, Oct 17, 2014 at 10:37 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>>
>> On Fri, Oct 17, 2014 at 9:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>>
>>> In this case, the patch seems to make the restartpoint recycle even WAL files
>>> which have .ready files and will have to be archived later. Thought?
>>
>> The real problem currently is that it is possible to have a segment file not marked as .done during recovery when
streamconnection is abruptly cut when this segment is switched, marking it as .ready in archive_status and simply
lettingthis segment in pg_xlog because it will neither be recycled nor removed. I have not been able to look much at
thiscode these days, so I am not sure how invasive it would be in back-branches, but perhaps we should try to improve
codesuch as when a segment file is switched and connection to the is cut, we guarantee that this file is completed and
markedas .done.
 
>
> I have spent more time on that, with a bit more of underground work...
> First, the problem can be reproduced most of the time by running this
> simple command:
> psql -c 'select pg_switch_xlog()'; pg_ctl restart -m immediate
>
> This will enforce a segment file switch and restart the master in
> crash recovery. This has as effect to immediately cut the WAL stream
> on slave, symbolized by a FATAL in libpqrcv_receive where rawlen == 0.
> For example, let's imagine that stream fails when switching from
> 000000010000000000000003 to the next segment, then the
> the last XLogRecPtr in XLogWalRcvProcessMsg for dataStart is for
> example 0/3100000, and walrcv->latestWalEnd is 0/4000000. When stream
> restarts it will begin once again from 0/4000000, ignoring that
> 000000010000000000000003 should be marked as .done, ultimately marking
> it in .ready state when old segment files are recycled or removed.
> There is nothing that can really be done to enforce the creation of a
> .done file before the FATAL of libpqrcv_receive because we cannot
> predict the stream failure..
>
> Now, we can do better than what we have now by looking at WAL start
> position used when starting streaming in WAL receiver and enforce
> .done if the start position is the last one of previous segment.
> Hence, in the case of start position 0/4000000 that was found
> previously, the file that will be enforced to .done is
> 000000010000000000000003. I have written the patch attached that
> implements this idea and fixes the problem. Now let's see if you guys
> see any flaws in this simple logic which uses a sniper gun instead of
> a bazooka as in the previous patches sent.

Hmm. This will still miss the .done file if you don't re-establish the 
streaming replication connection after the restart. For example, if you 
shut down the master, and promote the standby server.

I think we should take a more wholesale approach to this. We should 
enforce the rule that the server only ever archives WAL files belonging 
to the same timeline that the server generates. IOW, the server only 
archives the WAL that it has generated.

- Heikki

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Michael Paquier

Date:

22 October 2014, 13:20:57

On Fri, Oct 17, 2014 at 2:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

I found one problem in the 0002 patch. The patch changes the recovery so that
it creates .done files for every WAL files which exist in pg_xlog directory at
the end of recovery. But even WAL files which will have to be archived later
can exist in pg_xlog at that moment. For example, the latest, recycled and
fully-written-but-not-archived-yet (i.e., maybe having .ready files) WAL files.
The patch wrongly prevents them from being archived at all.

Re-looking at patch 2, yes you are right. Even if it was mentioned that we should do that for a node that had promotion triggered it was not done this way as a check on CheckForStandbyTrigger() is actually missing. Attached is an updated patch.

ISTM that the 0001 patch has the similar problem. Please imagine the following
scenario.
In this case, the patch seems to make the restartpoint recycle even WAL files
which have .ready files and will have to be archived later. Thought?

Right, that's really backward. This was the approach taken before c9cc7e0, and this commit actually prevents removal of unarchived WAL files during recovery.

--
Michael

Attachment

20141022_enforce_done_at_promotion.patch

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Heikki Linnakangas

Date:

22 October 2014, 19:29:07

On 10/22/2014 04:24 PM, Michael Paquier wrote:
> On Wed, Oct 22, 2014 at 1:53 PM, Heikki Linnakangas <hlinnakangas@vmware.com
>> wrote:
>
>> I think we should take a more wholesale approach to this. We should
>> enforce the rule that the server only ever archives WAL files belonging to
>> the same timeline that the server generates. IOW, the server only archives
>> the WAL that it has generated.
>
> Hm?! Would that be really back-patchable? There may be in the wild tools or
> users that rely on the fact a node archives segment files from all
> timelines.

Hmm, so it would be a tool or user that manually copies a file to the 
pg_xlog directory of a standby server, and expects the standby to 
archive the file after promotion. That seems a bit far-fetched, although 
I've seen people do strange things. I think it would be acceptable as 
long as we document the change in behavior in the release notes. I don't 
have much hope that we'll ever be able to nail down the correct behavior 
with the current approach.

- Heikki

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Fujii Masao

Date:

23 October 2014, 05:59:26

On Mon, Oct 20, 2014 at 3:26 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Fri, Oct 17, 2014 at 10:37 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>>
>> On Fri, Oct 17, 2014 at 9:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>>
>>> In this case, the patch seems to make the restartpoint recycle even WAL files
>>> which have .ready files and will have to be archived later. Thought?
>>
>> The real problem currently is that it is possible to have a segment file not marked as .done during recovery when
streamconnection is abruptly cut when this segment is switched, marking it as .ready in archive_status and simply
lettingthis segment in pg_xlog because it will neither be recycled nor removed. I have not been able to look much at
thiscode these days, so I am not sure how invasive it would be in back-branches, but perhaps we should try to improve
codesuch as when a segment file is switched and connection to the is cut, we guarantee that this file is completed and
markedas .done. 
>
> I have spent more time on that, with a bit more of underground work...
> First, the problem can be reproduced most of the time by running this
> simple command:
> psql -c 'select pg_switch_xlog()'; pg_ctl restart -m immediate

What about fixing this problem directly? That is, we can make walreceiver
check whether the end of last received WAL data is the end of current WAL file
or not, and then close the WAL file and create .done file if the test is true.

This is not a perfect solution. If the standby crashes during very
short interval
(i.e., after closing the WAL file and before creating .done file), the problem
would happen again. But it can really rarely happen, so I don't think that it's
worth fixing the corner case at least in back-branches. Of course, we can
find out the "perfect" solution for the master, though.

Regards,

--
Fujii Masao

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Fujii Masao

Date:

23 October 2014, 07:34:11

On Mon, Oct 20, 2014 at 5:15 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
>
>
> On Fri, Oct 17, 2014 at 10:12 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>
>> On Fri, Oct 17, 2014 at 9:23 PM, Fujii Masao <masao.fujii@gmail.com>
>> wrote:
>> > On Thu, Oct 9, 2014 at 3:26 PM, Michael Paquier
>> > <michael.paquier@gmail.com> wrote:
>> >>
>> >>
>> >> On Wed, Oct 8, 2014 at 10:00 PM, Michael Paquier
>> >> <michael.paquier@gmail.com>
>> >> wrote:
>> >>>
>> >>> The additional process at promotion sounds like a good idea, I'll try
>> >>> to
>> >>> get a patch done tomorrow. This would result as well in removing the
>> >>> XLogArchiveForceDone stuff. Either way, not that I have been able to
>> >>> reproduce the problem manually, things can be clearly solved.
>> >>
>> >> Please find attached two patches aimed to fix this issue and to improve
>> >> the
>> >> situation:
>> >> - 0001 prevents the apparition of those phantom WAL segment file by
>> >> ensuring
>> >> that when a node is in recovery it will remove it whatever its status
>> >> in
>> >> archive_status. This patch is the real fix, and should be applied down
>> >> to
>> >> 9.2.
>> >> - 0002 is a patch implementing Heikki's idea of enforcing all the
>> >> segment
>> >> files present in pg_xlog to have their status to .done, marking them
>> >> for
>> >> removal. When looking at the code, I finally concluded that Fujii-san's
>> >> point, about marking in all cases as .done segment files that have been
>> >> fully streamed, actually makes more sense to not be backward. This
>> >> patch
>> >> would actually not be mandatory for back-patching, but it makes the
>> >> process
>> >> more robust IMO.
>> >
>> > Thanks for the patches!
>>
>> While reviewing the patch, I found another bug related to WAL file in
>> recovery
>> mode. The problem is that exitArchiveRecovery() always creates .ready file
>> for
>> the last WAL file of the old timeline even when it's restored from the
>> archive
>> and has .done file. So this causes the already-archived WAL file to be
>> archived
>> again.... Attached patch fixes this bug.
>
> That's a good catch! Patch looks good. I think that you should change
> xlogpath to use MAXFNAMELEN instead of MAXPGPATH in exitArchiveRecovery.
> This is only for correctness, so that's a master-only remark, because this
> variable is used only to calculate a segment file name, and not a path.
> Renaming the variable from xlogpath to xlogname would make sense as well.

Thanks for the review! Applied.

Regards,

-- 
Fujii Masao

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Heikki Linnakangas

Date:

23 October 2014, 08:09:49

On 10/23/2014 08:59 AM, Fujii Masao wrote:
> On Mon, Oct 20, 2014 at 3:26 PM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
>> On Fri, Oct 17, 2014 at 10:37 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>>
>>> On Fri, Oct 17, 2014 at 9:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
>>>>
>>>> In this case, the patch seems to make the restartpoint recycle even WAL files
>>>> which have .ready files and will have to be archived later. Thought?
>>>
>>> The real problem currently is that it is possible to have a segment file not marked as .done during recovery when
streamconnection is abruptly cut when this segment is switched, marking it as .ready in archive_status and simply
lettingthis segment in pg_xlog because it will neither be recycled nor removed. I have not been able to look much at
thiscode these days, so I am not sure how invasive it would be in back-branches, but perhaps we should try to improve
codesuch as when a segment file is switched and connection to the is cut, we guarantee that this file is completed and
markedas .done.

>>
>> I have spent more time on that, with a bit more of underground work...
>> First, the problem can be reproduced most of the time by running this
>> simple command:
>> psql -c 'select pg_switch_xlog()'; pg_ctl restart -m immediate
>
> What about fixing this problem directly? That is, we can make walreceiver
> check whether the end of last received WAL data is the end of current WAL file
> or not, and then close the WAL file and create .done file if the test is true.
>
> This is not a perfect solution. If the standby crashes during very
> short interval
> (i.e., after closing the WAL file and before creating .done file), the problem
> would happen again. But it can really rarely happen, so I don't think that it's
> worth fixing the corner case at least in back-branches. Of course, we can
> find out the "perfect" solution for the master, though.

Sounds reasonable, for back-branches. Although I'm still worried we 
might miss some corner-case unless we go with a more wholesale solution.

At least for master, we should consider changing the way the archiving 
works so that we only archive WAL that was generated in the same server. 
I.e. we should never try to archive WAL files belonging to another timeline.

I just remembered that we discussed a different problem related to this 
some time ago, at 
http://www.postgresql.org/message-id/20131212.110002.204892575.horiguchi.kyotaro@lab.ntt.co.jp. 
The conclusion of that was that at promotion, we should not archive the 
last, partial, segment from the old timeline. Also, we could make the 
archive recovery more precise in what WAL files it tries to restore. For 
example, if we're doing recovery with target timeline 4, but the 
timeline history says that segment AB came from timeline 3 (i.e. the 
switch to timeline 4 happened within a later segment), we should not try 
to restore segment AB from timeline 4. It shouldn't exist, so trying to 
restore it is a waste of cycles, but more importantly, if there's some 
sort of confusion with the timelines, and the file exists anyway, we 
should not restore it. Also, if the timeline history says e.g. that we 
switched from TLI 3 to 4 at WAL position 0/AD123456, and we restore 
segment AD from timeline 3, we should stop recovering that segment at 
position 0/AD123456 and try to find the AD segment from timeline 4. We 
currently only do that when streaming, but we should also do that when 
restoring from archive.

In summary, let's do something small for back-branches, like what you 
suggested. But for master, let's do bigger changes to the timeline 
handling. Would you like to write a patch for the back-branches, if I 
start working on master?

- Heikki

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Michael Paquier

Date:

23 October 2014, 10:26:07

On Thu, Oct 23, 2014 at 10:09 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:

On 10/23/2014 08:59 AM, Fujii Masao wrote:
Sounds reasonable, for back-branches. Although I'm still worried we might miss some corner-case unless we go with a more wholesale solution.

Don't really want to be the intruder here, but isn't that the simple patch attached? There is still a small window between XLogWalRcvFlush and XLogArchiveForceDone in XLogWalRcvWrite if the standby crashes exactly between them.

--
Michael

Attachment

20141023_switch_walend.patch

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Heikki Linnakangas

Date:

23 October 2014, 11:45:41

On 10/23/2014 01:25 PM, Michael Paquier wrote:
> On Thu, Oct 23, 2014 at 10:09 AM, Heikki Linnakangas <
> hlinnakangas@vmware.com> wrote:
>
>> On 10/23/2014 08:59 AM, Fujii Masao wrote:
>> Sounds reasonable, for back-branches. Although I'm still worried we might
>> miss some corner-case unless we go with a more wholesale solution.
>>
>
> Don't really want to be the intruder here, but isn't that the simple patch
> attached?

That's not right. Should check *after* the write if the segment was
completed, and close it if so. Like the attached.

> There is still a small window between XLogWalRcvFlush and
> XLogArchiveForceDone in XLogWalRcvWrite if the standby crashes exactly
> between them.

Yeah. I think we can live with that.

- Heikki

Attachment

switch-walend-2.patch

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Fujii Masao

Date:

23 October 2014, 11:55:44

On Thu, Oct 23, 2014 at 5:09 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 10/23/2014 08:59 AM, Fujii Masao wrote:
>>
>> On Mon, Oct 20, 2014 at 3:26 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>>
>>> On Fri, Oct 17, 2014 at 10:37 PM, Michael Paquier
>>> <michael.paquier@gmail.com> wrote:
>>>>
>>>>
>>>> On Fri, Oct 17, 2014 at 9:23 PM, Fujii Masao <masao.fujii@gmail.com>
>>>> wrote:
>>>>>
>>>>>
>>>>> In this case, the patch seems to make the restartpoint recycle even WAL
>>>>> files
>>>>> which have .ready files and will have to be archived later. Thought?
>>>>
>>>>
>>>> The real problem currently is that it is possible to have a segment file
>>>> not marked as .done during recovery when stream connection is abruptly cut
>>>> when this segment is switched, marking it as .ready in archive_status and
>>>> simply letting this segment in pg_xlog because it will neither be recycled
>>>> nor removed. I have not been able to look much at this code these days, so I
>>>> am not sure how invasive it would be in back-branches, but perhaps we should
>>>> try to improve code such as when a segment file is switched and connection
>>>> to the is cut, we guarantee that this file is completed and marked as .done.
>>>
>>>
>>> I have spent more time on that, with a bit more of underground work...
>>> First, the problem can be reproduced most of the time by running this
>>> simple command:
>>> psql -c 'select pg_switch_xlog()'; pg_ctl restart -m immediate
>>
>>
>> What about fixing this problem directly? That is, we can make walreceiver
>> check whether the end of last received WAL data is the end of current WAL
>> file
>> or not, and then close the WAL file and create .done file if the test is
>> true.
>>
>> This is not a perfect solution. If the standby crashes during very
>> short interval
>> (i.e., after closing the WAL file and before creating .done file), the
>> problem
>> would happen again. But it can really rarely happen, so I don't think that
>> it's
>> worth fixing the corner case at least in back-branches. Of course, we can
>> find out the "perfect" solution for the master, though.
>
>
> Sounds reasonable, for back-branches. Although I'm still worried we might
> miss some corner-case unless we go with a more wholesale solution.

+1

> At least for master, we should consider changing the way the archiving works
> so that we only archive WAL that was generated in the same server. I.e. we
> should never try to archive WAL files belonging to another timeline.
>
> I just remembered that we discussed a different problem related to this some
> time ago, at
> http://www.postgresql.org/message-id/20131212.110002.204892575.horiguchi.kyotaro@lab.ntt.co.jp.
> The conclusion of that was that at promotion, we should not archive the
> last, partial, segment from the old timeline.

So, the last, partial, segment of the old timeline is never archived?
If yes, I'm afraid that the PITR to the old timeline cannot replay the
last segment. No? Or you're thinking to change the code so that
the segment of new timeline is replayed in that case?

> In summary, let's do something small for back-branches, like what you
> suggested. But for master, let's do bigger changes to the timeline handling.

Yep.

Regards,

-- 
Fujii Masao

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Fujii Masao

Date:

23 October 2014, 12:23:16

On Thu, Oct 23, 2014 at 8:45 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 10/23/2014 01:25 PM, Michael Paquier wrote:
>>
>> On Thu, Oct 23, 2014 at 10:09 AM, Heikki Linnakangas <
>> hlinnakangas@vmware.com> wrote:
>>
>>> On 10/23/2014 08:59 AM, Fujii Masao wrote:
>>> Sounds reasonable, for back-branches. Although I'm still worried we might
>>> miss some corner-case unless we go with a more wholesale solution.
>>>
>>
>> Don't really want to be the intruder here, but isn't that the simple patch
>> attached?
>
>
> That's not right. Should check *after* the write if the segment was
> completed, and close it if so. Like the attached.

Looks good to me. WalReceiverMain has almost the same code as
what XLogWalRcvFileClose does. So we can refactor that.

>> There is still a small window between XLogWalRcvFlush and
>> XLogArchiveForceDone in XLogWalRcvWrite if the standby crashes exactly
>> between them.
>
>
> Yeah. I think we can live with that.

Yes.

Regards,

-- 
Fujii Masao

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Fujii Masao

Date:

23 October 2014, 12:34:15

On Thu, Oct 23, 2014 at 9:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Thu, Oct 23, 2014 at 8:45 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> On 10/23/2014 01:25 PM, Michael Paquier wrote:
>>>
>>> On Thu, Oct 23, 2014 at 10:09 AM, Heikki Linnakangas <
>>> hlinnakangas@vmware.com> wrote:
>>>
>>>> On 10/23/2014 08:59 AM, Fujii Masao wrote:
>>>> Sounds reasonable, for back-branches. Although I'm still worried we might
>>>> miss some corner-case unless we go with a more wholesale solution.
>>>>
>>>
>>> Don't really want to be the intruder here, but isn't that the simple patch
>>> attached?
>>
>>
>> That's not right. Should check *after* the write if the segment was
>> completed, and close it if so. Like the attached.
>
> Looks good to me. WalReceiverMain has almost the same code as
> what XLogWalRcvFileClose does. So we can refactor that.

While looking at the code of WAL archiving and recovery, I found
another small issue. The standby always creates .ready file for
the timeline history file even when WAL archiving is not enabled.
Since WAL archiving is off, that .ready file will remain infinitely.
Probably this is almost harmless but confusing, so I'd like to fix that.
Patch attached. Thought?

Regards,

--
Fujii Masao

Attachment

timeline-archive.patch

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Michael Paquier

Date:

23 October 2014, 12:53:44

On Thu, Oct 23, 2014 at 2:34 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Thu, Oct 23, 2014 at 9:23 PM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Thu, Oct 23, 2014 at 8:45 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> On 10/23/2014 01:25 PM, Michael Paquier wrote:
>>>
>>> On Thu, Oct 23, 2014 at 10:09 AM, Heikki Linnakangas <
>>> hlinnakangas@vmware.com> wrote:
>>>
>>>> On 10/23/2014 08:59 AM, Fujii Masao wrote:
>>>> Sounds reasonable, for back-branches. Although I'm still worried we might
>>>> miss some corner-case unless we go with a more wholesale solution.
>>>>
>>>
>>> Don't really want to be the intruder here, but isn't that the simple patch
>>> attached?
>>
>>
>> That's not right. Should check *after* the write if the segment was
>> completed, and close it if so. Like the attached.
>
> Looks good to me. WalReceiverMain has almost the same code as
> what XLogWalRcvFileClose does. So we can refactor that.

While looking at the code of WAL archiving and recovery, I found
another small issue. The standby always creates .ready file for
the timeline history file even when WAL archiving is not enabled.
Since WAL archiving is off, that .ready file will remain infinitely.
Probably this is almost harmless but confusing, so I'd like to fix that.
Patch attached. Thought?

Good catch once again. We could as well put the check of XLogArchivingActive directly in XLogArchiveNotify...

--
Michael

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Michael Paquier

Date:

23 October 2014, 13:16:36

<div dir="ltr"><br /><div class="gmail_extra"><br /><div class="gmail_quote">On Thu, Oct 23, 2014 at 1:45 PM, Heikki
Linnakangas<span dir="ltr"><<a href="mailto:hlinnakangas@vmware.com"
target="_blank">hlinnakangas@vmware.com</a>></span>wrote:<br /><blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px#ccc solid;padding-left:1ex"><span class="">On 10/23/2014 01:25 PM, Michael Paquier wrote:<br
/><blockquoteclass="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> On Thu, Oct 23,
2014at 10:09 AM, Heikki Linnakangas <<br /><a href="mailto:hlinnakangas@vmware.com"
target="_blank">hlinnakangas@vmware.com</a>>wrote:<br /><br /><blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px#ccc solid;padding-left:1ex"> On 10/23/2014 08:59 AM, Fujii Masao wrote:<br /> Sounds reasonable,
forback-branches. Although I'm still worried we might<br /> miss some corner-case unless we go with a more wholesale
solution.<br/><br /></blockquote><br /> Don't really want to be the intruder here, but isn't that the simple patch<br
/>attached?<br /></blockquote><br /></span> That's not right. Should check *after* the write if the segment was
completed,and close it if so. Like the attached.<span class=""><br /><br /><blockquote class="gmail_quote"
style="margin:00 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"> There is still a small window between
XLogWalRcvFlushand<br /> XLogArchiveForceDone in XLogWalRcvWrite if the standby crashes exactly<br /> between them.<br
/></blockquote><br/></span> Yeah. I think we can live with that.<span class="HOEnZb"><font color="#888888"><br
/></font></span></blockquote></div>Justtested this patch with the combo pg_switch_xlog() + stop/immediate and the
apparitionof the .ready files is not fixed.</div><div class="gmail_extra">Regards,<br />-- <br /></div><div
class="gmail_extra">Michael<br/></div></div>

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Michael Paquier

Date:

23 October 2014, 14:19:44

On Thu, Oct 23, 2014 at 10:16 PM, Michael Paquier <michael.paquier@gmail.com> wrote:

On Thu, Oct 23, 2014 at 1:45 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
That's not right. Should check *after* the write if the segment was completed, and close it if so. Like the attached.

Just tested this patch with the combo pg_switch_xlog() + stop/immediate and the apparition of the .ready files is not fixed.

Btw, I think that we should pass walEnd to XLogWalRcvWrite and add an additional check based on that after the write() loop calls to enforce the segment to be switched to .done.

--
Michael

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Heikki Linnakangas

Date:

24 October 2014, 13:06:07

On 10/23/2014 11:09 AM, Heikki Linnakangas wrote:
> At least for master, we should consider changing the way the archiving
> works so that we only archive WAL that was generated in the same server.
> I.e. we should never try to archive WAL files belonging to another timeline.
>
> I just remembered that we discussed a different problem related to this
> some time ago, at
> http://www.postgresql.org/message-id/20131212.110002.204892575.horiguchi.kyotaro@lab.ntt.co.jp.
> The conclusion of that was that at promotion, we should not archive the
> last, partial, segment from the old timeline.

So, this is what I came up with for master. Does anyone see a problem
with it?

- Heikki

Attachment

0001-Don-t-try-to-archive-WAL-from-different-timelines.patch

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Michael Paquier

Date:

24 October 2014, 13:33:18

On Fri, Oct 24, 2014 at 3:05 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 10/23/2014 11:09 AM, Heikki Linnakangas wrote:
>>
>> At least for master, we should consider changing the way the archiving
>> works so that we only archive WAL that was generated in the same server.
>> I.e. we should never try to archive WAL files belonging to another
>> timeline.
>>
>> I just remembered that we discussed a different problem related to this
>> some time ago, at
>>
>> http://www.postgresql.org/message-id/20131212.110002.204892575.horiguchi.kyotaro@lab.ntt.co.jp.
>> The conclusion of that was that at promotion, we should not archive the
>> last, partial, segment from the old timeline.
>
>
> So, this is what I came up with for master. Does anyone see a problem with
> it?
Thinking long-term, this is a solid approach, so +1 for it. I just
tested the patch and the extra segment files do not show up anymore.
Patch looks good as well.
-- 
Michael

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Robert Haas

Date:

24 October 2014, 18:05:32

On Fri, Oct 24, 2014 at 9:05 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> So, this is what I came up with for master. Does anyone see a problem with
> it?

In the proposed commit message, you mis-spelled "significant" as "signicant".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Fujii Masao

Date:

27 October 2014, 12:12:48

On Fri, Oct 24, 2014 at 10:05 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 10/23/2014 11:09 AM, Heikki Linnakangas wrote:
>>
>> At least for master, we should consider changing the way the archiving
>> works so that we only archive WAL that was generated in the same server.
>> I.e. we should never try to archive WAL files belonging to another
>> timeline.
>>
>> I just remembered that we discussed a different problem related to this
>> some time ago, at
>>
>> http://www.postgresql.org/message-id/20131212.110002.204892575.horiguchi.kyotaro@lab.ntt.co.jp.
>> The conclusion of that was that at promotion, we should not archive the
>> last, partial, segment from the old timeline.
>
>
> So, this is what I came up with for master. Does anyone see a problem with
> it?

What about the problem that I raised upthread? This is, the patch
prevents the last, partial, WAL file of the old timeline from being archived.
So we can never PITR the database to the point that the last, partial WAL
file has. Isn't this problem? For example, please imagine the
following scenario.

1. The important data was deleted but no one noticed that. This deletion was   logged in last, partial WAL file.
2. The server crashed and DBA started an archive recovery from old backup.
3. After recovery, all WAL files of the old timeline were recycled.
4. Finally DBA noticed the loss of important data and tried to do PITR
to the point   where the data was deleted.

HOWEVER, the WAL file containing that deletion operation no longer exists.
So DBA will never be able to recover that important data ....

Regards,

-- 
Fujii Masao

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Heikki Linnakangas

Date:

27 October 2014, 16:13:00

On 10/27/2014 02:12 PM, Fujii Masao wrote:
> On Fri, Oct 24, 2014 at 10:05 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> On 10/23/2014 11:09 AM, Heikki Linnakangas wrote:
>>>
>>> At least for master, we should consider changing the way the archiving
>>> works so that we only archive WAL that was generated in the same server.
>>> I.e. we should never try to archive WAL files belonging to another
>>> timeline.
>>>
>>> I just remembered that we discussed a different problem related to this
>>> some time ago, at
>>>
>>> http://www.postgresql.org/message-id/20131212.110002.204892575.horiguchi.kyotaro@lab.ntt.co.jp.
>>> The conclusion of that was that at promotion, we should not archive the
>>> last, partial, segment from the old timeline.
>>
>>
>> So, this is what I came up with for master. Does anyone see a problem with
>> it?
>
> What about the problem that I raised upthread? This is, the patch
> prevents the last, partial, WAL file of the old timeline from being archived.
> So we can never PITR the database to the point that the last, partial WAL
> file has.

A partial WAL file is never archived in the master server to begin with, 
so if it's ever used in archive recovery, the administrator must have 
performed some manual action to copy the partial WAL file from the 
original server. When he does that, he can also copy it manually to the 
archive, or whatever he wants to do with it.

Note that the same applies to any complete, but not-yet archived WAL 
files. But we've never had any mechanism in place to archive those in 
the new instance, after PITR.

> Isn't this problem? For example, please imagine the
> following scenario.
>
> 1. The important data was deleted but no one noticed that. This deletion was
>      logged in last, partial WAL file.
> 2. The server crashed and DBA started an archive recovery from old backup.
> 3. After recovery, all WAL files of the old timeline were recycled.
> 4. Finally DBA noticed the loss of important data and tried to do PITR
> to the point
>      where the data was deleted.
>
> HOWEVER, the WAL file containing that deletion operation no longer exists.
> So DBA will never be able to recover that important data ....

I think you're missing a step above:

1.5: The administrator copies the last, partial WAL file (and any 
complete but not yet-archived files) to the new server's pg_xlog directory.

Without that, it won't be available for PITR anyway, and the new server 
won't see it or try to archive it, no matter what.

- Heikki

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Fujii Masao

Date:

28 October 2014, 12:07:36

On Tue, Oct 28, 2014 at 1:12 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 10/27/2014 02:12 PM, Fujii Masao wrote:
>>
>> On Fri, Oct 24, 2014 at 10:05 PM, Heikki Linnakangas
>> <hlinnakangas@vmware.com> wrote:
>>>
>>> On 10/23/2014 11:09 AM, Heikki Linnakangas wrote:
>>>>
>>>>
>>>> At least for master, we should consider changing the way the archiving
>>>> works so that we only archive WAL that was generated in the same server.
>>>> I.e. we should never try to archive WAL files belonging to another
>>>> timeline.
>>>>
>>>> I just remembered that we discussed a different problem related to this
>>>> some time ago, at
>>>>
>>>>
>>>> http://www.postgresql.org/message-id/20131212.110002.204892575.horiguchi.kyotaro@lab.ntt.co.jp.
>>>> The conclusion of that was that at promotion, we should not archive the
>>>> last, partial, segment from the old timeline.
>>>
>>>
>>>
>>> So, this is what I came up with for master. Does anyone see a problem
>>> with
>>> it?
>>
>>
>> What about the problem that I raised upthread? This is, the patch
>> prevents the last, partial, WAL file of the old timeline from being
>> archived.
>> So we can never PITR the database to the point that the last, partial WAL
>> file has.
>
>
> A partial WAL file is never archived in the master server to begin with, so
> if it's ever used in archive recovery, the administrator must have performed
> some manual action to copy the partial WAL file from the original server.

Yes.

> When he does that, he can also copy it manually to the archive, or whatever
> he wants to do with it.

Yes, but currently he doesn't need to do that manually. At the end of
archive recovery, .ready file is created and the partial, last, WAL file of
the old timeline would be archived later. It's never recycled until it's
archived. So when subsequent archive recovery is necessary, we can expect
that it exists in either pg_xlog or archival storage. OTOH, if the patch is
applied, it's never archived and may have been already recycled, and then
subsequent archive recovery would fail.

Therefore, the problem that the patch can cause is that the partial WAL file
may be recycled even when it has not been archived yet. The partial WAL file
may not be able to be archived for some reasons, but that's not directly
a problem if it exists in pg_xlog.

Regards,

-- 
Fujii Masao

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Heikki Linnakangas

Date:

28 October 2014, 13:48:31

On 10/27/2014 06:12 PM, Heikki Linnakangas wrote:
> On 10/27/2014 02:12 PM, Fujii Masao wrote:
>> >On Fri, Oct 24, 2014 at 10:05 PM, Heikki Linnakangas
>> ><hlinnakangas@vmware.com>  wrote:
>>> >>On 10/23/2014 11:09 AM, Heikki Linnakangas wrote:
>>>> >>>
>>>> >>>At least for master, we should consider changing the way the archiving
>>>> >>>works so that we only archive WAL that was generated in the same server.
>>>> >>>I.e. we should never try to archive WAL files belonging to another
>>>> >>>timeline.
>>>> >>>
>>>> >>>I just remembered that we discussed a different problem related to this
>>>> >>>some time ago, at
>>>> >>>
>>>> >>>http://www.postgresql.org/message-id/20131212.110002.204892575.horiguchi.kyotaro@lab.ntt.co.jp.
>>>> >>>The conclusion of that was that at promotion, we should not archive the
>>>> >>>last, partial, segment from the old timeline.
>>> >>
>>> >>
>>> >>So, this is what I came up with for master. Does anyone see a problem with
>>> >>it?
>> >
>> >What about the problem that I raised upthread? This is, the patch
>> >prevents the last, partial, WAL file of the old timeline from being archived.
>> >So we can never PITR the database to the point that the last, partial WAL
>> >file has.
> A partial WAL file is never archived in the master server to begin with,
> so if it's ever used in archive recovery, the administrator must have
> performed some manual action to copy the partial WAL file from the
> original server. When he does that, he can also copy it manually to the
> archive, or whatever he wants to do with it.
>
> Note that the same applies to any complete, but not-yet archived WAL
> files. But we've never had any mechanism in place to archive those in
> the new instance, after PITR.

Actually, I'll take back what I said above. I had misunderstood the 
current behavior. Currently, a server *does* archive any files that you 
copy manually to pg_xlog, after PITR has finished. Eventually. We don't 
create a .ready file for them until they're old enough to be recycled. 
We do create a .ready file for the last, partial, segment, but it's 
pretty weird to do it just for that, and not any other, complete, 
segments that might've been copied to pg_xlog. So what happens is that 
the last partial segment gets archived immediately after promotion, but 
any older segments will linger unarchived until much later.

The special treatment of the last partial segment still makes no sense. 
If we want the segments from the old timeline to be archived after PITR, 
we should archive them all immediately after end of recovery, not just 
the partial one. The exception for just the last partial segment is silly.

Now, the bigger question is whether we want the server after PITR to be 
responsible for archiving the segments from the old timeline at all. If 
we do, then we should remove the special treatment of the last, partial 
segment, and create the .ready files for all the complete segments too. 
And actually, I think we should *not* archive the partial segment. We 
don't normally archive partial segments, and all the WAL required to 
restore the server to new timeline is copied to the file with the new 
TLI. If the old timeline is still live, i.e. there's a server somewhere 
still writing new WAL on the old timeline, the partial segment will 
clash with a complete segment that the other server will archive later.

Yet another consideration is that we currently don't archive files 
streamed from the master. If we think that the standby server is 
responsible for archiving old segments after recovery, why is it not 
responsible for archiving the streamed segments? It's because in most 
cases, the master will archive the file, and we don't want two servers 
to archive the same file, but there is actually no guarantee on that. It 
might well be that the archiver runs a little bit behind in the master, 
and after crash the archive will miss some of the segments required. 
That's not good either.

I'm not sure what to do here. The current behavior is inconsistent, and 
there are a some nasty gotchas that would be nice to fix. I think 
someone needs to sit down and write a high-level design of how this all 
should work.

- Heikki

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Dennis Kögel

Date:

10 December 2014, 14:32:23

Hi,

Am 04.09.2014 um 17:50 schrieb Jehan-Guillaume de Rorthais <jgdr@dalibo.com>:
> Since few months, we occasionally see .ready files appearing on some slave
> instances from various context. The two I have in mind are under 9.2.x. […]
> So it seems for some reasons, these old WALs were "forgotten" by the
> restartpoint mechanism when they should have been recylced/deleted.

Am 08.10.2014 um 11:54 schrieb Heikki Linnakangas <hlinnakangas@vmware.com>:
> 1. Where do the FF files come from? In 9.2, FF-segments are not supposed to created, ever. […]
> 2. Why are the .done files sometimes not being created?



We’ve encountered behaviour which seems to match what has been described here: On Streaming Replication slaves, there
isan odd piling up of old WALs and .ready files in pg_xlog, going back several months. 

The fine people on IRC have pointed me to this thread, and have encouraged me to revive it with our observations, so
herewe go: 

Environment:

Master,      9.2.9
|- Slave S1, 9.2.9, on the same network as the master
'- Slave S2, 9.2.9, some 100 km away (occassional network hickups; *not* a cascading replication)

wal_keep_segments M=100 S1=100 S2=30
checkpoint_segments M=100 S1=30 S2=30
wal_level hot_standby (all)
archive_mode on (all)
archive_command on both slaves: /bin/true
archive_timeout 600s (all)


- On both slaves, we have „ghost“ WALs and corresponding .ready files (currently >600 of each on S2, slowly becoming a
diskspace problem) 

- There’s always gaps in the ghost WAL names, often roughly 0x20, but not always

- The slave with the „bad“ network link has significantly more of these files, which suggests that disturbances of the
StreamingReplication increase chances of triggering this bug; OTOH, the presence of a name gap pattern suggests the
opposite

- We observe files named *FF as well


As you can see in the directory listings below, this setup is *very* low traffic, which may explain the pattern in WAL
namegaps (?). 

I’ve listed the entries by time, expecting to easily match WALs to their .ready files.
There sometimes is an interesting delay between the WAL’s mtime and the .ready file — especially for *FF, where there’s
severaldays between the WAL and the .ready file. 

- Master:   http://pgsql.privatepaste.com/52ad612dfb
- Slave S1: http://pgsql.privatepaste.com/58b4f3bb10
- Slave S2: http://pgsql.privatepaste.com/a693a8d7f4


I’ve only skimmed through the thread; my understanding is that there were several patches floating around, but nothing
wascommitted. 
If there’s any way I can help, please let me know.

- D.

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Heikki Linnakangas

Date:

12 December 2014, 13:58:57

On 12/10/2014 04:32 PM, Dennis Kögel wrote:
> Hi,
>
> Am 04.09.2014 um 17:50 schrieb Jehan-Guillaume de Rorthais <jgdr@dalibo.com>:
>> Since few months, we occasionally see .ready files appearing on some slave
>> instances from various context. The two I have in mind are under 9.2.x. […]
>> So it seems for some reasons, these old WALs were "forgotten" by the
>> restartpoint mechanism when they should have been recylced/deleted.
>
> Am 08.10.2014 um 11:54 schrieb Heikki Linnakangas <hlinnakangas@vmware.com>:
>> 1. Where do the FF files come from? In 9.2, FF-segments are not supposed to created, ever. […]
>> 2. Why are the .done files sometimes not being created?
>
>
>
> We’ve encountered behaviour which seems to match what has been described here: On Streaming Replication slaves, there
isan odd piling up of old WALs and .ready files in pg_xlog, going back several months.

>
> The fine people on IRC have pointed me to this thread, and have encouraged me to revive it with our observations, so
herewe go:

>
> Environment:
>
> Master,      9.2.9
> |- Slave S1, 9.2.9, on the same network as the master
> '- Slave S2, 9.2.9, some 100 km away (occassional network hickups; *not* a cascading replication)
>
> wal_keep_segments M=100 S1=100 S2=30
> checkpoint_segments M=100 S1=30 S2=30
> wal_level hot_standby (all)
> archive_mode on (all)
> archive_command on both slaves: /bin/true
> archive_timeout 600s (all)
>
>
> - On both slaves, we have „ghost“ WALs and corresponding .ready files (currently >600 of each on S2, slowly becoming
adisk space problem)

>
> - There’s always gaps in the ghost WAL names, often roughly 0x20, but not always
>
> - The slave with the „bad“ network link has significantly more of these files, which suggests that disturbances of
theStreaming Replication increase chances of triggering this bug; OTOH, the presence of a name gap pattern suggests the
opposite
>
> - We observe files named *FF as well
>
>
> As you can see in the directory listings below, this setup is *very* low traffic, which may explain the pattern in
WALname gaps (?).

>
> I’ve listed the entries by time, expecting to easily match WALs to their .ready files.
> There sometimes is an interesting delay between the WAL’s mtime and the .ready file — especially for *FF, where
there’sseveral days between the WAL and the .ready file.

>
> - Master:   http://pgsql.privatepaste.com/52ad612dfb
> - Slave S1: http://pgsql.privatepaste.com/58b4f3bb10
> - Slave S2: http://pgsql.privatepaste.com/a693a8d7f4
>
>
> I’ve only skimmed through the thread; my understanding is that there were several patches floating around, but
nothingwas committed.

> If there’s any way I can help, please let me know.

Yeah. It wasn't totally clear how all this should work, so I got 
distracted with other stuff an dropped the ball; sorry.

I'm thinking that we should change the behaviour on master so that the 
standby never archives any files from older timelines, only the new one 
that it generates itself. That will solve the immediate problem of old 
WAL files accumulating, and bogus .ready files appearing in the standby. 
However, it will not solve the bigger problem of how do you ensure that 
all WAL files are archived, when you promote a standby server. There is 
no guarantee on that today anyway, but this will make it even less 
reliable, because it will increase the chances that you miss a file on 
the old timeline in the archive, after promoting. I'd argue that that's 
a good thing; it makes the issue more obvious, so you are more likely to 
encounter it in testing, and you won't be surprised in an emergency. But 
I've started a new thread on that bigger issue, hopefully we'll come up 
with a solution 
(http://www.postgresql.org/message-id/548AF1CB.80702@vmware.com).

Now, what do we do with the back-branches? I'm not sure. Changing the 
behaviour in back-branches could cause nasty surprises. Perhaps it's 
best to just leave it as it is, even though it's buggy.

- Heikki

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Guillaume Lelarge

Date:

31 December 2014, 07:44:29

2014-12-12 14:58 GMT+01:00 Heikki Linnakangas <hlinnakangas@vmware.com>:

On 12/10/2014 04:32 PM, Dennis Kögel wrote:
Hi,

Am 04.09.2014 um 17:50 schrieb Jehan-Guillaume de Rorthais <jgdr@dalibo.com>:
Since few months, we occasionally see .ready files appearing on some slave
instances from various context. The two I have in mind are under 9.2.x. […]
So it seems for some reasons, these old WALs were "forgotten" by the
restartpoint mechanism when they should have been recylced/deleted.

Am 08.10.2014 um 11:54 schrieb Heikki Linnakangas <hlinnakangas@vmware.com>:
1. Where do the FF files come from? In 9.2, FF-segments are not supposed to created, ever. […]
2. Why are the .done files sometimes not being created?

We’ve encountered behaviour which seems to match what has been described here: On Streaming Replication slaves, there is an odd piling up of old WALs and .ready files in pg_xlog, going back several months.

The fine people on IRC have pointed me to this thread, and have encouraged me to revive it with our observations, so here we go:

Environment:

Master, 9.2.9
|- Slave S1, 9.2.9, on the same network as the master
'- Slave S2, 9.2.9, some 100 km away (occassional network hickups; *not* a cascading replication)

wal_keep_segments M=100 S1=100 S2=30
checkpoint_segments M=100 S1=30 S2=30
wal_level hot_standby (all)
archive_mode on (all)
archive_command on both slaves: /bin/true
archive_timeout 600s (all)

- On both slaves, we have „ghost“ WALs and corresponding .ready files (currently >600 of each on S2, slowly becoming a disk space problem)

- There’s always gaps in the ghost WAL names, often roughly 0x20, but not always

- The slave with the „bad“ network link has significantly more of these files, which suggests that disturbances of the Streaming Replication increase chances of triggering this bug; OTOH, the presence of a name gap pattern suggests the opposite

- We observe files named *FF as well

As you can see in the directory listings below, this setup is *very* low traffic, which may explain the pattern in WAL name gaps (?).

I’ve listed the entries by time, expecting to easily match WALs to their .ready files.
There sometimes is an interesting delay between the WAL’s mtime and the .ready file — especially for *FF, where there’s several days between the WAL and the .ready file.

- Master: http://pgsql.privatepaste.com/52ad612dfb
- Slave S1: http://pgsql.privatepaste.com/58b4f3bb10
- Slave S2: http://pgsql.privatepaste.com/a693a8d7f4

I’ve only skimmed through the thread; my understanding is that there were several patches floating around, but nothing was committed.
If there’s any way I can help, please let me know.

Yeah. It wasn't totally clear how all this should work, so I got distracted with other stuff an dropped the ball; sorry.

I'm thinking that we should change the behaviour on master so that the standby never archives any files from older timelines, only the new one that it generates itself. That will solve the immediate problem of old WAL files accumulating, and bogus .ready files appearing in the standby. However, it will not solve the bigger problem of how do you ensure that all WAL files are archived, when you promote a standby server. There is no guarantee on that today anyway, but this will make it even less reliable, because it will increase the chances that you miss a file on the old timeline in the archive, after promoting. I'd argue that that's a good thing; it makes the issue more obvious, so you are more likely to encounter it in testing, and you won't be surprised in an emergency. But I've started a new thread on that bigger issue, hopefully we'll come up with a solution (http://www.postgresql.org/message-id/548AF1CB.80702@vmware.com).

Now, what do we do with the back-branches? I'm not sure. Changing the behaviour in back-branches could cause nasty surprises. Perhaps it's best to just leave it as it is, even though it's buggy.

As long as master is fixed, I don't actually care. But I agree with Dennis that it's hard to see what's been commited with all the different issues found, and if any commits were done, in which branch. I'd like to be able to tell my customers: update to this minor release to see if it's fixed, but I can't even do that.

Guillaume.

http://blog.guillaume.lelarge.info

http://www.dalibo.com

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Michael Paquier

Date:

31 December 2014, 10:40:23

On Wed, Dec 31, 2014 at 4:44 PM, Guillaume Lelarge
<guillaume@lelarge.info> wrote:
> 2014-12-12 14:58 GMT+01:00 Heikki Linnakangas <hlinnakangas@vmware.com>:
>> Now, what do we do with the back-branches? I'm not sure. Changing the
>> behaviour in back-branches could cause nasty surprises. Perhaps it's best to
>> just leave it as it is, even though it's buggy.
>>
>
> As long as master is fixed, I don't actually care. But I agree with Dennis
> that it's hard to see what's been commited with all the different issues
> found, and if any commits were done, in which branch. I'd like to be able to
> tell my customers: update to this minor release to see if it's fixed, but I
> can't even do that.
This bug does not endanger at all data consistency as only old WAL
files remain on the standby, so I'm fine as well with just a clean fix
on master, and nothing done on back-branches.
-- 
Michael

Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)

From

Dennis Kögel

Date:

02 January 2015, 11:38:12

Am 31.12.2014 um 11:40 schrieb Michael Paquier <michael.paquier@gmail.com>:
>> As long as master is fixed, I don't actually care. But I agree with Dennis
>> that it's hard to see what's been commited with all the different issues
>> found, and if any commits were done, in which branch. I'd like to be able to
>> tell my customers: update to this minor release to see if it's fixed, but I
>> can't even do that.
> This bug does not endanger at all data consistency as only old WAL
> files remain on the standby, so I'm fine as well with just a clean fix
> on master, and nothing done on back-branches.

I’d like to point out that this issue is causing severe disk space bloat over time, as described in my previous
posting.I’d assume that, as time goes by, this issue will pop up more and more, as long as back-branches remain
unfixed.

And as the general recommendation is to never ever fiddle with pg_xlog/ contents, I’m not even sure what to do about
it.

- D.