Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves) - Mailing list pgsql-hackers

From Fujii Masao
Subject Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)
Msg-id CAHGQGwEo8PuVjXc1=ym96yuz2YCpsmOkfythDi6inQEbUzD7RA@mail.gmail.com
In response to Re: BUG: *FF WALs under 9.2 (WAS: .ready files appearing on slaves)  (Heikki Linnakangas <hlinnakangas@vmware.com>)
List pgsql-hackers
On Thu, Oct 23, 2014 at 5:09 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 10/23/2014 08:59 AM, Fujii Masao wrote:
>>
>> On Mon, Oct 20, 2014 at 3:26 PM, Michael Paquier
>> <michael.paquier@gmail.com> wrote:
>>>
>>> On Fri, Oct 17, 2014 at 10:37 PM, Michael Paquier
>>> <michael.paquier@gmail.com> wrote:
>>>>
>>>>
>>>> On Fri, Oct 17, 2014 at 9:23 PM, Fujii Masao <masao.fujii@gmail.com>
>>>> wrote:
>>>>>
>>>>>
>>>>> In this case, the patch seems to make the restartpoint recycle even WAL
>>>>> files which have .ready files and will have to be archived later.
>>>>> Thoughts?
>>>>
>>>>
>>>> The real problem currently is that it is possible to have a segment file
>>>> not marked as .done during recovery when stream connection is abruptly cut
>>>> when this segment is switched, marking it as .ready in archive_status and
>>>> simply letting this segment in pg_xlog because it will neither be recycled
>>>> nor removed. I have not been able to look much at this code these days, so I
>>>> am not sure how invasive it would be in back-branches, but perhaps we should
>>>> try to improve the code so that when a segment file is switched and the
>>>> connection is cut, we guarantee that this file is completed and marked as .done.
>>>
>>>
>>> I have spent more time on that, with a bit more of underground work...
>>> First, the problem can be reproduced most of the time by running this
>>> simple command:
>>> psql -c 'select pg_switch_xlog()'; pg_ctl restart -m immediate
>>
>>
>> What about fixing this problem directly? That is, we can make walreceiver
>> check whether the end of the last received WAL data is the end of the
>> current WAL file, and if so, close the WAL file and create the .done file.
>>
>> This is not a perfect solution. If the standby crashes during the very
>> short interval after closing the WAL file and before creating the .done
>> file, the problem would happen again. But that can happen only rarely, so
>> I don't think it's worth fixing the corner case, at least in back-branches.
>> Of course, we can find the "perfect" solution for master, though.
>
>
> Sounds reasonable, for back-branches. Although I'm still worried we might
> miss some corner-case unless we go with a more wholesale solution.

+1
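
To make the back-branch idea concrete, here is a rough, standalone sketch of
the check I have in mind. It is not walreceiver code: the function names are
made up for illustration, it assumes the default 16 MB segment size, and it
treats a WAL position as a flat 64-bit byte offset (as in 9.3 and later, so
it ignores the skipped *FF segments of 9.2 and earlier).

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: the default 16 MB WAL segment size. */
#define WAL_SEG_SIZE ((uint64_t) 16 * 1024 * 1024)

/*
 * Hypothetical helper: true when recv_end (the end of the last received
 * WAL data, as a flat byte offset) falls exactly on a segment boundary,
 * i.e. the segment currently being written is complete.
 */
static bool
at_segment_end(uint64_t recv_end)
{
    return recv_end != 0 && recv_end % WAL_SEG_SIZE == 0;
}

/*
 * Hypothetical helper: build the path of the .done marker for a segment
 * file name. Returns the snprintf() result.
 */
static int
done_marker_path(char *buf, size_t len, const char *segname)
{
    return snprintf(buf, len, "pg_xlog/archive_status/%s.done", segname);
}
```

The receiver would call at_segment_end() after each write and, when it
returns true, close the file and create the marker at the path built by
done_marker_path(). The window between those two steps is the rare corner
case mentioned above.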

> At least for master, we should consider changing the way the archiving works
> so that we only archive WAL that was generated in the same server. I.e. we
> should never try to archive WAL files belonging to another timeline.
>
> I just remembered that we discussed a different problem related to this some
> time ago, at
> http://www.postgresql.org/message-id/20131212.110002.204892575.horiguchi.kyotaro@lab.ntt.co.jp.
> The conclusion of that was that at promotion, we should not archive the
> last, partial, segment from the old timeline.

So, the last, partial segment of the old timeline is never archived?
If so, I'm afraid that PITR to the old timeline cannot replay the
last segment. No? Or are you thinking of changing the code so that
the segment of the new timeline is replayed in that case?

> In summary, let's do something small for back-branches, like what you
> suggested. But for master, let's do bigger changes to the timeline handling.

Yep.
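
As a starting point for the master-side discussion, the "never archive WAL
from another timeline" rule could look roughly like the sketch below. It
relies only on the fact that the first 8 hex digits of a 24-character WAL
segment file name are the timeline ID; the function names and the
current_tli parameter are hypothetical, not existing archiver code.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: extract the timeline ID from a WAL segment file
 * name (24 upper-case hex digits, the first 8 being the timeline).
 * Returns false for anything else, e.g. timeline history files.
 */
static bool
wal_file_timeline(const char *fname, uint32_t *tli)
{
    unsigned int parsed;

    if (strlen(fname) != 24 || strspn(fname, "0123456789ABCDEF") != 24)
        return false;
    if (sscanf(fname, "%8X", &parsed) != 1)
        return false;
    *tli = (uint32_t) parsed;
    return true;
}

/*
 * Hypothetical policy check: archive a segment only if it belongs to
 * the timeline this server is currently writing.
 */
static bool
should_archive(const char *fname, uint32_t current_tli)
{
    uint32_t tli;

    return wal_file_timeline(fname, &tli) && tli == current_tli;
}
```

For example, with current_tli = 2 this accepts 000000020000000000000005 and
rejects 000000010000000000000005. It does not address the PITR question
above, which would still need an answer before going this way.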

Regards,

-- 
Fujii Masao


