Thread: Speed up the removal of WAL files

Speed up the removal of WAL files

From
"Tsunakawa, Takayuki"
Date:
Hello,

The attached patch speeds up the removal of WAL files in the old timelines.  I'll add this to the next CF.


BACKGROUND
==================================================

We need to meet a severe availability requirement of a potential customer.  They will use synchronous streaming
replication. The allowed failover duration, from the failure through failure detection to the failover completion, is
10seconds.  Even one second is precious.
 

During a testing on a fast machine with SSD, we observed about 2 seconds between these messages.  There were no other
messagesbetween them.
 

LOG:  archive recovery complete
LOG:  MultiXact member wraparound protections are now enabled


CAUSE
==================================================

Examining the source code, RemoveNonParentXlogFiles() seems to account for the time.  It syncs pg_wal directory every
timeit deletes a WAL file.  max_wal_size was set to 48GB, so about 1,000 WAL files were probably deleted and hence the
pg_waldirectory was synced as much.
 


FIX
==================================================

unlink() the WAL files, then sync the pg_wal directory once at the end.

Unfortunately, the original machine is now not available, so I confirmed the speedup on a VM with HDD.

[time to remove 1,000 WAL files including the directory sync]
nonpatched: 2.45 seconds
patched:    0.81 seconds


Regards
Takayuki Tsunakawa


Attachment

Re: Speed up the removal of WAL files

From
Kyotaro HORIGUCHI
Date:
Hello,

At Fri, 17 Nov 2017 06:35:41 +0000, "Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> wrote in
<0A3221C70F24FB45833433255569204D1F81B0C8@G01JPEXMBYT05>
> Hello,
> 
> The attached patch speeds up the removal of WAL files in the old timelines.  I'll add this to the next CF.
> 
> 
> BACKGROUND
> ==================================================
> 
> We need to meet a severe availability requirement of a potential customer.  They will use synchronous streaming
replication. The allowed failover duration, from the failure through failure detection to the failover completion, is
10seconds.  Even one second is precious.
 
> 
> During a testing on a fast machine with SSD, we observed about 2 seconds between these messages.  There were no other
messagesbetween them.
 
> 
> LOG:  archive recovery complete
> LOG:  MultiXact member wraparound protections are now enabled
> 
> 
> CAUSE
> ==================================================
> 
> Examining the source code, RemoveNonParentXlogFiles() seems to account for the time.  It syncs pg_wal directory every
timeit deletes a WAL file.  max_wal_size was set to 48GB, so about 1,000 WAL files were probably deleted and hence the
pg_waldirectory was synced as much.
 
> 
> 
> FIX
> ==================================================
> 
> unlink() the WAL files, then sync the pg_wal directory once at the end.
> 
> Unfortunately, the original machine is now not available, so I confirmed the speedup on a VM with HDD.
> 
> [time to remove 1,000 WAL files including the directory sync]
> nonpatched: 2.45 seconds
> patched:    0.81 seconds
> 
> 
> Regards
> Takayuki Tsunakawa

The orinal code recycles some of the to-be-removed files, but the
patch removes all the victims.  This may impact on performance.

Likewise the original code is using durable_unlink to actually
remove a file so separating unlink and fsync might resurrect the
problem that should have been fixed by
1b02be21f271db6bd3cd43abb23fa596fcb6bac3 (I'm not sure what it
was but you are one of the reviwers of it). I suppose that you
need to explain the reason why this change doesn't risk anything.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



RE: Speed up the removal of WAL files

From
"Tsunakawa, Takayuki"
Date:
From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
> The orinal code recycles some of the to-be-removed files, but the patch
> removes all the victims.  This may impact on performance.

Yes, I noticed it after submitting the patch and was wondering what to do.  Thinking simply, I think it would be just
enoughto replace durable_unlink/durable_rename in RemoveXLogFile() with unlink/rename, and sync the pg_wal directory
oncein RemoveNonParentXlogFiles() and RemoveOldXlogFiles().  This will benefit the failover time when fast promotion is
notperformed.  What do you think?
 

BTW, RemoveNonParentXlogFiles() recycles only 10 WAL files and delete all other files.  So the impact of modification
islimited.
 


> Likewise the original code is using durable_unlink to actually remove a
> file so separating unlink and fsync might resurrect the problem that should
> have been fixed by
> 1b02be21f271db6bd3cd43abb23fa596fcb6bac3 (I'm not sure what it was but you
> are one of the reviwers of it). I suppose that you need to explain the reason
> why this change doesn't risk anything.

It's safe because the directory is finally synced.  If the database server crashes before it deletes all WAL files, it
performsthe deletion again during the next recovery.
 

Regards
Takayuki Tsunakawa






Re: Speed up the removal of WAL files

From
Fujii Masao
Date:
On Fri, Nov 17, 2017 at 5:20 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
>> The orinal code recycles some of the to-be-removed files, but the patch
>> removes all the victims.  This may impact on performance.
>
> Yes, I noticed it after submitting the patch and was wondering what to do.  Thinking simply, I think it would be just
enoughto replace durable_unlink/durable_rename in RemoveXLogFile() with unlink/rename, and sync the pg_wal directory
oncein RemoveNonParentXlogFiles() and RemoveOldXlogFiles().  This will benefit the failover time when fast promotion is
notperformed.  What do you think? 

It seems not good idea to just replace durable_rename() with rename()
in RemoveOldXlogFiles()->RemoveXlogFiles()->InstallXLogFileSegment().
Because that change seems to be able to cause the following problem.

1. Checkpoint calls RemoveOldXlogFiles().
2. It recycles the WAL file AAA to BBB. pg_wal directory has not fsync'd yet.
3. Another transaction (TX1) writes its WAL data into the (recycled) file BBB.
4. CRASH and RESTART
5. The WAL file BBB disappears and you can see AAA,   but AAA is not used in recovery. This causes data loss of
transaction by Tx1.

Regards,

--
Fujii Masao


Re: Speed up the removal of WAL files

From
Michael Paquier
Date:
On Sat, Nov 18, 2017 at 2:57 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> On Fri, Nov 17, 2017 at 5:20 PM, Tsunakawa, Takayuki
> <tsunakawa.takay@jp.fujitsu.com> wrote:
>> From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
>>> The orinal code recycles some of the to-be-removed files, but the patch
>>> removes all the victims.  This may impact on performance.
>>
>> Yes, I noticed it after submitting the patch and was wondering what to do.  Thinking simply, I think it would be
justenough to replace durable_unlink/durable_rename in RemoveXLogFile() with unlink/rename, and sync the pg_wal
directoryonce in RemoveNonParentXlogFiles() and RemoveOldXlogFiles().  This will benefit the failover time when fast
promotionis not performed.  What do you think? 

Note that durable_rename() also flushes the origin file to make sure
that it does not show up again after a crash.

> It seems not good idea to just replace durable_rename() with rename()
> in RemoveOldXlogFiles()->RemoveXlogFiles()->InstallXLogFileSegment().
> Because that change seems to be able to cause the following problem.

If archiving is enabled, RemoveOldXlogFiles() would create a .ready
flag for all segments that have reappeared. Oops, it archived again.
--
Michael


RE: Speed up the removal of WAL files

From
"Tsunakawa, Takayuki"
Date:
Thanks, it seems to require a bit more consideration about RemoveOldXLogFiles().  Let me continue this next month.


Regards
Takayuki Tsunakawa


> -----Original Message-----
> From: Michael Paquier [mailto:michael.paquier@gmail.com]
> Sent: Saturday, November 18, 2017 10:37 PM
> To: Fujii Masao
> Cc: Tsunakawa, Takayuki/綱川 貴之; Kyotaro HORIGUCHI;
> pgsql-hackers@postgresql.org
> Subject: Re: Speed up the removal of WAL files
> 
> On Sat, Nov 18, 2017 at 2:57 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> > On Fri, Nov 17, 2017 at 5:20 PM, Tsunakawa, Takayuki
> > <tsunakawa.takay@jp.fujitsu.com> wrote:
> >> From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
> >>> The orinal code recycles some of the to-be-removed files, but the
> >>> patch removes all the victims.  This may impact on performance.
> >>
> >> Yes, I noticed it after submitting the patch and was wondering what to
> do.  Thinking simply, I think it would be just enough to replace
> durable_unlink/durable_rename in RemoveXLogFile() with unlink/rename, and
> sync the pg_wal directory once in RemoveNonParentXlogFiles() and
> RemoveOldXlogFiles().  This will benefit the failover time when fast
> promotion is not performed.  What do you think?
> 
> Note that durable_rename() also flushes the origin file to make sure that
> it does not show up again after a crash.
> 
> > It seems not good idea to just replace durable_rename() with rename()
> > in RemoveOldXlogFiles()->RemoveXlogFiles()->InstallXLogFileSegment().
> > Because that change seems to be able to cause the following problem.
> 
> If archiving is enabled, RemoveOldXlogFiles() would create a .ready flag
> for all segments that have reappeared. Oops, it archived again.


RE: Speed up the removal of WAL files

From
"Tsunakawa, Takayuki"
Date:
From: Fujii Masao [mailto:masao.fujii@gmail.com]
> On Fri, Nov 17, 2017 at 5:20 PM, Tsunakawa, Takayuki
> <tsunakawa.takay@jp.fujitsu.com> wrote:
> > Yes, I noticed it after submitting the patch and was wondering what to
> do.  Thinking simply, I think it would be just enough to replace
> durable_unlink/durable_rename in RemoveXLogFile() with unlink/rename, and
> sync the pg_wal directory once in RemoveNonParentXlogFiles() and
> RemoveOldXlogFiles().  This will benefit the failover time when fast
> promotion is not performed.  What do you think?
> 
> It seems not good idea to just replace durable_rename() with rename() in
> RemoveOldXlogFiles()->RemoveXlogFiles()->InstallXLogFileSegment().
> Because that change seems to be able to cause the following problem.
> 
> 1. Checkpoint calls RemoveOldXlogFiles().
> 2. It recycles the WAL file AAA to BBB. pg_wal directory has not fsync'd
> yet.
> 3. Another transaction (TX1) writes its WAL data into the (recycled) file
> BBB.
> 4. CRASH and RESTART
> 5. The WAL file BBB disappears and you can see AAA,
>     but AAA is not used in recovery. This causes data loss of transaction
> by Tx1.

Right.  Then I thought of doing the following to avoid making a new function only for RemoveNonParentXlogFiles() which
issimilar to RemoveXlogFile().
 

* Add an argument "bool durable" to RemoveXlogFile().  Based on its value, RemoveXlogFile() calls either durable_xx()
orxx().
 

* Likewise, add an argument "bool durable" to InstallXLogFileSegment() and do the same.

* RemoveNonParentXlogFiles() calls RemoveXlogFile() with durable=false.  At the end of the function, sync the pg_wal
directorywith fsync_fname().
 

* RemoveOldXlogFiles() does the same thing.  One difference is that it passes false to RemoveXlogFile() during recovery
(InRecovery== true) and true otherwise.
 

But this requires InstallXLogFileSegment() to also have an argument "bool durable."  That means
InstallXLogFileSegment()and RemoveXlogFile() have to include something like below in several places, which is dirty:
 

    if (durable)
        rc = durable_xx(path, elevel);
    else
        rc = xx(path);
    if (rc != 0)
    {
        /* durable_xx() output the message */
        if (!durable)
            ereport(elevel, ...);
    }

Do you think of a cleaner way?


From: Kyotaro HORIGUCHI [mailto:horiguchi.kyotaro@lab.ntt.co.jp]
> The orinal code recycles some of the to-be-removed files, but the
> patch removes all the victims.  This may impact on performance.

At most only 10 WAL files are recycled.  If there's no good solution to the above matter, can we compromise with the
currentpatch?
 

    if (PriorRedoPtr == InvalidXLogRecPtr)
        recycleSegNo = endlogSegNo + 10;

> Likewise the original code is using durable_unlink to actually
> remove a file so separating unlink and fsync might resurrect the
> problem that should have been fixed by
> 1b02be21f271db6bd3cd43abb23fa596fcb6bac3 (I'm not sure what it
> was but you are one of the reviwers of it). I suppose that you
> need to explain the reason why this change doesn't risk anything.

If the host crashes while running RemoveNonParentXlogFiles() and some effects of the function are lost, the next
recoverywill do them again.
 

Regards
Takayuki Tsunakawa




Re: Speed up the removal of WAL files

From
Michael Paquier
Date:
On Wed, Feb 21, 2018 at 07:20:00AM +0000, Tsunakawa, Takayuki wrote:
> Right.  Then I thought of doing the following to avoid making a new
> function only for RemoveNonParentXlogFiles() which is similar to
> RemoveXlogFile().
>
> * Add an argument "bool durable" to RemoveXlogFile().  Based on its
> value, RemoveXlogFile() calls either durable_xx() or xx().
>
> * Likewise, add an argument "bool durable" to InstallXLogFileSegment()
> and do the same.
>
> * RemoveNonParentXlogFiles() calls RemoveXlogFile() with
> durable=false.  At the end of the function, sync the pg_wal directory
> with fsync_fname().
>
> * RemoveOldXlogFiles() does the same thing.  One difference is that it
> passes false to RemoveXlogFile() during recovery (InRecovery == true)
> and true otherwise.

It seems to me that you would reintroduce partially the problems that
1d4a0ab1 has fixed.  In short, if a crash happens in the code paths
calling RemoveXlogFile with durable = false before fsync'ing pg_wal,
then a rename has no guarantee to be durable, so you could finish again
with a file that as an old name, but new contents.  A crucial thing
which matters for a rename to be durable is that the old file is sync'ed
as well.
--
Michael

Attachment

RE: Speed up the removal of WAL files

From
"Tsunakawa, Takayuki"
Date:
From: Michael Paquier [mailto:michael@paquier.xyz]
It seems to me that you would reintroduce partially the problems that
> 1d4a0ab1 has fixed.  In short, if a crash happens in the code paths calling
> RemoveXlogFile with durable = false before fsync'ing pg_wal, then a rename
> has no guarantee to be durable, so you could finish again with a file that
> as an old name, but new contents.  A crucial thing which matters for a rename

Hmm, you're right.  Even during recovery, RemoveOldXlogFiles() can't skip fsyncing pg_wal/ because new WAL records
streamedfrom the master are written to recycled WAL files.
 

After all, it seems to me that we have to stand with the current patch which only handles RemoveNonParentXlogFiles().


Regards
Takayuki Tsunakawa





Re: Speed up the removal of WAL files

From
Fujii Masao
Date:
On Wed, Feb 21, 2018 at 5:27 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:
> From: Michael Paquier [mailto:michael@paquier.xyz]
> It seems to me that you would reintroduce partially the problems that
>> 1d4a0ab1 has fixed.  In short, if a crash happens in the code paths calling
>> RemoveXlogFile with durable = false before fsync'ing pg_wal, then a rename
>> has no guarantee to be durable, so you could finish again with a file that
>> as an old name, but new contents.  A crucial thing which matters for a rename
>
> Hmm, you're right.  Even during recovery, RemoveOldXlogFiles() can't skip fsyncing pg_wal/ because new WAL records
streamedfrom the master are written to recycled WAL files.
 
>
> After all, it seems to me that we have to stand with the current patch which only handles
RemoveNonParentXlogFiles().

But the approach that the patch uses would cause the performance problem
as Horiguchi-san pointed out upthread.

So, what about, as another approach, making the checkpointer instead of
the startup process call RemoveNonParentXlogFiles() when end-of-recovery
checkpoint is executed? ISTM that a recovery doesn't need to wait for
RemoveNonParentXlogFiles() to end. Instead, RemoveNonParentXlogFiles()
seems to have to complete before the checkpointer calls RemoveOldXlogFiles()
and creates .ready files for the "garbage" WAL files on the old timeline.
So it seems natual to leave that WAL recycle task to the checkpointer.

Regards,

-- 
Fujii Masao


Re: Speed up the removal of WAL files

From
Michael Paquier
Date:
On Wed, Mar 07, 2018 at 12:55:43AM +0900, Fujii Masao wrote:
> So, what about, as another approach, making the checkpointer instead of
> the startup process call RemoveNonParentXlogFiles() when end-of-recovery
> checkpoint is executed? ISTM that a recovery doesn't need to wait for
> RemoveNonParentXlogFiles() to end. Instead, RemoveNonParentXlogFiles()
> seems to have to complete before the checkpointer calls RemoveOldXlogFiles()
> and creates .ready files for the "garbage" WAL files on the old timeline.
> So it seems natual to leave that WAL recycle task to the checkpointer.

Couldn't that impact the I/O performance at the end of recovery until
the first post-recovery checkpoint is completed?  Let's not forget that
since 9.3 the end-of-recovery checkpoint is not triggered immediately,
so there could be a delay.  If WAL segments of the past timeline are
recycled without waiting for this first checkpoint to happen then there
is no need to create new, zero-emptied, segments post-recovery, which
can count as well.
--
Michael

Attachment

RE: Speed up the removal of WAL files

From
"Tsunakawa, Takayuki"
Date:
From: Michael Paquier [mailto:michael@paquier.xyz]
> On Wed, Mar 07, 2018 at 12:55:43AM +0900, Fujii Masao wrote:
> > So, what about, as another approach, making the checkpointer instead
> > of the startup process call RemoveNonParentXlogFiles() when
> > end-of-recovery checkpoint is executed? ISTM that a recovery doesn't
> > need to wait for
> > RemoveNonParentXlogFiles() to end. Instead, RemoveNonParentXlogFiles()
> > seems to have to complete before the checkpointer calls
> > RemoveOldXlogFiles() and creates .ready files for the "garbage" WAL files
> on the old timeline.
> > So it seems natual to leave that WAL recycle task to the checkpointer.
> 
> Couldn't that impact the I/O performance at the end of recovery until the
> first post-recovery checkpoint is completed?  Let's not forget that since
> 9.3 the end-of-recovery checkpoint is not triggered immediately, so there
> could be a delay.  If WAL segments of the past timeline are recycled without
> waiting for this first checkpoint to happen then there is no need to create
> new, zero-emptied, segments post-recovery, which can count as well.

Good point.  I understood you referred to PreallocXlogFiles(), which may create one new WAL file if
RemoveNonParentXlogFiles()is not called or does not recycle WAL files in the old timeline.
 

The best hack (or a compromise/kludge?) seems to be:

1. Modify durable_xx() functions so that they don't fsync directory hanges when enableFsync is false.

2. RemoveNonParentXlogFiles() sets enableFsync to false before the while loop, restores the original value of it after
thewhile loop, and fsync pg_wal/ just once.
 
What do you think?


Regards
Takayuki Tsunakawa





Re: Speed up the removal of WAL files

From
Michael Paquier
Date:
On Wed, Mar 07, 2018 at 06:15:24AM +0000, Tsunakawa, Takayuki wrote:
> Good point.  I understood you referred to PreallocXlogFiles(), which
> may create one new WAL file if RemoveNonParentXlogFiles() is not
> called or does not recycle WAL files in the old timeline.
>
> The best hack (or a compromise/kludge?) seems to be:
>
> 1. Modify durable_xx() functions so that they don't fsync directory
> hanges when enableFsync is false.
>
> 2. RemoveNonParentXlogFiles() sets enableFsync to false before the
> while loop, restores the original value of it after the while loop,
> and fsync pg_wal/ just once.   What do you think?

Hm.  durable_xx should remain a sane operation as an isolated call as
you still get the same problem if a crash happens before flushing the
parent...  Fujii-san idea also has value to speed up the end of recovery
but this costs as well in extra recycling operations post promotion.  If
the checkpoint was to happen a the end of recovery then that would be
more logic, but we don't for performance reasons.  Let's continue to
discuss on this thread.  If you have any patch to offer, let's also look
at them.

Anyway, as things are pretty much idle on this thread for a couple of
days and that we are still discussing potential ideas, I think that this
entry should be marked as returned with feedback.  Thoughts?
--
Michael

Attachment

RE: Speed up the removal of WAL files

From
"Tsunakawa, Takayuki"
Date:
From: Michael Paquier [mailto:michael@paquier.xyz]
> Hm.  durable_xx should remain a sane operation as an isolated call as you
> still get the same problem if a crash happens before flushing the parent...
> Fujii-san idea also has value to speed up the end of recovery but this costs
> as well in extra recycling operations post promotion.  If the checkpoint
> was to happen a the end of recovery then that would be more logic, but we
> don't for performance reasons.  Let's continue to discuss on this thread.
> If you have any patch to offer, let's also look at them.
> 
> Anyway, as things are pretty much idle on this thread for a couple of days
> and that we are still discussing potential ideas, I think that this entry
> should be marked as returned with feedback.  Thoughts?

OK, I moved this to the next CF.  Thank you for your cooperation.



Regards
Takayuki Tsunakawa




Re: Speed up the removal of WAL files

From
Fujii Masao
Date:
On Wed, Feb 21, 2018 at 4:52 PM, Michael Paquier <michael@paquier.xyz> wrote:
> On Wed, Feb 21, 2018 at 07:20:00AM +0000, Tsunakawa, Takayuki wrote:
>> Right.  Then I thought of doing the following to avoid making a new
>> function only for RemoveNonParentXlogFiles() which is similar to
>> RemoveXlogFile().
>>
>> * Add an argument "bool durable" to RemoveXlogFile().  Based on its
>> value, RemoveXlogFile() calls either durable_xx() or xx().
>>
>> * Likewise, add an argument "bool durable" to InstallXLogFileSegment()
>> and do the same.
>>
>> * RemoveNonParentXlogFiles() calls RemoveXlogFile() with
>> durable=false.  At the end of the function, sync the pg_wal directory
>> with fsync_fname().
>>
>> * RemoveOldXlogFiles() does the same thing.  One difference is that it
>> passes false to RemoveXlogFile() during recovery (InRecovery == true)
>> and true otherwise.
>
> It seems to me that you would reintroduce partially the problems that
> 1d4a0ab1 has fixed.  In short, if a crash happens in the code paths
> calling RemoveXlogFile with durable = false before fsync'ing pg_wal,
> then a rename has no guarantee to be durable, so you could finish again
> with a file that as an old name, but new contents.

ISTM that this trouble will never happen because basically no one writes
new contents in the renamed WAL files while RemoveNonParentXlogFiles()
is renaming the WAL files at the end of recovery. No?

Regards,

-- 
Fujii Masao


Re: Speed up the removal of WAL files

From
Michael Paquier
Date:
On Fri, Mar 09, 2018 at 12:50:03AM +0000, Tsunakawa, Takayuki wrote:
> From: Michael Paquier [mailto:michael@paquier.xyz]
>> Hm.  durable_xx should remain a sane operation as an isolated call as you
>> still get the same problem if a crash happens before flushing the parent...
>> Fujii-san idea also has value to speed up the end of recovery but this costs
>> as well in extra recycling operations post promotion.  If the checkpoint
>> was to happen a the end of recovery then that would be more logic, but we
>> don't for performance reasons.  Let's continue to discuss on this thread.
>> If you have any patch to offer, let's also look at them.
>>
>> Anyway, as things are pretty much idle on this thread for a couple of days
>> and that we are still discussing potential ideas, I think that this entry
>> should be marked as returned with feedback.  Thoughts?
>
> OK, I moved this to the next CF.  Thank you for your cooperation.

The patch is still roughly with this status, so I am marking it as
returned with feedback.
--
Michael

Attachment