Re: Design proposal: fsync absorb linear slider - Mailing list pgsql-hackers

From KONDO Mitsumasa
Subject Re: Design proposal: fsync absorb linear slider
Date
Msg-id 51F60608.3020902@lab.ntt.co.jp
In response to Re: Design proposal: fsync absorb linear slider  (Greg Smith <greg@2ndQuadrant.com>)
Responses Re: Design proposal: fsync absorb linear slider
List pgsql-hackers
(2013/07/24 1:13), Greg Smith wrote:
> On 7/23/13 10:56 AM, Robert Haas wrote:
>> On Mon, Jul 22, 2013 at 11:48 PM, Greg Smith <greg@2ndquadrant.com> wrote:
>>> We know that a 1GB relation segment can take a really long time to write
>>> out.  That could include up to 128 changed 8K pages, and we allow all of
>>> them to get dirty before any are forced to disk with fsync.
>>
>> By my count, it can include up to 131,072 changed 8K pages.
>
> Even better!  I can pinpoint exactly what time last night I got tired enough to
> start making trivial mistakes.  Everywhere I said 128 it's actually 131,072,
> which just changes the range of the GUC I proposed.
I think this is almost the same as setting a small dirty_background_ratio or
dirty_background_bytes.
This method will perform very badly, and issuing many fsync() calls may cause the
long-fsync situation that you described in the past. My colleagues, who are kernel
experts, say that when another process writes heavily to the same file while
fsync() is executing, the fsync() call occasionally does not return. So issuing too
many fsync() calls on a large file is very dangerous. Moreover, fsync() also writes
metadata, which makes performance even worse.

The essential improvement is not the amount of dirty pages handled by each fsync()
but the scheduling of the fsync phase.
I can't understand why postgres does not consider scheduling the fsync phase. When
dirty_background_ratio is large, the write phase does not actually write anything
to disk, and therefore fsync() is too heavy in the fsync phase.


> Getting the number right really highlights just how bad the current situation
> is.  Would you expect the database to dump up to 128K writes into a file and then
> have low latency when it's flushed to disk with fsync?  Of course not.
I think this problem can be improved by using sync_file_range() in the fsync phase
and by adding checkpoint scheduling to the fsync phase: execute sync_file_range()
over a small range, sleep, repeat, and execute fsync() at the end. I think that is
better than your proposal.
If a system does not support the sync_file_range() system call, it only executes
fsync() and sleeps, which is the same as the methods that you and I posted in the
past.
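
As a rough illustration of the sequence described above, here is a minimal
sketch in C. It is not PostgreSQL code: the function name flush_in_ranges() and
its parameters (chunk size, sleep time) are hypothetical, and it assumes the
Linux sync_file_range() call with the SYNC_FILE_RANGE_WRITE flag. Where that
call is not available, it only sleeps and calls fsync().

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes sync_file_range() on Linux/glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (an assumption for illustration only): start
 * writeback of [0, file_size) in chunk_bytes pieces, sleeping sleep_ms
 * between pieces, then call fsync() once at the end.
 * Returns 0 on success, -1 on error with errno set.
 */
int
flush_in_ranges(int fd, off_t file_size, off_t chunk_bytes, long sleep_ms)
{
    struct timespec delay;

    if (chunk_bytes <= 0 || sleep_ms < 0)
    {
        errno = EINVAL;
        return -1;
    }
    delay.tv_sec = sleep_ms / 1000;
    delay.tv_nsec = (sleep_ms % 1000) * 1000000L;

#ifdef SYNC_FILE_RANGE_WRITE
    for (off_t off = 0; off < file_size; off += chunk_bytes)
    {
        off_t len = file_size - off;

        if (len > chunk_bytes)
            len = chunk_bytes;
        assert(len > 0);

        /* Ask the kernel to start writing this range; do not wait for it. */
        if (sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE) != 0)
        {
            if (errno == ENOSYS)
                break;          /* not supported at run time: fsync() only */
            return -1;
        }
        nanosleep(&delay, NULL);    /* spread the I/O over time */
    }
#else
    /* No sync_file_range() on this platform: just sleep, then fsync(). */
    nanosleep(&delay, NULL);
#endif

    /* The final fsync() is what actually guarantees durability. */
    return fsync(fd);
}
```

For a 1GB segment one might call flush_in_ranges(fd, 1024L * 1024 * 1024,
8L * 1024 * 1024, 10), but suitable chunk and sleep values are an open question
and would have to come from the checkpoint schedule.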

Taken together, my proposed checkpoint method is:

* write phase
  - Almost the same as now, but taking the fsync phase schedule into account.
  - Considering the case of background writes by the OS, sort the buffers before
    starting the checkpoint write.

* fsync phase
  - Take the checkpoint schedule and the write-phase schedule into account.
  - Execute sync_file_range() on separate ranges with sleeps in between, and
    fsync() at the end.
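
The buffer sorting mentioned for the write phase could be sketched as below.
This is only an illustration under assumptions: the DirtyBuf struct, its
fields, and the choice of (file, block number) as the sort key are
hypothetical and are not taken from PostgreSQL or from this proposal.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical description of one dirty buffer to be written. */
typedef struct DirtyBuf
{
    uint32_t    file_id;        /* which relation segment file (assumed key) */
    uint32_t    block_no;       /* block offset within that file */
} DirtyBuf;

/* Order by file first, then by block, so writes to a file are sequential. */
static int
dirty_buf_cmp(const void *a, const void *b)
{
    const DirtyBuf *x = a;
    const DirtyBuf *y = b;

    if (x->file_id != y->file_id)
        return x->file_id < y->file_id ? -1 : 1;
    if (x->block_no != y->block_no)
        return x->block_no < y->block_no ? -1 : 1;
    return 0;
}

/* Sort the to-be-written buffers before the checkpoint write begins. */
void
sort_dirty_buffers(DirtyBuf *bufs, size_t n)
{
    assert(bufs != NULL || n == 0);
    if (n > 1)
        qsort(bufs, n, sizeof(DirtyBuf), dirty_buf_cmp);
}
```

Writing in this order keeps the blocks of one file together, so that whatever
the OS writes back in the background tends to be contiguous.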
 

And if possible, I would also like a method that does not write buffers to a file
on which fsync() is currently being called.
I think that may be quite difficult.

Best regards,
--
Mitsumasa KONDO
NTT Open Source Software Center




