Re: WAL Re-Writes - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: WAL Re-Writes
Date
Msg-id CAA4eK1KG4pO5x5Z_Sum0u2FG66xowajz1qcoe=6+mY5_Q1x0+w@mail.gmail.com
In response to Re: WAL Re-Writes  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: WAL Re-Writes
List pgsql-hackers
On Wed, Feb 3, 2016 at 11:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Feb 1, 2016 at 8:05 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>>
>> On 1/31/16 3:26 PM, Jan Wieck wrote:
>>>
>>> On 01/27/2016 08:30 AM, Amit Kapila wrote:
>>>>
>>>> operation.  Now, the reason the OS couldn't find the corresponding block in
>>>> memory is that, while closing the WAL file, we use
>>>> POSIX_FADV_DONTNEED if wal_level is less than 'archive', which
>>>> leads to this problem.  So the conclusion from this experiment is that
>>>> although we can avoid re-writes of WAL data by doing exact writes,
>>>> it could lead to a significant reduction in TPS.
>>>
>>>
>>> POSIX_FADV_DONTNEED isn't the only way how those blocks would vanish
>>> from OS buffers. If I am not mistaken we recycle WAL segments in a round
>>> robin fashion. In a properly configured system, where the reason for a
>>> checkpoint is usually "time" rather than "xlog", a recycled WAL file
>>> written to had been closed and not touched for about a complete
>>> checkpoint_timeout or longer. You must have a really big amount of spare
>>> RAM in the machine to still find those blocks in memory. Basically we
>>> are talking about the active portion of your database, shared buffers,
>>> the sum of all process local memory and the complete pg_xlog directory
>>> content fitting into RAM.
>
>
>
> I think that could only be a problem if reads were happening at write or
> fsync time, but that is not the case here.  Further investigation on this
> point reveals that the reads are not for the fsync operation; rather, they
> happen when we call posix_fadvise(,,POSIX_FADV_DONTNEED).
> Although this behaviour (writing in non-OS-page-cache-size chunks can
> lead to reads if followed by a call to posix_fadvise
> (,,POSIX_FADV_DONTNEED)) is not very clearly documented, the
> reason for it is that the fadvise() call maps the specified data range
> (which in our case is the whole file) onto a list of pages and then
> invalidates them, which removes them from the OS cache.  Any
> misaligned (w.r.t. OS page size) writes done while writing/fsyncing the
> file can then cause additional reads, since not everything we write will
> fall on an OS-page boundary.
>

On further testing, I have observed that misaligned writes can cause
reads even when the file's blocks are not in memory, so I think what
Jan is describing is right.  The only case with absolutely zero chance
of reads is when we write on OS-page boundaries, which are generally
4K.  However, I still think it is okay to provide an option to write
WAL in smaller chunks (512 bytes, 1024 bytes, etc.) for the cases where
that is beneficial, such as when wal_level is 'archive' or higher, and
to keep the default at the OS page size if that is smaller than 8K.


With Regards,
Amit Kapila.
