Re: WAL Re-Writes - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: WAL Re-Writes
Date
Msg-id CAA4eK1KG4pO5x5Z_Sum0u2FG66xowajz1qcoe=6+mY5_Q1x0+w@mail.gmail.com
In response to Re: WAL Re-Writes  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: WAL Re-Writes
List pgsql-hackers
On Wed, Feb 3, 2016 at 11:12 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Feb 1, 2016 at 8:05 PM, Jim Nasby <Jim.Nasby@bluetreble.com> wrote:
>>
>> On 1/31/16 3:26 PM, Jan Wieck wrote:
>>>
>>> On 01/27/2016 08:30 AM, Amit Kapila wrote:
>>>>
>>>> operation.  Now, the reason the OS couldn't find the corresponding block in
>>>> memory is that, while closing the WAL file, we use
>>>> POSIX_FADV_DONTNEED if wal_level is less than 'archive', which
>>>> leads to this problem.  So the conclusion from this experiment is that
>>>> although we can avoid re-writes of WAL data by doing exact writes,
>>>> it could lead to a significant reduction in TPS.
>>>
>>>
>>> POSIX_FADV_DONTNEED isn't the only way how those blocks would vanish
>>> from OS buffers. If I am not mistaken we recycle WAL segments in a round
>>> robin fashion. In a properly configured system, where the reason for a
>>> checkpoint is usually "time" rather than "xlog", a recycled WAL file
>>> written to had been closed and not touched for about a complete
>>> checkpoint_timeout or longer. You must have a really big amount of spare
>>> RAM in the machine to still find those blocks in memory. Basically we
>>> are talking about the active portion of your database, shared buffers,
>>> the sum of all process local memory and the complete pg_xlog directory
>>> content fitting into RAM.
>
>
>
> I think that could only be a problem if reads were happening at write or
> fsync time, but that is not the case here.  Further investigation on this
> point reveals that the reads are not for the fsync operation; rather, they
> happen when we call posix_fadvise(,,POSIX_FADV_DONTNEED).
> Although this behaviour (writing in non-OS-page-cache-size chunks can
> lead to reads if followed by a call to posix_fadvise
> (,,POSIX_FADV_DONTNEED)) is not very clearly documented, the
> reason for it is that the fadvise() call maps the specified data range
> (which in our case is the whole file) onto a list of pages and then
> invalidates them, which removes them from the OS cache.  Any
> misaligned (w.r.t. OS page size) writes done while writing/fsyncing the
> file can then cause additional reads, since not everything we write will
> fall on an OS-page boundary.
>

On further testing, I have observed that misaligned writes can cause
reads even when the file's blocks are not in memory, so I think what
Jan is describing is right.  The only case with absolutely zero chance
of reads is when we write on OS-page boundaries, which are generally
4K.  However, I still think it is okay to provide an option to write
WAL in smaller chunks (512 bytes, 1024 bytes, etc.) for the cases where
that is beneficial, such as when wal_level is 'archive' or higher, and
to keep the default at the OS page size if that is smaller than 8K.


With Regards,
Amit Kapila.
