Re: WAL prefetch - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: WAL prefetch
Msg-id 75102f8c-3659-2c74-c9fc-8fbf70d5b525@2ndquadrant.com
In response to Re: WAL prefetch  (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>)
Responses Re: WAL prefetch
List pgsql-hackers
On 06/19/2018 02:33 PM, Konstantin Knizhnik wrote:
> 
> On 19.06.2018 14:03, Tomas Vondra wrote:
>>
>> On 06/19/2018 11:08 AM, Konstantin Knizhnik wrote:
>>>
>>> ...
>>>
>>> Also, there are two points which make prefetching into shared buffers
>>> more complex:
>>> 1. We need to spawn multiple workers to prefetch in parallel and
>>> somehow distribute the work between them.
>>> 2. We need to synchronize the recovery process with the prefetchers, to
>>> prevent prefetching from running too far ahead and doing useless work.
>>> The same problem exists for prefetching into the OS cache, but there
>>> the risk of a false prefetch is less critical.
>>>
>>
>> I think the main challenge here is that all buffer reads are currently
>> synchronous (correct me if I'm wrong), while posix_fadvise() allows us
>> to prefetch the buffers asynchronously.
> 
> Yes, this is why we have to spawn several concurrent background workers
> to perform prefetch.

Right. My point is that while spawning bgworkers probably helps, I don't 
expect it to be enough to fill the I/O queues on modern storage systems. 
Even if you start, say, 16 prefetch bgworkers, that's not going to be
enough for large arrays or SSDs. Those typically need way more than 16 
requests in the queue.
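To make the synchronous/asynchronous distinction concrete: pread() puts the caller to sleep until the data arrives, while posix_fadvise(POSIX_FADV_WILLNEED) is only a hint that normally returns without waiting for the read to complete, so one process can have many reads in flight. A minimal C sketch, assuming a file of fixed 8kB blocks; BLCKSZ and the helper names are hypothetical, chosen for illustration only:

```c
#define _POSIX_C_SOURCE 200809L  /* for posix_fadvise() and pread() */
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192  /* assumed block size (PostgreSQL's default) */

/* Ask the kernel to start reading block 'blkno' into the page cache.
 * Normally returns without waiting for the read to finish; the return
 * value is 0 or an error number (errno is not set). */
static int prefetch_block(int fd, unsigned blkno)
{
    return posix_fadvise(fd, (off_t) blkno * BLCKSZ, BLCKSZ,
                         POSIX_FADV_WILLNEED);
}

/* Synchronous read: the caller sleeps until the data is available. */
static ssize_t read_block(int fd, unsigned blkno, char *buf)
{
    return pread(fd, buf, BLCKSZ, (off_t) blkno * BLCKSZ);
}

/* Hint all n blocks first, then read them one at a time. Because the
 * hints are issued up front, the kernel may have up to n reads in flight
 * even though this single process only ever waits on one pread(). */
static int read_blocks_with_prefetch(int fd, const unsigned *blocks, int n,
                                     char *buf)
{
    for (int i = 0; i < n; i++)
        (void) prefetch_block(fd, blocks[i]);  /* advisory; failures ignored */

    for (int i = 0; i < n; i++)
        if (read_block(fd, blocks[i], buf + (size_t) i * BLCKSZ) != BLCKSZ)
            return -1;  /* short read, e.g. block past EOF */
    return 0;
}
```

A bgworker doing only the synchronous pread() half contributes one outstanding request at a time, which is why the number of workers bounds the queue depth in that design.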

Consider for example [1] from 2014, where Merlin reported how the S3500
(an Intel SATA SSD) behaves with different effective_io_concurrency values:

[1] https://www.postgresql.org/message-id/CAHyXU0yiVvfQAnR9cyH=HWh1WbLRsioe=mzRJTHwtr=2azsTdQ@mail.gmail.com

Clearly, you need to prefetch 32/64 blocks or so. Consider that you may have
multiple such devices in a single RAID array, and that this device is 
from 2014 (and newer flash devices likely need even deeper queues).
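The queue-depth argument can be put into rough numbers with Little's law: requests in flight = throughput (IOPS) × latency. A worker doing synchronous reads has at most one request outstanding, so N such workers cap the depth at N, and striping over several devices multiplies the depth needed. The function names are hypothetical, and the inputs in the note below are illustrative assumptions, not figures taken from [1]:

```c
/*
 * Back-of-the-envelope queue-depth arithmetic (Little's law):
 *   requests in flight = throughput (IOPS) * latency (seconds)
 * All inputs are assumptions supplied by the caller.
 */

/* Outstanding requests needed to sustain target_iops when each request
 * takes latency_us microseconds, rounded up. */
static unsigned long long required_queue_depth(unsigned long long target_iops,
                                               unsigned long long latency_us)
{
    return (target_iops * latency_us + 999999ULL) / 1000000ULL;
}

/* nworkers synchronous readers keep at most nworkers requests in flight,
 * so they can sustain at most nworkers / latency IOPS. */
static unsigned long long max_iops_sync_workers(unsigned long long nworkers,
                                                unsigned long long latency_us)
{
    return nworkers * 1000000ULL / latency_us;
}

/* Keeping every device of a striped array busy needs the per-device
 * depth times the number of devices. */
static unsigned long long array_queue_depth(unsigned long long per_device_depth,
                                            unsigned long long ndevices)
{
    return per_device_depth * ndevices;
}
```

For example, with an assumed 100 µs per random read, 16 synchronous workers top out at 160,000 IOPS, a device hypothetically capable of 400,000 IOPS at that latency needs about 40 requests in flight, and four devices each wanting a depth of 64 need 256 outstanding requests in total.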

ISTM a small number of bgworkers is not going to be sufficient. It might 
be enough for WAL prefetching (where we may easily run into the 
redo-is-single-threaded bottleneck), but it's hardly a solution for 
bitmap heap scans, for example. We'll need to invent something else for 
that.

OTOH my guess is that whatever solution we'll end up implementing for 
bitmap heap scans, it will be applicable to WAL prefetching too, which
is why I'm suggesting that simply using posix_fadvise is not going to make
the direct I/O patch significantly more complicated.
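One possible shape for such a shared mechanism, sketched under stated assumptions: a single process consumes a stream of block numbers (decoded WAL records during redo, or the TID bitmap of a bitmap heap scan), issues an asynchronous hint for each, and never keeps more than a fixed number of hinted blocks ahead of the consumer. BlockPrefetcher, prefetcher_next(), the callback types and BLCKSZ are hypothetical names for illustration, not PostgreSQL APIs:

```c
#define _POSIX_C_SOURCE 200809L  /* for posix_fadvise() */
#include <fcntl.h>
#include <sys/types.h>

#define BLCKSZ 8192        /* assumed block size */
#define MAX_DISTANCE 256   /* upper bound on the lookahead window */

/* Hypothetical callbacks: produce the next needed block (1, or 0 at end),
 * and issue an asynchronous hint for one block. */
typedef int (*next_block_fn)(void *src, unsigned *blkno);
typedef void (*hint_fn)(void *ctx, unsigned blkno);

typedef struct BlockPrefetcher {
    next_block_fn next;
    void *src;
    hint_fn hint;
    void *hint_ctx;
    int distance;                 /* max blocks hinted ahead of the consumer */
    unsigned ring[MAX_DISTANCE];  /* hinted but not yet consumed, FIFO order */
    int head, count;
    int exhausted;
} BlockPrefetcher;

static void prefetcher_init(BlockPrefetcher *p, next_block_fn next, void *src,
                            hint_fn hint, void *hint_ctx, int distance)
{
    p->next = next;
    p->src = src;
    p->hint = hint;
    p->hint_ctx = hint_ctx;
    if (distance < 1) distance = 1;
    if (distance > MAX_DISTANCE) distance = MAX_DISTANCE;
    p->distance = distance;
    p->head = p->count = p->exhausted = 0;
}

/* Top up the window to 'distance' hinted blocks, then hand the oldest one
 * to the consumer (which reads it synchronously). Returns 0 when done. */
static int prefetcher_next(BlockPrefetcher *p, unsigned *blkno)
{
    while (!p->exhausted && p->count < p->distance) {
        unsigned b;
        if (!p->next(p->src, &b)) {
            p->exhausted = 1;
            break;
        }
        p->hint(p->hint_ctx, b);
        p->ring[(p->head + p->count) % MAX_DISTANCE] = b;
        p->count++;
    }
    if (p->count == 0)
        return 0;
    *blkno = p->ring[p->head];
    p->head = (p->head + 1) % MAX_DISTANCE;
    p->count--;
    return 1;
}

/* Example source: a precomputed list of blocks, e.g. from a TID bitmap. */
typedef struct ArraySource { const unsigned *blocks; int n, pos; } ArraySource;

static int array_next(void *src, unsigned *blkno)
{
    ArraySource *a = src;
    if (a->pos >= a->n)
        return 0;
    *blkno = a->blocks[a->pos++];
    return 1;
}

/* Example hint: posix_fadvise(WILLNEED) on the fd that ctx points to. */
static void fadvise_hint(void *ctx, unsigned blkno)
{
    (void) posix_fadvise(*(int *) ctx, (off_t) blkno * BLCKSZ, BLCKSZ,
                         POSIX_FADV_WILLNEED);
}
```

The distance cap is one way to handle the second point in the quoted message (prefetch running too far ahead and doing useless work), and because the hints are asynchronous, a single process keeps up to `distance` requests in flight without any extra workers. Deriving `distance` from a setting such as effective_io_concurrency seems plausible, but that is an assumption.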


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

