Re: WAL prefetch - Mailing list pgsql-hackers

From Konstantin Knizhnik
Subject Re: WAL prefetch
Date
Msg-id 6e47f1fe-5bd3-0190-c3c1-69b8a291dd26@postgrespro.ru
In response to Re: WAL prefetch  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: WAL prefetch
List pgsql-hackers

On 15.06.2018 07:36, Amit Kapila wrote:
> On Fri, Jun 15, 2018 at 12:16 AM, Stephen Frost <sfrost@snowman.net> wrote:
>>> I have tested wal_prefetch at two powerful servers with 24 cores, 3Tb NVME
>>> RAID 10 storage device and 256Gb of RAM connected using InfiniBand.
>>> The speed of synchronous replication between two nodes is increased from 56k
>>> TPS to 60k TPS (on pgbench with scale 1000).
>> I'm also surprised that it wasn't a larger improvement.
>>
>> Seems like it would make sense to implement in core using
>> posix_fadvise(), perhaps in the wal receiver and in RestoreArchivedFile
>> or nearby..  At least, that's the thinking I had when I was chatting w/
>> Sean.
>>
> Doing it in-core certainly has some advantages, such as it can easily reuse
> the existing xlog code rather than trying to make a copy as is currently
> done in the patch, but I think it also depends on whether this is
> really a win in a number of common cases or is it just a win in some
> limited cases.
>
I completely agree. That was my main concern: in which use cases will this
prefetch actually be efficient?
If "full_page_writes" is on (and it is safe and default value), then 
first update of a page since last checkpoint will be written in WAL as 
full page and applying it will not require reading any data from disk. 
If this pages is updated multiple times in subsequent transactions, then 
most likely it will be still present in OS file cache, unless checkpoint 
interval exceeds OS cache size (amount of free memory in the system). So 
if this conditions are satisfied then looks like prefetch is not needed. 
And it seems to be true for most real configurations: checkpoint 
interval is rarely set larger than hundred of gigabytes and modern 
servers usually have more RAM.
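
Just to make the mechanism itself concrete, below is a minimal sketch of
the prefetch primitive being discussed (this is not the code of the
extension; the relation path and block number are made up, and a real
prefetcher would take them from the block references of a decoded WAL
record, skipping blocks that carry a full page image):

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define BLCKSZ 8192                     /* PostgreSQL block size */

/* Hint the kernel to start reading one block of a relation file, so the
 * redo process later finds it in the OS page cache. */
static int
prefetch_block(const char *relpath, unsigned int blkno)
{
    int         fd = open(relpath, O_RDONLY);
    int         rc;

    if (fd < 0)
        return -1;

    /* POSIX_FADV_WILLNEED initiates asynchronous readahead. */
    rc = posix_fadvise(fd, (off_t) blkno * BLCKSZ, BLCKSZ,
                       POSIX_FADV_WILLNEED);
    close(fd);
    return rc;
}

int
main(void)
{
    /* Made-up relation file and block number, for illustration only. */
    if (prefetch_block("base/12345/16384", 42) != 0)
        perror("prefetch_block");
    return 0;
}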

But once this condition is not satisfied and the lag is larger than the
size of the OS cache, prefetch can become inefficient, because prefetched
pages may be evicted from the OS cache before they are actually accessed
by the redo process. In this case extra synchronization between the
prefetch and replay processes is needed, so that prefetch does not move
too far ahead of the replayed LSN.
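
Such throttling could look roughly like this (just a sketch: LSNs are
treated as plain 64-bit byte positions, and the names, stubs and 4GB
limit are illustrative, not taken from the extension):

#include <stdint.h>
#include <unistd.h>

typedef uint64_t lsn_t;

/* How far prefetch may run ahead of replay; should stay well below the
 * amount of free memory usable as OS page cache. */
#define MAX_PREFETCH_DISTANCE   ((lsn_t) 4 * 1024 * 1024 * 1024)   /* 4GB */

/* Stub: the real worker would read the last replayed LSN published by the
 * startup process (e.g. via GetXLogReplayRecPtr()). */
static lsn_t
get_last_replayed_lsn(void)
{
    return 0;
}

/* Stub: decode the next WAL record, posix_fadvise() the blocks it
 * references, and advance *prefetch_lsn past the record. */
static int
prefetch_next_record(lsn_t *prefetch_lsn)
{
    *prefetch_lsn += 64;
    return 0;
}

/* Keep the prefetch position close enough to replay that prefetched pages
 * are not evicted from the OS cache before redo uses them. */
void
prefetch_loop(lsn_t start_lsn)
{
    lsn_t       prefetch_lsn = start_lsn;

    for (;;)
    {
        lsn_t       replayed = get_last_replayed_lsn();

        if (prefetch_lsn > replayed &&
            prefetch_lsn - replayed > MAX_PREFETCH_DISTANCE)
        {
            usleep(10000);      /* too far ahead: back off 10ms, re-check */
            continue;
        }

        if (prefetch_next_record(&prefetch_lsn) != 0)
            break;              /* no more WAL available to prefetch */
    }
}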

It is not a problem to integrate this code into Postgres core and run it
in a background worker. I do not think that performing prefetch in the WAL
receiver process itself is a good idea: it may slow down the speed of
receiving changes from the master. And in that case I really could throw
away the cut&pasted code. But it is easier to experiment with an extension
than with a patch to Postgres core, and I have published this extension to
make it possible to perform experiments and check whether it is useful on
real workloads.
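
For the background worker variant, the registration itself is
straightforward; something along these lines (only a sketch: the library
and function names are hypothetical, it assumes the module is loaded via
shared_preload_libraries, and the start time / restart policy would need
more thought):

#include "postgres.h"
#include "fmgr.h"
#include "postmaster/bgworker.h"

PG_MODULE_MAGIC;

void        _PG_init(void);
PGDLLEXPORT void wal_prefetch_main(Datum main_arg);

void
_PG_init(void)
{
    BackgroundWorker worker;

    memset(&worker, 0, sizeof(worker));
    snprintf(worker.bgw_name, BGW_MAXLEN, "wal prefetch worker");
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS;
    /* Start early so prefetch can run while the startup process
     * replays WAL. */
    worker.bgw_start_time = BgWorkerStart_PostmasterStart;
    worker.bgw_restart_time = 5;                        /* seconds */
    snprintf(worker.bgw_library_name, BGW_MAXLEN, "wal_prefetch");
    snprintf(worker.bgw_function_name, BGW_MAXLEN, "wal_prefetch_main");

    RegisterBackgroundWorker(&worker);
}

/* Worker entry point: read WAL ahead of the replay position and issue
 * posix_fadvise() for the referenced blocks. */
void
wal_prefetch_main(Datum main_arg)
{
    BackgroundWorkerUnblockSignals();

    /* ... prefetch loop, throttled against the replayed LSN ... */
}

The nice property of this approach is that the prefetcher runs in its own
process, so the WAL receiver is not involved and receiving changes from
the master is unaffected.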


-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


