Re: WAL prefetch - Mailing list pgsql-hackers

From: Tomas Vondra
Subject: Re: WAL prefetch
Msg-id: 6830ae55-5a93-acdf-1fdd-58fce53f7f27@2ndquadrant.com
In response to: Re: WAL prefetch (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>)
List: pgsql-hackers
On 06/19/2018 06:34 PM, Konstantin Knizhnik wrote:
> 
> 
> On 19.06.2018 18:50, Andres Freund wrote:
>> On 2018-06-19 12:08:27 +0300, Konstantin Knizhnik wrote:
>>> I do not think that prefetching into shared buffers requires much more
>>> effort, or makes the patch more invasive...
>>> It even simplifies it somewhat, because there is no need to maintain a
>>> separate cache of prefetched pages...
>>> But it will definitely have a much bigger impact on Postgres performance:
>>> contention for buffer locks, evicting pages accessed by read-only
>>> queries, ...
>> These arguments seem bogus to me. Otherwise the startup process is going
>> to do that work.
> 
> There is just one process replaying WAL. Certainly it has some impact on
> hot standby query execution.
> But if there are several prefetch workers (128???) then this impact
> will increase dramatically.
> 

The goal of prefetching is to saturate the storage better, which means
less bandwidth remaining for other processes (that have to compete for
the same storage). I don't think "startup process is going to do that
work" is entirely true - it'd do that work, but likely over a longer
period of time.

But I don't think this is an issue - I'd expect a GUC defining how many
records to prefetch ahead (just like effective_io_concurrency).
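
Something like this, perhaps (just a sketch - the GUC name and limits
are made up; the struct is the usual config_int entry in guc.c):

    /* hypothetical GUC controlling how many WAL records to look ahead */
    int wal_prefetch_distance = 0;    /* 0 disables prefetching */

    /* entry for the ConfigureNamesInt array in src/backend/utils/misc/guc.c */
    {
        {"wal_prefetch_distance", PGC_SIGHUP, WAL_SETTINGS,
            gettext_noop("Number of WAL records to look ahead, prefetching "
                         "blocks they reference during recovery."),
            NULL
        },
        &wal_prefetch_distance,
        0, 0, 1024,
        NULL, NULL, NULL
    },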

>>> Concerning WAL prefetch, I still have serious doubts whether it is
>>> needed at all: if the checkpoint interval is smaller than the amount
>>> of free memory on the system, then the redo process should not read
>>> much.
>> I'm confused. Didn't you propose this?  FWIW, there's a significant
>> number of installations where people have observed this problem in
>> practice.
> 
> Well, originally it was proposed by Sean - the author of pg_prefaulter.
> I just ported it from Go to C, using the standard PostgreSQL WAL iterator.
> Then I performed some measurements and didn't find any dramatic
> improvement in performance (in the case of synchronous replication) or
> reduction of replication lag for asynchronous replication, neither on my
> desktop (SSD, 16GB RAM, local replication within the same machine,
> pgbench scale 1000), nor on a pair of powerful servers connected by
> InfiniBand with 3TB NVMe (pgbench with scale 100000).
> Also, I noticed that the read rate at the replica is almost zero.
> What does that mean:
> 1. I am doing something wrong.
> 2. posix_fadvise is not so efficient.
> 3. pgbench is not the right workload to demonstrate the effect of prefetch.
> 4. The hardware I am using is not typical.
> 
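
Regarding (2) - on Linux this presumably boils down to
posix_fadvise(POSIX_FADV_WILLNEED), the same thing smgrprefetch() issues
for PrefetchBuffer(). That's merely a hint to the kernel, with no
guarantee when (or whether) the read actually happens. Per referenced
block it amounts to roughly this (a sketch, assuming the default 8kB
block size and an already-opened file descriptor):

    #include <fcntl.h>    /* posix_fadvise, POSIX_FADV_WILLNEED */

    /* hint the kernel to read one 8kB block ahead of the redo process */
    (void) posix_fadvise(fd, (off_t) blkno * 8192, 8192, POSIX_FADV_WILLNEED);

So how much it helps depends entirely on how the kernel schedules those
reads.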

pgbench is a perfectly sufficient workload to demonstrate the issue - all
you need to do is use a sufficiently large scale factor (say, 2x RAM) and
a large number of clients, to generate writes on the primary (to actually
saturate the storage). Then the redo on the replica won't be able to keep
up, because it only fetches one page at a time.
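
For example, something like this (made-up numbers for a box with 64GB of
RAM - one pgbench scale unit is roughly 15MB, so ~8500 gives ~2x RAM):

    pgbench -i -s 8500 bench              # initialize ~130GB of data
    pgbench -c 32 -j 8 -T 3600 bench      # saturate the primary with writes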

> So it makes me wonder when such prefetch may be needed... And it raises
> new questions:
> I wonder how often the checkpoint interval is much larger than the OS cache?

Pretty often. Furthermore, replicas may also run queries (often large
ones), pushing pages needed by the redo out of RAM.

> If we enforce full page writes (let's say after each 1GB of WAL), how
> does it affect WAL size and performance?
> 

It would improve redo performance, of course, exactly because the page 
would not need to be loaded from disk. But the amount of WAL can 
increase tremendously, causing issues for network bandwidth 
(particularly between different data centers).
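
Back-of-the-envelope (made-up numbers): a plain heap update without an
FPI logs maybe ~100 bytes, while a full-page image is ~8kB, i.e. an ~80x
difference per record. So on a workload where most updates currently skip
the FPI, forcing one every 1GB of LSN distance could easily inflate the
WAL volume by an order of magnitude.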

> It looks like it is difficult to answer the second question without
> implementing a prototype.
> Maybe I will try to do that.

Perhaps you should prepare some examples of workloads demonstrating the
issue, before trying to implement a solution.

>>> And if the checkpoint interval is much larger than the OS cache (are
>>> there cases when that is really needed?)
>> Yes, there are.  The percentage of FPWs can cause serious problems, as do
>> repeated writeouts by the checkpointer.
> 
> One more consideration: data is written to disk in blocks in any
> case. If you update just a few bytes on a page, the whole page still
> has to be written to the database file.
> So avoiding full page writes reduces the WAL size and the amount of
> data written to the WAL, but not the amount of data written to the
> database itself.
> It means that if we completely eliminate FPWs and transactions are
> updating random pages, then disk traffic is reduced by less than a
> factor of two...
> 

I don't follow. What do you mean by "less than two times"? Surely the 
difference can be anything between 0 and infinity, depending on how 
often you write a single page.
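
Back-of-the-envelope (8kB pages, ~100-byte update records): if each page
is touched exactly once per checkpoint, dropping the FPI turns ~8kB of
WAL plus an 8kB data-file write into ~100 bytes plus 8kB - total traffic
roughly halves, which is presumably your "less than two times". But if
the same page is updated a hundred times per checkpoint, only the first
update carries an FPI anyway, so eliminating it saves almost nothing. And
looking at the WAL stream alone, the first case is an ~80x reduction.
Which of those numbers matters depends entirely on the workload.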

The other problem with just doing FPIs all the time is backups. To do
physical backups / WAL archival, you need to store all the WAL segments.
If the amount of WAL increases 10x, you're going to be unhappy.

>>
>>
>>> then a quite small patch (as it seems to me now), forcing a full page
>>> write when the distance between the page LSN and the current WAL
>>> insertion point exceeds some threshold, should eliminate random reads
>>> in this case as well.
>> I'm pretty sure that'll hurt a significant number of installations
>> that set the checkpoint timeout high just so they can avoid FPWs.
> Maybe, but I am not so sure. This is why I will try to investigate it
> more.
> 

I'd say checkpoints already act as such a timeout (and not only that -
people do set checkpoint_timeout high precisely to get rid of FPIs).
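
FWIW I assume the check you propose would look roughly like this (a
sketch - fpw_distance_threshold is a hypothetical GUC, while PageGetLSN()
and GetXLogInsertRecPtr() exist today):

    /* force an FPI if the page was last WAL-logged too long ago */
    static bool
    NeedsForcedFullPageWrite(Page page)
    {
        XLogRecPtr  page_lsn = PageGetLSN(page);
        XLogRecPtr  insert_ptr = GetXLogInsertRecPtr();

        return (insert_ptr - page_lsn) > fpw_distance_threshold;
    }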

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

