Re: O_DIRECT for relations and SLRUs (Prototype) - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: O_DIRECT for relations and SLRUs (Prototype)
Date
Msg-id 20190113090216.GB6220@paquier.xyz
In response to Re: O_DIRECT for relations and SLRUs (Prototype)  (Thomas Munro <thomas.munro@enterprisedb.com>)
List pgsql-hackers
On Sun, Jan 13, 2019 at 10:35:55AM +1300, Thomas Munro wrote:
> 1.  We need a new "bgreader" process to do read-ahead.  I think you'd
> want a way to tell it with explicit hints (for example, perhaps
> sequential scans would advertise that they're reading sequentially so
> that it starts to slurp future blocks into the buffer pool, and
> streaming replicas might look ahead in the WAL and tell it what's
> coming).  In theory this might be better than the heuristics OSes use
> to guess our access pattern and pre-fetch into the page cache, since
> we have better information (and of course we're skipping a buffer
> layer).

Yes, that could be interesting, mainly for analytics, as we would be
able to target prefetching better than the OS readahead does.

> 2.  We need a new kind of bgwriter/syncer that aggressively creates
> clean pages so that foreground processes rarely have to evict (since
> that is now super slow), but also efficiently finds ranges of dirty
> blocks that it can write in big sequential chunks.

Okay, that's a new idea.  A bgwriter able to do syncs in chunks would
also be interesting with O_DIRECT, no?
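
To make the "big sequential chunks" part concrete, here is a minimal
sketch of coalescing a sorted list of dirty block numbers into
contiguous ranges that could each be written in one go.  This is only
an illustration and not code from the prototype patch: BlockRange and
coalesce_dirty_blocks() are hypothetical names, and the sketch assumes
the caller has already sorted and deduplicated the block numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type: a run of "count" consecutive blocks from "start". */
typedef struct BlockRange
{
	uint32_t	start;
	uint32_t	count;
} BlockRange;

/*
 * Merge sorted, duplicate-free block numbers into contiguous ranges.
 * "out" must have room for at least n entries.  Returns the number of
 * ranges stored in "out".
 */
size_t
coalesce_dirty_blocks(const uint32_t *blocks, size_t n, BlockRange *out)
{
	size_t		nranges = 0;

	for (size_t i = 0; i < n; i++)
	{
		/* the input contract: strictly ascending block numbers */
		assert(i == 0 || blocks[i] > blocks[i - 1]);

		if (nranges > 0 &&
			out[nranges - 1].start + out[nranges - 1].count == blocks[i])
		{
			/* block is adjacent to the current range, so extend it */
			out[nranges - 1].count++;
		}
		else
		{
			/* gap found, so start a new range */
			out[nranges].start = blocks[i];
			out[nranges].count = 1;
			nranges++;
		}
	}

	return nranges;
}
```

With dirty blocks 3, 4, 5, 9, 10 and 42, this yields three ranges
(3..5, 9..10 and 42), so six single-block writes could become three
larger ones.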

> 3.  We probably want SLRUs to use the main buffer pool, instead of
> their own mini-pools, so they can benefit from the above.

Wasn't there a thread about that on -hackers actually?  I cannot find
any reference to it.

> Whether we need multiple bgreader and bgwriter processes or perhaps a
> general IO scheduler process may depend on whether we also want to
> switch to async (multiplexing from a single process).  Starting simple
> with a traditional sync IO and N processes seems OK to me.

So you mean that we could just have a simple switch as a first step?
Or did I misunderstand you? :)

One of the reasons why I have begun this thread is that since we have
heard about the fsync issues on Linux, I think that there is room
for giving our user base more control of their fate without relying on
the Linux community's decisions to potentially eat data and corrupt a
cluster with a page dirty bit cleared without its data actually being
flushed.  Even the latest kernels are not fixing all the patterns with
open fds across processes, switching the problem from one corner of
the table to another, and there are folks patching the Linux kernel to
make Postgres more reliable from this perspective, and living happily
with this option.  As long as the option can be controlled and
defaults to false, it seems to me that we could do something.  Even if
the performance is bad, this gives users control of how they want
things to be done.
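
As a rough sketch of what "controlled and defaults to false" could
look like at the file-open level: the variable direct_io and the
helper data_file_open_flags() below are hypothetical names made up
for this illustration, not the names used in the prototype.  O_DIRECT
itself is a real open(2) flag on Linux, but it is not available on
every platform, hence the #ifdef.

```c
#define _GNU_SOURCE				/* glibc exposes O_DIRECT under this macro */
#include <fcntl.h>
#include <stdbool.h>

/* Hypothetical user-settable switch; off by default. */
bool		direct_io = false;

/*
 * Return the flags to pass to open(2) for a data file.  O_DIRECT is
 * added only when the switch is on and the platform defines the flag;
 * otherwise the base flags are returned unchanged.
 */
int
data_file_open_flags(int base_flags)
{
#ifdef O_DIRECT
	if (direct_io)
		return base_flags | O_DIRECT;
#endif
	return base_flags;
}
```

Note that this only covers the flag selection.  Reads and writes on a
descriptor opened with O_DIRECT generally need suitably aligned
buffers, offsets and lengths, which this sketch does not deal with.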
--
Michael
