Re: O_DIRECT for relations and SLRUs (Prototype) - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: O_DIRECT for relations and SLRUs (Prototype)
Date
Msg-id CAEepm=09HOQ7cDpcOqWfuYV-rL54=V9s2jkVrD2FG+ELLE9zQw@mail.gmail.com
In response to Re: O_DIRECT for relations and SLRUs (Prototype)  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
On Sun, Jan 13, 2019 at 10:02 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Sun, Jan 13, 2019 at 10:35:55AM +1300, Thomas Munro wrote:
> > 1.  We need a new "bgreader" process to do read-ahead.  I think you'd
> > want a way to tell it with explicit hints (for example, perhaps
> > sequential scans would advertise that they're reading sequentially so
> > that it starts to slurp future blocks into the buffer pool, and
> > streaming replicas might look ahead in the WAL and tell it what's
> > coming).  In theory this might be better than the heuristics OSes use
> > to guess our access pattern and pre-fetch into the page cache, since
> > we have better information (and of course we're skipping a buffer
> > layer).
>
> Yes, that could be interesting mainly for analytics by being able to
> snipe better than the OS readahead.
>
> > 2.  We need a new kind of bgwriter/syncer that aggressively creates
> > clean pages so that foreground processes rarely have to evict (since
> > that is now super slow), but also efficiently finds ranges of dirty
> > blocks that it can write in big sequential chunks.
>
> Okay, that's a new idea.  A bgwriter able to do syncs in chunks would
> be also interesting with O_DIRECT, no?

Well, I'm just describing the work that the OS currently does for us
in another layer.  Evicting a dirty buffer currently means a buffered
pwrite(), and we can do a huge number of those per second (given
enough spare RAM).  With O_DIRECT | O_SYNC we'll be limited by the
storage device's random IOPS, so workloads that regularly evict dirty
buffers in foreground processes will suffer.  The bgwriter should make
sure we can always find a clean buffer without waiting.

Yeah, I think issuing pwrite() calls larger than 8KB at a time would
be a goal, so that large IO requests make it all the way down to the
storage.
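
As a rough illustration of writing more than 8KB at a time, here is a
minimal sketch that groups dirty pages with consecutive block numbers
and issues one pwritev() per run.  It uses pwritev() on the assumption
that pages adjacent on disk are usually not adjacent in memory.
DirtyPage, dirty_run_length(), flush_dirty_pages() and MAX_RUN are
hypothetical names for this example, not PostgreSQL code, and the
sketch leaves out what O_DIRECT would additionally need (suitably
aligned buffers, retrying short writes).

```c
#define _GNU_SOURCE             /* exposes pwritev() on glibc */
#include <assert.h>
#include <sys/types.h>
#include <sys/uio.h>

#define BLCKSZ  8192            /* PostgreSQL's default block size */
#define MAX_RUN 16              /* hypothetical cap: 16 blocks = 128KB per write */

/* Hypothetical stand-in for a buffer header: one dirty page of one file. */
typedef struct DirtyPage
{
    unsigned    blockno;        /* block number within the file */
    char       *data;           /* BLCKSZ bytes, anywhere in memory */
} DirtyPage;

/*
 * Length of the run of consecutive block numbers starting at pages[start],
 * capped at MAX_RUN.  'pages' must be sorted by blockno, without duplicates.
 */
int
dirty_run_length(const DirtyPage *pages, int npages, int start)
{
    int         len = 1;

    assert(start >= 0 && start < npages);
    while (start + len < npages && len < MAX_RUN &&
           pages[start + len].blockno == pages[start].blockno + len)
        len++;
    return len;
}

/*
 * Write out all pages, one pwritev() call per run of adjacent blocks.
 * Returns the number of write calls issued, or -1 on error.  A short
 * write is treated as an error here to keep the sketch small.
 */
int
flush_dirty_pages(int fd, const DirtyPage *pages, int npages)
{
    int         nwrites = 0;

    for (int i = 0; i < npages;)
    {
        struct iovec iov[MAX_RUN];
        int         len = dirty_run_length(pages, npages, i);
        off_t       offset = (off_t) pages[i].blockno * BLCKSZ;

        for (int j = 0; j < len; j++)
        {
            iov[j].iov_base = pages[i + j].data;
            iov[j].iov_len = BLCKSZ;
        }
        if (pwritev(fd, iov, len, offset) != (ssize_t) len * BLCKSZ)
            return -1;
        nwrites++;
        i += len;
    }
    return nwrites;
}
```

With dirty blocks 3, 4, 5, 9 and 10, this issues two writes (24KB and
16KB) instead of five 8KB writes.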

> > 3.  We probably want SLRUs to use the main buffer pool, instead of
> > their own mini-pools, so they can benefit from the above.
>
> Wasn't there a thread about that on -hackers actually?  I cannot see
> any reference to it.

https://www.postgresql.org/message-id/flat/20180814213500.GA74618%4060f81dc409fc.ant.amazon.com

> > Whether we need multiple bgreader and bgwriter processes or perhaps a
> > general IO scheduler process may depend on whether we also want to
> > switch to async (multiplexing from a single process).  Starting simple
> > with a traditional sync IO and N processes seems OK to me.
>
> So you mean that we could just have a simple switch as a first step?
> Or I misunderstood you :)

I just meant that if we take over all the read-ahead and write-behind
work and use classic synchronous IO syscalls like pread()/pwrite(),
we'll probably need multiple processes to do it, depending on how much
IO concurrency the storage layer can take.
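
To make the "N processes doing synchronous IO" idea concrete, here is
a small sketch in which several forked workers each issue blocking
pread() calls for their share of a file's blocks, so that up to
nworkers reads can be in flight at once.  read_with_workers() is a
hypothetical name, and the workers read into private scratch memory
only to keep the example short; a real bgreader would presumably read
into shared buffers.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define BLCKSZ 8192             /* PostgreSQL's default block size */

/*
 * Hypothetical helper: read blocks [0, nblocks) of fd using nworkers child
 * processes.  Worker w reads the blocks with blockno % nworkers == w.
 * pread() takes an explicit offset, so the workers can share one file
 * descriptor without disturbing each other's file position.
 *
 * Returns the number of workers that read all of their blocks, or -1 if
 * not all workers could be started.
 */
int
read_with_workers(int fd, unsigned nblocks, int nworkers)
{
    int         started = 0;
    int         ok = 0;

    assert(nworkers > 0);

    for (int w = 0; w < nworkers; w++)
    {
        pid_t       pid = fork();

        if (pid < 0)
        {
            perror("fork");
            break;
        }
        if (pid == 0)
        {
            /* Child: blocking reads, one block at a time. */
            char       *buf = malloc(BLCKSZ);

            if (buf == NULL)
                _exit(1);
            for (unsigned b = (unsigned) w; b < nblocks; b += (unsigned) nworkers)
            {
                if (pread(fd, buf, BLCKSZ, (off_t) b * BLCKSZ) != BLCKSZ)
                    _exit(1);
            }
            _exit(0);
        }
        started++;
    }

    /* Parent: reap every worker that was started and count the successes. */
    for (int i = 0; i < started; i++)
    {
        int         status;

        if (wait(&status) > 0 && WIFEXITED(status) && WEXITSTATUS(status) == 0)
            ok++;
    }
    return (started == nworkers) ? ok : -1;
}
```

The right value for nworkers would depend on how much concurrency the
storage can absorb, which is exactly the open question above.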

-- 
Thomas Munro
http://www.enterprisedb.com

