Thread: O_DIRECT for relations and SLRUs (Prototype)
Hi all,
(Added Kevin in CC.)

There have been discussions over the years about getting better O_DIRECT support to close the gap with other players in the database market, but I have not actually seen on these lists a patch which makes use of O_DIRECT for relations and SLRUs (perhaps I missed it; anyway such a patch would most likely conflict now). Attached is a toy patch that I have begun using for tests in this area. It is nothing really serious at this stage, but you can use it if you would like to see the impact of O_DIRECT. Of course, things get significantly slower. The patch compiles, passes regression tests, and looks stable, so it is usable for experiments. The patch uses a GUC called direct_io, set to true to ease regression testing when applying it. Note that pg_attribute_aligned() cannot be used, as that's not an option with clang and a couple of other compilers as far as I know, so the patch uses a simple set of placeholder buffers large enough to be aligned with the OS pages — the alignment should be 4kB on Linux by the way, and not BLCKSZ — but for WAL's O_DIRECT we don't really care much about such details.

If there is interest in such things, perhaps we could get a patch sorted out, with some angles of attack like:
- Move to page-aligned buffers for relations and SLRUs.
- Split the use of O_DIRECT for SLRUs and relations into separate GUCs.
- Perhaps other things.

However this is a large and very controversial topic, and of course more complex than the experiment attached; still, this prototype is fun to play with.

Thanks for reading!
--
Michael
Attachment
Hi!

> On 12 Jan 2019, at 9:46, Michael Paquier <michael@paquier.xyz> wrote:
>
> Attached is a toy patch that I have begun using for tests in this
> area. That's nothing really serious at this stage, but you can use
> that if you would like to see the impact of O_DIRECT. Of course,
> things get significantly slower.

Cool! I've just gathered a group of students to task them with experimenting with shared buffer eviction algorithms during their February internship at the Yandex-Sirius edu project. Your patch seems very handy for benchmarks in this area. Thanks!

Best regards, Andrey Borodin.
On Sun, Jan 13, 2019 at 5:13 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> Hi!
>
> > On 12 Jan 2019, at 9:46, Michael Paquier <michael@paquier.xyz> wrote:
> >
> > Attached is a toy patch that I have begun using for tests in this
> > area. That's nothing really serious at this stage, but you can use
> > that if you would like to see the impact of O_DIRECT. Of course,
> > things get significantly slower.
>
> Cool!
> I've just gathered a group of students to task them with experimenting with shared buffer eviction algorithms during their February internship at the Yandex-Sirius edu project. Your patch seems very handy for benchmarks in this area.

+1, thanks for sharing the patch. Even though just turning on O_DIRECT is the trivial part of this project, it's good to encourage discussion. We may indeed become more sensitive to the quality of buffer eviction algorithms, but it seems like the main work to regain lost performance will be the background IO scheduling piece:

1. We need a new "bgreader" process to do read-ahead. I think you'd want a way to tell it with explicit hints (for example, perhaps sequential scans would advertise that they're reading sequentially so that it starts to slurp future blocks into the buffer pool, and streaming replicas might look ahead in the WAL and tell it what's coming). In theory this might be better than the heuristics OSes use to guess our access pattern and pre-fetch into the page cache, since we have better information (and of course we're skipping a buffer layer).

2. We need a new kind of bgwriter/syncer that aggressively creates clean pages so that foreground processes rarely have to evict (since that is now super slow), but also efficiently finds ranges of dirty blocks that it can write in big sequential chunks.

3. We probably want SLRUs to use the main buffer pool, instead of their own mini-pools, so they can benefit from the above.

Whether we need multiple bgreader and bgwriter processes or perhaps a general IO scheduler process may depend on whether we also want to switch to async (multiplexing from a single process). Starting simple with traditional sync IO and N processes seems OK to me.
--
Thomas Munro
http://www.enterprisedb.com
On Sun, Jan 13, 2019 at 10:35:55AM +1300, Thomas Munro wrote:
> 1. We need a new "bgreader" process to do read-ahead. I think you'd
> want a way to tell it with explicit hints (for example, perhaps
> sequential scans would advertise that they're reading sequentially so
> that it starts to slurp future blocks into the buffer pool, and
> streaming replicas might look ahead in the WAL and tell it what's
> coming). In theory this might be better than the heuristics OSes use
> to guess our access pattern and pre-fetch into the page cache, since
> we have better information (and of course we're skipping a buffer
> layer).

Yes, that could be interesting mainly for analytics, by being able to snipe better than the OS readahead.

> 2. We need a new kind of bgwriter/syncer that aggressively creates
> clean pages so that foreground processes rarely have to evict (since
> that is now super slow), but also efficiently finds ranges of dirty
> blocks that it can write in big sequential chunks.

Okay, that's a new idea. A bgwriter able to do syncs in chunks would also be interesting with O_DIRECT, no?

> 3. We probably want SLRUs to use the main buffer pool, instead of
> their own mini-pools, so they can benefit from the above.

Wasn't there a thread about that on -hackers actually? I cannot see any reference to it.

> Whether we need multiple bgreader and bgwriter processes or perhaps a
> general IO scheduler process may depend on whether we also want to
> switch to async (multiplexing from a single process). Starting simple
> with a traditional sync IO and N processes seems OK to me.

So you mean that we could just have a simple switch as a first step? Or I misunderstood you :)

One of the reasons why I have begun this thread is that, since we have heard about the fsync issues on Linux, I think there is room for giving our user base more control of their fate without relying on the Linux community's decisions to potentially eat data and corrupt a cluster with a page's dirty bit cleared without its data actually flushed. Even the latest kernels are not fixing all the patterns with open fds across processes, switching the problem from one corner of the table to another, and there are folks patching the Linux kernel to make Postgres more reliable from this perspective, and living happily with this option. As long as the option can be controlled and defaults to false, it seems to me that we could do something. Even if the performance is bad, this gives users control of how they want things to be done.
--
Michael
Attachment
> On 13 Jan 2019, at 14:02, Michael Paquier <michael@paquier.xyz> wrote:
>
>> 3. We probably want SLRUs to use the main buffer pool, instead of
>> their own mini-pools, so they can benefit from the above.
>
> Wasn't there a thread about that on -hackers actually? I cannot see
> any reference to it.

I think it's here: https://www.postgresql.org/message-id/flat/CAEepm%3D0o-%3Dd8QPO%3DYGFiBSqq2p6KOvPVKG3bggZi5Pv4nQw8nw%40mail.gmail.com#bacee3e6612c53c31658b18650e7ffd9

> As long as the option can be controlled and
> defaults to false, it seems to me that we could do something. Even if
> the performance is bad, this gives users control of how they
> want things to be done.

I like the idea of having this switch; I believe it will make development in this direction easier. But I think there will be complaints from users like "this feature is done wrong" due to really bad performance.

Best regards, Andrey Borodin.
On Sun, Jan 13, 2019 at 10:02 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Sun, Jan 13, 2019 at 10:35:55AM +1300, Thomas Munro wrote:
> > 1. We need a new "bgreader" process to do read-ahead. I think you'd
> > want a way to tell it with explicit hints (for example, perhaps
> > sequential scans would advertise that they're reading sequentially so
> > that it starts to slurp future blocks into the buffer pool, and
> > streaming replicas might look ahead in the WAL and tell it what's
> > coming). In theory this might be better than the heuristics OSes use
> > to guess our access pattern and pre-fetch into the page cache, since
> > we have better information (and of course we're skipping a buffer
> > layer).
>
> Yes, that could be interesting mainly for analytics by being able to
> snipe better than the OS readahead.
>
> > 2. We need a new kind of bgwriter/syncer that aggressively creates
> > clean pages so that foreground processes rarely have to evict (since
> > that is now super slow), but also efficiently finds ranges of dirty
> > blocks that it can write in big sequential chunks.
>
> Okay, that's a new idea. A bgwriter able to do syncs in chunks would
> be also interesting with O_DIRECT, no?

Well, I'm just describing the stuff that the OS is doing for us in another layer. Evicting dirty buffers currently consists of a buffered pwrite(), which we can do a huge number of per second (given enough spare RAM), but with O_DIRECT | O_SYNC we'll be limited by storage device random IOPS, so workloads that evict dirty buffers in foreground processes regularly will suffer. bgwriter should make sure we always find clean buffers without waiting when we need them. Yeah, I think pwrite() larger than 8KB at a time would be a goal, to get large IO request sizes all the way down to the storage.

> > 3. We probably want SLRUs to use the main buffer pool, instead of
> > their own mini-pools, so they can benefit from the above.
>
> Wasn't there a thread about that on -hackers actually? I cannot see
> any reference to it.

https://www.postgresql.org/message-id/flat/20180814213500.GA74618%4060f81dc409fc.ant.amazon.com

> > Whether we need multiple bgreader and bgwriter processes or perhaps a
> > general IO scheduler process may depend on whether we also want to
> > switch to async (multiplexing from a single process). Starting simple
> > with a traditional sync IO and N processes seems OK to me.
>
> So you mean that we could just have a simple switch as a first step?
> Or I misunderstood you :)

I just meant that if we take over all the read-ahead and write-behind work and use classic synchronous IO syscalls like pread()/pwrite(), we'll probably need multiple processes to do it, depending on how much IO concurrency the storage layer can take.
--
Thomas Munro
http://www.enterprisedb.com
From: Michael Paquier [mailto:michael@paquier.xyz]
> One of the reasons why I have begun this thread is that since we have heard
> about the fsync issues on Linux, I think that there is room for giving our
> user base more control of their fate without relying on the Linux community
> decisions to potentially eat data and corrupt a cluster with a page dirty
> bit cleared without its data actually flushed. Even the latest kernels
> are not fixing all the patterns with open fds across processes, switching
> the problem from one corner of the table to another, and there are folks
> patching the Linux kernel to make Postgres more reliable from this
> perspective, and living happily with this option. As long as the option
> can be controlled and defaults to false, it seems to me that we could do
> something. Even if the performance is bad, this gives the user control
> of how he/she wants things to be done.

Thank you for starting an interesting topic. We probably want direct I/O. On an INSERT- and UPDATE-heavy system with PostgreSQL 9.2, we suffered from occasional high response times due to Linux page cache activity. Postgres processes competed for the page cache to read/write the data files, write online and archive WAL files, and write the server log files (auto_explain and autovacuum workers emitted a lot of logs). A user with Oracle experience asked why PostgreSQL doesn't handle database I/O by itself...

And I wonder how useful direct I/O would be for low-latency devices like persistent memory. The overhead of the page cache may become relatively higher.

Regards
Takayuki Tsunakawa
> On 12 Jan 2019, at 9:46, Michael Paquier <michael@paquier.xyz> wrote:
>
> Note that pg_attribute_aligned() cannot be used as that's not an
> option with clang and a couple of other compilers as far as I know, so
> the patch uses a simple set of placeholder buffers large enough to be
> aligned with the OS pages, which should be 4k for Linux by the way,
> and not set to BLCKSZ, but for WAL's O_DIRECT we don't really care
> much with such details.

Is it possible to avoid those memcpy's by aligning the available buffers instead? I couldn't understand this from the patch and this thread.

Best regards, Andrey Borodin.
On Tue, Jan 15, 2019 at 11:19:48AM +0500, Andrey Borodin wrote:
> Is it possible to avoid those memcpy's by aligning available buffers
> instead? I couldn't understand this from the patch and this thread.

Sure, it had better do that. That's just a lazy implementation.
--
Michael
Attachment
On 1/15/19 11:28 AM, Michael Paquier wrote:
> On Tue, Jan 15, 2019 at 11:19:48AM +0500, Andrey Borodin wrote:
>> Is it possible to avoid those memcpy's by aligning available buffers
>> instead? I couldn't understand this from the patch and this thread.
> Sure, it had better do that. That's just a lazy implementation.

Hi! Could you specify all the cases in which buffers will not be aligned with BLCKSZ? AFAIK shared and temp buffers are aligned. Which ones are not?

--
Regards, Maksim Milyutin
On Tue, Jan 15, 2019 at 07:40:12PM +0300, Maksim Milyutin wrote:
> Could you specify all the cases in which buffers will not be aligned with BLCKSZ?
>
> AFAIK shared and temp buffers are aligned. Which ones are not?

SLRU buffers are not aligned with the OS pages (that is, not aligned to at least 4096 bytes). There are also a bunch of code paths where the callers of mdread() or mdwrite() don't do that, which makes a correct patch more invasive.
--
Michael
Attachment
On Sat, Jan 12, 2019 at 4:36 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> 1. We need a new "bgreader" process to do read-ahead. I think you'd
> want a way to tell it with explicit hints (for example, perhaps
> sequential scans would advertise that they're reading sequentially so
> that it starts to slurp future blocks into the buffer pool, and
> streaming replicas might look ahead in the WAL and tell it what's
> coming). In theory this might be better than the heuristics OSes use
> to guess our access pattern and pre-fetch into the page cache, since
> we have better information (and of course we're skipping a buffer
> layer).

Right, like if we're reading the end of relation file 16384, we can prefetch the beginning of 16384.1, but the OS won't know to do that.

> 2. We need a new kind of bgwriter/syncer that aggressively creates
> clean pages so that foreground processes rarely have to evict (since
> that is now super slow), but also efficiently finds ranges of dirty
> blocks that it can write in big sequential chunks.

Yeah.

> 3. We probably want SLRUs to use the main buffer pool, instead of
> their own mini-pools, so they can benefit from the above.

Right. I think this is important, and it makes me think that maybe Michael's patch won't help us much in the end. I believe that the number of pages needed for clog data, at least, can vary significantly depending on workload and machine size, so there's no one number there that is going to work for everybody, and the algorithms the SLRU code uses for page management have O(n) stuff in them, so they don't scale well to large numbers of SLRU buffers anyway. I think we should try to unify the SLRU stuff with shared_buffers, and then have a test patch like Michael's (not for commit) which we can use to see the impact of that, and then try to reduce that impact with the stuff you mention under #1 and #2.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company