Re: What is the posix_memalign() equivalent for the PostgreSQL? - Mailing list pgsql-hackers

From Robert Haas
Subject Re: What is the posix_memalign() equivalent for the PostgreSQL?
Date
Msg-id CA+TgmoZEcG3u7DzTpQtzYUpqRnvb3cKdW3G+ZbdM+9Lq=JeQTQ@mail.gmail.com
In response to Re: What is the posix_memalign() equivalent for the PostgreSQL?  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On Fri, Sep 2, 2016 at 1:17 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-09-02 13:05:37 -0400, Tom Lane wrote:
>> Anderson Carniel <accarniel@gmail.com> writes:
>> > If not, according to your experience, is there a
>> > significance difference between the performance of the O_DIRECT or not?
>>
>> AFAIK, nobody's really bothered to measure whether that would be useful
>> for Postgres.  The results would probably be quite platform-specific
>> anyway.
>
> I've played with patches to make postgres use O_DIRECT. On linux, it's
> rather beneficial for some workloads (fits into memory), but it also
> works really badly for some others, because our IO code isn't
> intelligent enough.  We pretty much rely on write() being nearly
> instantaneous when done by normal backends (during buffer replacement),
> we rely on readahead, we rely on the kernel to stopgap some bad
> replacement decisions we're making.

So, suppose we changed the world so that backends don't write dirty
buffers, or at least not normally.  If they need to perform a buffer
eviction, they first check the freelist, then run the clock sweep.
The clock sweep puts clean buffers on the freelist and dirty buffers
on a to-be-cleaned list.  A background process writes buffers on the
to-be-cleaned list and then adds them to the freelist afterward if the
usage count hasn't been bumped meanwhile.  As in Amit's bgreclaimer
patch, we have a target size for the freelist, with a low watermark
and a high watermark.  When we drop below the low watermark, the
background processes run the clock sweep and write from the
to-be-cleaned list to try to populate it; when we surge above the high
watermark, they go back to sleep.
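
To make the shape of that concrete, here is a rough single-process
sketch in C.  Every name in it (SketchPool, sweep_one, bgreclaimer_run,
the watermark values) is made up for illustration and none of it is
existing PostgreSQL code; locking, pinning and the actual write are
left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NBUFFERS       8               /* hypothetical pool size */
#define LOW_WATERMARK  2               /* hypothetical: wake the reclaimer below this */
#define HIGH_WATERMARK 4               /* hypothetical: reclaimer sleeps at this */
#define MAX_STEPS      (NBUFFERS * 8)  /* bound so the sketch cannot spin forever */

typedef struct {
    int  usage_count;
    bool dirty;
    bool on_list;       /* already on the freelist or the to-be-cleaned list */
} SketchBuffer;

typedef struct {
    int items[NBUFFERS];
    int count;
} SketchList;

typedef struct {
    SketchBuffer bufs[NBUFFERS];
    SketchList   freelist;
    SketchList   to_clean;
    int          clock_hand;
} SketchPool;

static void list_push(SketchList *l, int buf)
{
    assert(l->count < NBUFFERS);
    l->items[l->count++] = buf;
}

static int list_pop(SketchList *l)
{
    return l->count > 0 ? l->items[--l->count] : -1;
}

/* One clock-sweep step: decrement the usage count, or queue the buffer. */
static void sweep_one(SketchPool *p)
{
    int id = p->clock_hand;
    SketchBuffer *b = &p->bufs[id];

    p->clock_hand = (p->clock_hand + 1) % NBUFFERS;
    if (b->on_list)
        return;
    if (b->usage_count > 0) {
        b->usage_count--;
        return;
    }
    b->on_list = true;
    list_push(b->dirty ? &p->to_clean : &p->freelist, id);
}

/* Write one queued buffer; free it only if nobody touched it meanwhile. */
static void clean_one(SketchPool *p)
{
    int id = list_pop(&p->to_clean);

    if (id < 0)
        return;
    p->bufs[id].dirty = false;          /* stands in for the real write */
    if (p->bufs[id].usage_count == 0)
        list_push(&p->freelist, id);
    else
        p->bufs[id].on_list = false;
}

/* Background process: do nothing unless the freelist is below the low mark. */
static void bgreclaimer_run(SketchPool *p)
{
    if (p->freelist.count >= LOW_WATERMARK)
        return;
    for (int step = 0;
         step < MAX_STEPS && p->freelist.count < HIGH_WATERMARK; step++) {
        if (p->to_clean.count > 0)
            clean_one(p);
        else
            sweep_one(p);
    }
}

/* Backend: freelist first, then the clock sweep; never writes a buffer. */
static int backend_get_victim(SketchPool *p)
{
    for (int step = 0; step < MAX_STEPS; step++) {
        int id = list_pop(&p->freelist);

        if (id >= 0) {
            p->bufs[id].on_list = false;
            return id;
        }
        sweep_one(p);
    }
    return -1;          /* everything dirty or busy: wait for the reclaimer */
}
```

The point of the split is that backend_get_victim() never issues a
write itself; the worst case for a backend is coming back empty-handed
and having to wait for the background process.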

Further, suppose we also create a prefetch system, maybe based on the
synchronized scan machinery.  It preemptively pulls data into
shared_buffers if an ongoing scan will need it soon.  Or maybe don't
base it on the synchronized scan machinery, but instead just have a
queue that lets backends throw prefetch requests over the wall; when
the queue wraps, old requests are discarded.  A background process -
or perhaps one per tablespace or something like that - pulls the data
in.
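
The queue variant could look roughly like the following.  This is only
a sketch under assumptions: PrefetchQueue, PrefetchRequest and the two
functions are hypothetical names, the queue size is arbitrary, and a
real version would live in shared memory with proper locking or
atomics rather than being single-threaded like this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PREFETCH_QUEUE_SIZE 4   /* hypothetical; tiny so wraparound is easy to see */

typedef struct {
    uint32_t relation;          /* hypothetical relation identifier */
    uint32_t block;             /* block number to pull into shared_buffers */
} PrefetchRequest;

typedef struct {
    PrefetchRequest slots[PREFETCH_QUEUE_SIZE];
    uint64_t head;              /* total number of requests ever enqueued */
    uint64_t tail;              /* index of the next request to consume */
} PrefetchQueue;

/* Backend side: never blocks; a full queue overwrites the oldest request. */
static void prefetch_enqueue(PrefetchQueue *q, PrefetchRequest req)
{
    q->slots[q->head % PREFETCH_QUEUE_SIZE] = req;
    q->head++;
    /* If the producer lapped the consumer, the oldest requests are gone. */
    if (q->head - q->tail > PREFETCH_QUEUE_SIZE)
        q->tail = q->head - PREFETCH_QUEUE_SIZE;
}

/* Background side: returns false when there is nothing left to prefetch. */
static bool prefetch_dequeue(PrefetchQueue *q, PrefetchRequest *out)
{
    assert(out != NULL);
    if (q->tail == q->head)
        return false;
    *out = q->slots[q->tail % PREFETCH_QUEUE_SIZE];
    q->tail++;
    return true;
}
```

Dropping the oldest entries is deliberate here: a prefetch request is
only a hint, so losing a stale one costs a cache miss rather than
correctness, and backends never have to wait on a full queue.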

Neither of those things seems that hard.  And if we could do those
things and make them work, then maybe we could offer direct I/O as an
option.  We'd still lose heavily in the case where our buffer eviction
decisions are poor, but that'd probably spur some improvement in that
area, which IMHO would be a good thing.

I personally think direct I/O would be a really good thing, not least
because O_ATOMIC is designed to allow MySQL to avoid the doublewrite
buffer, their alternative to full page writes.  But we can't use it
because it requires O_DIRECT.  The savings are probably massive.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


