adding support for posix_fadvise() - Mailing list pgsql-hackers

From Neil Conway
Subject adding support for posix_fadvise()
Date
Msg-id 1067839664.3089.173.camel@tokyo
List pgsql-hackers
A couple days ago, Manfred Spraul mentioned the posix_fadvise() API on
-hackers:

http://www.opengroup.org/onlinepubs/007904975/functions/posix_fadvise.html

I'm working on making use of posix_fadvise() where appropriate. I can
think of the following places where this would be useful:

(1) As Manfred originally noted, when we advance to a new XLOG segment,
we can use POSIX_FADV_DONTNEED to let the kernel know we won't be
accessing the old WAL segment anymore. I've attached a quick kludge of a
patch that implements this. I haven't done any benchmarking of it yet,
though (comments or benchmark results are welcome).

(2) ISTM that we can set POSIX_FADV_RANDOM for *all* indexes, since the
vast majority of the accesses to them shouldn't be sequential. Are there
any situations in which this assumption doesn't hold? (Perhaps B+-tree
bulk loading, or CLUSTER?) Should this be done per-index-AM, or
globally?

(3) When doing VACUUM, ANALYZE, or large sequential scans (for some
reasonable definition of "large"), we can use POSIX_FADV_SEQUENTIAL.

(4) Various other components, such as tuplestore, tuplesort, and any
utility commands that need to scan through an entire user relation for
some reason. Once we've got the APIs for doing this worked out, it
should be relatively easy to add other uses of posix_fadvise().

(5) I'm hesitant to make use of POSIX_FADV_DONTNEED in VACUUM, as has
been suggested elsewhere. The problem is that it's all-or-nothing: if
the VACUUM happens to look at hot pages, these will be flushed from the
page cache, so the net result may be a loss.
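For concreteness, items (1) through (3) all reduce to whole-file posix_fadvise() calls. What follows is only a rough sketch, not the attached patch: the helper names advise_file_done() and advise_access_pattern() are made up for illustration, and the caller is assumed to already hold an open file descriptor.

```c
#define _XOPEN_SOURCE 600   /* ask the C library to declare posix_fadvise() */
#include <fcntl.h>

/*
 * Hypothetical helper for item (1): tell the kernel we are finished with
 * an entire file, e.g. the old WAL segment after switching to a new one.
 * offset = 0 with len = 0 means "from the start through the end of file".
 * posix_fadvise() returns 0 on success or an error number on failure;
 * it does not set errno.
 */
static int
advise_file_done(int fd)
{
    return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}

/*
 * Hypothetical helper for items (2) and (3): declare the expected access
 * pattern for a whole file. A nonzero "sequential" selects
 * POSIX_FADV_SEQUENTIAL (large scans, VACUUM, ANALYZE); zero selects
 * POSIX_FADV_RANDOM (index access).
 */
static int
advise_access_pattern(int fd, int sequential)
{
    int advice = sequential ? POSIX_FADV_SEQUENTIAL : POSIX_FADV_RANDOM;

    return posix_fadvise(fd, 0, 0, advice);
}
```

Since the advice is only a hint, a caller would presumably ignore or merely log a nonzero return value rather than treat it as an error.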

So what API is desirable for uses 2-4? I'm thinking of adding a new
function to the smgr API, smgradvise(). Given a Relation and an advice,
this would:

(a) propagate the advice for this relation to all the open FDs for the
relation

(b) store the new advice somewhere so that new FDs for the relation can
have this advice set for them: clients should just be able to call
smgradvise() without needing to worry if someone else has already called
smgropen() for the relation in the past. One problem is how to store
this: I don't think it can be a field of RelationData, since that is
transient. Any suggestions?
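To make (a) and (b) concrete, here is a minimal sketch of the bookkeeping. Everything in it is hypothetical: RelFiles stands in for whatever per-relation state the storage manager would keep, and smgradvise_sketch() and relfiles_add_fd() are invented names, not existing smgr functions. It deliberately sidesteps the open question of where this state should live.

```c
#define _XOPEN_SOURCE 600   /* ask the C library to declare posix_fadvise() */
#include <fcntl.h>

#define MAX_REL_FDS 8       /* arbitrary limit, for the sketch only */

/* Hypothetical stand-in for per-relation storage-manager state. */
typedef struct RelFiles
{
    int fds[MAX_REL_FDS];   /* descriptors currently open for the relation */
    int nfds;               /* number of valid entries in fds[] */
    int advice;             /* remembered advice for descriptors opened later */
} RelFiles;

static void
relfiles_init(RelFiles *rel)
{
    rel->nfds = 0;
    rel->advice = POSIX_FADV_NORMAL;
}

/*
 * (b) remember the advice, then (a) apply it to every descriptor that is
 * already open. Returns 0, or the first error number reported by
 * posix_fadvise().
 */
static int
smgradvise_sketch(RelFiles *rel, int advice)
{
    int first_err = 0;

    rel->advice = advice;
    for (int i = 0; i < rel->nfds; i++)
    {
        int rc = posix_fadvise(rel->fds[i], 0, 0, advice);

        if (rc != 0 && first_err == 0)
            first_err = rc;
    }
    return first_err;
}

/*
 * Called when a new descriptor is opened for the relation: record it and
 * give it the remembered advice. Returns -1 if the table is full,
 * otherwise the result of posix_fadvise().
 */
static int
relfiles_add_fd(RelFiles *rel, int fd)
{
    if (rel->nfds >= MAX_REL_FDS)
        return -1;
    rel->fds[rel->nfds++] = fd;
    return posix_fadvise(fd, 0, 0, rel->advice);
}
```

With this shape, a caller can set the advice before or after the relation's files are opened and get the same result, which is the property (b) asks for.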

Note that I'm assuming that we don't need to set advice on sub-sections
of a relation, although the posix_fadvise() API allows it -- does anyone
think that would be useful?
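For reference, this is roughly what advice on a sub-section would look like. The helper advise_block_range() is hypothetical, and BLCKSZ is hard-coded to 8192 (the default block size) purely for the sketch.

```c
#define _XOPEN_SOURCE 600   /* ask the C library to declare posix_fadvise() */
#include <fcntl.h>
#include <sys/types.h>

#define BLCKSZ 8192         /* assumed block size, for illustration only */

/*
 * Hypothetical helper: apply advice to nblocks blocks starting at
 * first_block, rather than to the whole file. Note that nblocks = 0
 * yields len = 0, which posix_fadvise() treats as "through end of file".
 */
static int
advise_block_range(int fd, long first_block, long nblocks, int advice)
{
    off_t offset = (off_t) first_block * BLCKSZ;
    off_t len = (off_t) nblocks * BLCKSZ;

    return posix_fadvise(fd, offset, len, advice);
}
```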

One potential issue is that when one process calls posix_fadvise() on a
particular FD, I'd expect that other processes accessing the same file
will be affected. For example, enabling FADV_SEQUENTIAL while we're
vacuuming a relation will mean that another client doing a concurrent
SELECT on the relation will see different readahead behavior. That
doesn't seem like a major problem though.

BTW, posix_fadvise() is currently only supported on Linux 2.6 w/ a
recent version of glibc (BSD hackers, if you're listening,
posix_fadvise() would be a very cool thing to have :P). So we'll need to
do the appropriate configure magic to ensure we only use it where it's
available. Thankfully, it is a POSIX standard, so I would expect that in
the years to come it will be available on more platforms.
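The configure magic could end up as a compile-time guard along these lines. This is a sketch under assumptions: HAVE_POSIX_FADVISE is the macro an autoconf AC_CHECK_FUNCS(posix_fadvise) test would define, and pg_fadvise_sketch() is an invented wrapper name.

```c
#define _XOPEN_SOURCE 600   /* ask the C library to declare posix_fadvise() */
#include <fcntl.h>
#include <sys/types.h>

/*
 * Hypothetical wrapper: forward to posix_fadvise() where configure found
 * it, and quietly do nothing elsewhere. Skipping the call is safe because
 * the advice is only a hint to the kernel.
 */
static int
pg_fadvise_sketch(int fd, off_t offset, off_t len, int advice)
{
#ifdef HAVE_POSIX_FADVISE
    return posix_fadvise(fd, offset, len, advice);
#else
    (void) fd;
    (void) offset;
    (void) len;
    (void) advice;
    return 0;               /* no-op on platforms without posix_fadvise() */
#endif
}
```

Callers on platforms without posix_fadvise() would also need fallback definitions of the POSIX_FADV_* constants (or their own advice enum), which this sketch leaves out.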

Any comments would be welcome.

-Neil
