adding support for posix_fadvise() - Mailing list pgsql-hackers
From | Neil Conway |
---|---|
Subject | adding support for posix_fadvise() |
Date | |
Msg-id | 1067839664.3089.173.camel@tokyo |
List | pgsql-hackers |
A couple days ago, Manfred Spraul mentioned the posix_fadvise() API on -hackers:

    http://www.opengroup.org/onlinepubs/007904975/functions/posix_fadvise.html

I'm working on making use of posix_fadvise() where appropriate. I can think of the following places where this would be useful:

(1) As Manfred originally noted, when we advance to a new XLOG segment, we can use POSIX_FADV_DONTNEED to let the kernel know we won't be accessing the old WAL segment anymore. I've attached a quick kludge of a patch that implements this. I haven't done any benchmarking of it yet, though (comments or benchmark results are welcome).

(2) ISTM that we can set POSIX_FADV_RANDOM for *all* indexes, since the vast majority of the accesses to them shouldn't be sequential. Are there any situations in which this assumption doesn't hold? (Perhaps B+-tree bulk loading, or CLUSTER?) Should this be done per-index-AM, or globally?

(3) When doing VACUUM, ANALYZE, or large sequential scans (for some reasonable definition of "large"), we can use POSIX_FADV_SEQUENTIAL.

(4) Various other components, such as tuplestore, tuplesort, and any utility commands that need to scan through an entire user relation for some reason. Once we've got the APIs for doing this worked out, it should be relatively easy to add other uses of posix_fadvise().

(5) I'm hesitant to make use of POSIX_FADV_DONTNEED in VACUUM, as has been suggested elsewhere. The problem is that it's all-or-nothing: if the VACUUM happens to look at hot pages, these will be flushed from the page cache, so the net result may be a loss.

So what API is desirable for uses 2-4? I'm thinking of adding a new function to the smgr API, smgradvise().
Given a Relation and an advice, this would:

(a) propagate the advice for this relation to all the open FDs for the relation

(b) store the new advice somewhere so that new FDs for the relation can have this advice set for them: clients should just be able to call smgradvise() without needing to worry if someone else has already called smgropen() for the relation in the past. One problem is how to store this: I don't think it can be a field of RelationData, since that is transient. Any suggestions?

Note that I'm assuming that we don't need to set advice on sub-sections of a relation, although the posix_fadvise() API allows it -- does anyone think that would be useful?

One potential issue is that when one process calls posix_fadvise() on a particular FD, I'd expect that other processes accessing the same file will be affected. For example, enabling FADV_SEQUENTIAL while we're vacuuming a relation will mean that another client doing a concurrent SELECT on the relation will see different readahead behavior. That doesn't seem like a major problem though.

BTW, posix_fadvise() is currently only supported on Linux 2.6 w/ a recent version of glibc (BSD hackers, if you're listening, posix_fadvise() would be a very cool thing to have :P). So we'll need to do the appropriate configure magic to ensure we only use it where it's available. Thankfully, it is a POSIX standard, so I would expect that in the years to come it will be available on more platforms.

Any comments would be welcome.

-Neil
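For illustration only (this is not the attached patch), here is a minimal sketch in C of two pieces discussed above: a compile-time guard so that posix_fadvise() is only called where it exists, and step (a) of the smgradvise() idea, pushing one advice value to every FD already open for a relation. The names HAVE_POSIX_FADVISE, FileAdvice, advise_fd() and advise_open_fds() are hypothetical and do not exist in the tree. The guard simply checks for the advice constants in <fcntl.h> as a stand-in for a real configure test. Advice is applied to the whole file (offset 0, len 0), matching the assumption that sub-sections of a relation don't need separate advice.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* ask glibc to declare posix_fadvise() */
#endif

#include <assert.h>
#include <stddef.h>
#include <fcntl.h>

/*
 * Stand-in for a configure result (hypothetical symbol name): assume
 * posix_fadvise() is usable when <fcntl.h> defines the advice constants.
 */
#if !defined(HAVE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
#define HAVE_POSIX_FADVISE 1
#endif

/* Hypothetical platform-independent advice values. */
typedef enum
{
    ADVICE_NORMAL,
    ADVICE_RANDOM,
    ADVICE_SEQUENTIAL,
    ADVICE_DONTNEED
} FileAdvice;

/*
 * Apply advice to a whole file: offset 0 with len 0 covers the file from
 * start to end.  Returns 0 on success or an error number (posix_fadvise()
 * returns the error directly instead of setting errno).  Where
 * posix_fadvise() is missing this is a no-op, since advice is only a hint.
 */
static int
advise_fd(int fd, FileAdvice advice)
{
#ifdef HAVE_POSIX_FADVISE
    int posix_advice = POSIX_FADV_NORMAL;

    switch (advice)
    {
        case ADVICE_NORMAL:     posix_advice = POSIX_FADV_NORMAL;     break;
        case ADVICE_RANDOM:     posix_advice = POSIX_FADV_RANDOM;     break;
        case ADVICE_SEQUENTIAL: posix_advice = POSIX_FADV_SEQUENTIAL; break;
        case ADVICE_DONTNEED:   posix_advice = POSIX_FADV_DONTNEED;   break;
    }
    return posix_fadvise(fd, 0, 0, posix_advice);
#else
    (void) fd;
    (void) advice;
    return 0;
#endif
}

/*
 * Step (a) of the smgradvise() idea: push the advice to every FD that is
 * currently open for a relation.  Keeps going after a failure and returns
 * the first error number seen, or 0 if every call succeeded.
 */
static int
advise_open_fds(const int *fds, size_t nfds, FileAdvice advice)
{
    int     first_err = 0;
    size_t  i;

    assert(fds != NULL || nfds == 0);

    for (i = 0; i < nfds; i++)
    {
        int err = advise_fd(fds[i], advice);

        if (err != 0 && first_err == 0)
            first_err = err;
    }
    return first_err;
}
```

Under these assumptions, use (1) would amount to calling advise_fd(fd, ADVICE_DONTNEED) on the old WAL segment's FD before closing it, and use (2) to calling advise_open_fds() with ADVICE_RANDOM on an index's FDs. Step (b), remembering the advice for FDs opened later, is deliberately left out, since where to store it is still an open question.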