Matt Clark wrote:
> I'm thinking along the lines of an FS that's aware of PG's strategies and
> requirements and therefore optimised to make those activities as efficient
> as possible - possibly even being aware of PG's disk layout and treating
> files differently on that basis.

As someone else noted, this belongs in the kernel's block I/O layer or
buffer cache rather than in the filesystem. But I agree, an API by which we
can tell the kernel what kind of I/O behavior to expect would be good.

The kernel needs to provide good behavior for a wide range of
applications, but the DBMS can take advantage of a lot of
domain-specific information. In theory, being able to pass that
domain-specific information on to the kernel would mean we could get
better performance without needing to reimplement large chunks of
functionality that really ought to be done by the kernel anyway (as
implementing raw I/O would require, for example). On the other hand, it
would probably mean adding a fair bit of OS-specific hackery, which
we've largely managed to avoid in the past.

The closest API to what you're describing that I'm aware of is
posix_fadvise(). While that is, technically speaking, a POSIX standard, it
is not widely implemented (I know Linux 2.6 implements it; based on some
quick googling, it looks like AIX does too). Using posix_fadvise() has
been discussed in the past, so you might want to search the archives. We
could use POSIX_FADV_SEQUENTIAL to request more aggressive readahead on
a file that we know we're about to sequentially scan. We might be able
to use POSIX_FADV_NOREUSE on the WAL. We might be able to get away with
specifying POSIX_FADV_RANDOM for indexes all of the time, or at least
most of the time.
One question is how this would interact with concurrent access: AFAICS
there is no way to fetch the "current advice" on an fd.
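
To make the idea concrete, here is a rough sketch of how the advice
values above might be applied per file descriptor. The io_hint enum and
the hint_to_advice() and advise_fd() helpers are hypothetical names made
up for illustration, not existing PostgreSQL code, and whether each
piece of advice actually helps would need benchmarking on each platform.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 600   /* ask <fcntl.h> to declare posix_fadvise() */
#endif
#include <fcntl.h>
#include <sys/types.h>

/* Hypothetical labels for the access patterns discussed above. */
typedef enum
{
    IO_HINT_SEQSCAN,        /* heap file about to be scanned sequentially */
    IO_HINT_INDEX,          /* index file, mostly random access */
    IO_HINT_WAL             /* WAL segment, written once and not re-read */
} io_hint;

/* Map an access-pattern label to the corresponding POSIX advice value. */
static int
hint_to_advice(io_hint hint)
{
    switch (hint)
    {
        case IO_HINT_SEQSCAN:
            return POSIX_FADV_SEQUENTIAL;
        case IO_HINT_INDEX:
            return POSIX_FADV_RANDOM;
        case IO_HINT_WAL:
            return POSIX_FADV_NOREUSE;
    }
    return POSIX_FADV_NORMAL;
}

/*
 * Advise the kernel about the whole file behind fd (offset 0 with
 * len 0 means "through end of file"). Returns 0 on success or an
 * error number; posix_fadvise() reports errors via its return value
 * rather than by setting errno.
 */
static int
advise_fd(int fd, io_hint hint)
{
    return posix_fadvise(fd, 0, 0, hint_to_advice(hint));
}
```

A caller would invoke something like advise_fd(fd, IO_HINT_SEQSCAN)
right after opening a relation file for a sequential scan. The advice
is only a hint, so a failure here could reasonably be logged and
otherwise ignored.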
Also, I would imagine Win32 provides some means to inform the kernel
about your expected I/O pattern, but I haven't checked. Does anyone know
of any other relevant APIs?

-Neil