James Bottomley <James.Bottomley@HansenPartnership.com> writes:
> The current mechanism for coherency between a userspace cache and the
> in-kernel page cache is mmap ... that's the only way you get the same
> page in both currently.
Right.
> glibc used to have an implementation of read/write in terms of mmap, so
> it should be possible to insert it into your current implementation
> without a major rewrite. The problem I think this brings you is
> uncontrolled writeback: you don't want dirty pages to go to disk until
> you issue a write()
Exactly.
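To make the problem concrete, here is a minimal sketch in C of updating a file through a shared mapping. The function name update_via_mmap is made up for illustration and is not code from glibc or Postgres. It assumes the file already holds at least len bytes.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: overwrite the first len bytes of the file at path
 * through a MAP_SHARED mapping.  The file must already be at least len
 * bytes long.  Returns 0 on success, -1 on failure.
 */
int update_via_mmap(const char *path, const char *data, size_t len)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return -1;

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    /*
     * Once this store happens, the page is dirty in the kernel page cache
     * and the kernel may write it to disk at any time of its choosing.
     */
    memcpy(p, data, len);

    /* We can force writeback with msync(), but we cannot delay it. */
    int rc = msync(p, len, MS_SYNC);

    munmap(p, len);
    close(fd);
    return rc;
}
```

As far as I know, msync() only lets us make the write happen sooner. Nothing in this interface holds it back, which is the uncontrolled-writeback problem described above.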
> I think we could fix this with another madvise(): something like
> MADV_WILLUPDATE telling the page cache we expect to alter the pages
> again, so don't be aggressive about cleaning them.
"Don't be aggressive" isn't good enough. The prohibition on early write
has to be absolute, because writing a dirty page before we've done
whatever else we need to do results in a corrupt database. It has to
be treated like a write barrier.
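As an illustration of the ordering I mean, here is a rough sketch in C. It assumes the "whatever else" is making a log record durable first, as in write-ahead logging. The name flush_page_after_log and its parameters are hypothetical, and this is not Postgres's actual code.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: make the log record durable, and only then hand
 * the modified data page to the kernel.  Returns 0 on success, -1 on
 * failure.
 */
int flush_page_after_log(int log_fd, const void *rec, size_t rec_len,
                         int data_fd, const void *page, size_t page_len,
                         off_t page_off)
{
    /* Step 1: append the log record and force it to stable storage. */
    if (write(log_fd, rec, rec_len) != (ssize_t) rec_len)
        return -1;
    if (fsync(log_fd) != 0)
        return -1;

    /*
     * Step 2: until this call the page existed only in our private
     * buffer, so the kernel had no way to write it out early.
     */
    if (pwrite(data_fd, page, page_len, page_off) != (ssize_t) page_len)
        return -1;

    return 0;
}
```

With a private buffer and write(), the ordering is under our control. With a shared mapping, the dirty page is visible to the kernel before step 1 completes, so step 2 can in effect happen first.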
> The problem is we can't give you absolute control of when pages are
> written back because that interface can be used to DoS the system: once
> we get too many dirty uncleanable pages, we'll thrash looking for memory
> and the system will livelock.
Understood, but that makes this direction a dead end. We can't use
it if the kernel might decide to write anyway.
regards, tom lane