On Tue, 2008-10-28 at 14:21 +0200, Heikki Linnakangas wrote:
> 1. You should avoid useless posix_fadvise() calls. In the naive
> implementation, where you simply call posix_fadvise() for every page
> referenced in every WAL record, you'll do 1-2 posix_fadvise() syscalls
> per WAL record, and that's a lot of overhead. We face the same design
> question as with Greg's patch to use posix_fadvise() to prefetch index
> and bitmap scans: what should the interface to the buffer manager look
> like? The simplest approach would be a new function call like
> AdviseBuffer(Relation, BlockNumber), that calls posix_fadvise() for the
> page if it's not in the buffer cache, but is a no-op otherwise. But that
> means more overhead, since for every page access, we need to find the
> page twice in the buffer cache; once for the AdviseBuffer() call, and
> 2nd time for the actual ReadBuffer().
That's a much smaller overhead than waiting for an I/O. The CPU overhead
isn't really a problem if we're I/O bound.
> It would be more efficient to pin
> the buffer in the AdviseBuffer() call already, but that requires much
> more changes to the callers.
That would be hard to clean up safely, and we'd also have a timing
problem: is there enough buffer space for all the prefetched blocks to
live in cache at once? If not, pinning them early would cause problems.
> 2. The format of each WAL record is different, so you need a "readahead
> handler" for every resource manager, for every record type. It would be
> a lot simpler if there was a standardized way to store that information
> in the WAL records.
I would prefer a new rmgr API call that returns a list of blocks. That's
better than trying to make every record type fit one pattern. If an rmgr
doesn't implement the call, it simply won't get prefetching.
> 3. IIRC I tried to handle just a few most important WAL records at
> first, but it turned out that you really need to handle all WAL records
> (that are used at all) before you see any benefit. Otherwise, every time
> you hit a WAL record that you haven't done posix_fadvise() on, the
> recovery "stalls", and you don't need many of those to diminish the gains.
>
> Not sure how these apply to your approach, it's very different. You seem
> to handle 1. by collecting all the page references for the WAL file, and
> sorting and removing the duplicates. I wonder how much CPU time is spent
> on that?
Removing duplicates seems like it will save CPU overall: each duplicate
dropped is a posix_fadvise() syscall we never make.
--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support