Re: Proposal of PITR performance improvement for 8.4. - Mailing list pgsql-hackers
From | Koichi Suzuki |
---|---|
Subject | Re: Proposal of PITR performance improvement for 8.4. |
Date | |
Msg-id | a778a7260810291758v76a048c6g9c83d0676de2d040@mail.gmail.com |
In response to | Re: Proposal of PITR performance improvement for 8.4. (Simon Riggs <simon@2ndQuadrant.com>) |
List | pgsql-hackers |
Hi,

2008/10/29 Simon Riggs <simon@2ndquadrant.com>:
>
> On Tue, 2008-10-28 at 14:21 +0200, Heikki Linnakangas wrote:
>
>> 1. You should avoid useless posix_fadvise() calls. In the naive
>> implementation, where you simply call posix_fadvise() for every page
>> referenced in every WAL record, you'll do 1-2 posix_fadvise() syscalls
>> per WAL record, and that's a lot of overhead. We face the same design
>> question as with Greg's patch to use posix_fadvise() to prefetch index
>> and bitmap scans: what should the interface to the buffer manager look
>> like? The simplest approach would be a new function call like
>> AdviseBuffer(Relation, BlockNumber), that calls posix_fadvise() for the
>> page if it's not in the buffer cache, but is a no-op otherwise. But that
>> means more overhead, since for every page access, we need to find the
>> page twice in the buffer cache; once for the AdviseBuffer() call, and
>> 2nd time for the actual ReadBuffer().
>
> That's a much smaller overhead than waiting for an I/O. The CPU overhead
> isn't really a problem if we're I/O bound.

As discussed last year in the context of parallel recovery and the
random read problem, recovery is really I/O bound, especially when
full-page writes (FPW) are not available. And it is not practical to
require all the archive logs to include huge FPWs.

>
>> It would be more efficient to pin
>> the buffer in the AdviseBuffer() call already, but that requires much
>> more changes to the callers.
>
> That would be hard to cleanup safely, plus we'd have difficulty with
> timing: is there enough buffer space to allow all the prefetched blocks
> live in cache at once? If not, this approach would cause problems.

I'm not positive about the AdviseBuffer() idea. If we do this, we need
all the pages referred to from a WAL segment to be in the shared buffer.
This may be several GB and will compete with the kernel cache. Current
PostgreSQL relies heavily on the kernel cache (and the kernel I/O
scheduler), and it is not a good idea to have a large shared buffer.
The worst case is giving half of the physical memory to the shared
buffer; the performance will be very bad. Rather, I prefer to ask the
kernel to prefetch.

>
>> 2. The format of each WAL record is different, so you need a "readahead
>> handler" for every resource manager, for every record type. It would be
>> a lot simpler if there was a standardized way to store that information
>> in the WAL records.
>
> I would prefer a new rmgr API call that returns a list of blocks. That's
> better than trying to make everything fit one pattern. If the call
> doesn't exist then that rmgr won't get prefetch.

Yes, I like this idea. Could you let me try this API through a
prefetch implementation in the core (if it is agreed)?

>
>> 3. IIRC I tried to handle just a few most important WAL records at
>> first, but it turned out that you really need to handle all WAL records
>> (that are used at all) before you see any benefit. Otherwise, every time
>> you hit a WAL record that you haven't done posix_fadvise() on, the
>> recovery "stalls", and you don't need much of those to diminish the gains.
>>
>> Not sure how these apply to your approach, it's very different. You seem
>> to handle 1. by collecting all the page references for the WAL file, and
>> sorting and removing the duplicates. I wonder how much CPU time is spent
>> on that?
>
> Removing duplicates seems like it will save CPU.

If we invoke posix_fadvise() on blocks that are already in the kernel
cache, the call does no useful work but still costs some overhead in
the kernel. I think duplicate removal saves more than it costs.

>
> --
> Simon Riggs www.2ndQuadrant.com
> PostgreSQL Training, Services and Support
>
>

--
------
Koichi Suzuki
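
For illustration only, the "collect the page references, sort them, remove the duplicates, then ask the kernel to prefetch" idea discussed above could look roughly like the C sketch below. It is not the actual patch. The names BlockRef, sort_and_dedup(), prefetch_blocks() and BLOCK_SIZE are hypothetical, and the sketch assumes the caller has already extracted (file descriptor, block number) pairs from the WAL records and that pages are 8192 bytes. Only posix_fadvise() with POSIX_FADV_WILLNEED is a real system interface here.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* expose posix_fadvise() under strict modes */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>

#define BLOCK_SIZE 8192   /* assumption: page size of 8192 bytes */

/* Hypothetical: one page reference extracted from a WAL record. */
typedef struct BlockRef
{
    int      fd;      /* open descriptor of the data file */
    unsigned blkno;   /* block number within that file */
} BlockRef;

/* Order by file first, then by block number. */
int
blockref_cmp(const void *a, const void *b)
{
    const BlockRef *x = a;
    const BlockRef *y = b;

    if (x->fd != y->fd)
        return (x->fd < y->fd) ? -1 : 1;
    if (x->blkno != y->blkno)
        return (x->blkno < y->blkno) ? -1 : 1;
    return 0;
}

/* Sort refs in place and drop duplicates; returns the unique count. */
size_t
sort_and_dedup(BlockRef *refs, size_t n)
{
    size_t i, last = 0;

    assert(refs != NULL || n == 0);
    if (n == 0)
        return 0;

    qsort(refs, n, sizeof(BlockRef), blockref_cmp);
    for (i = 1; i < n; i++)
    {
        /* Keep an entry only if it differs from the last one kept. */
        if (blockref_cmp(&refs[last], &refs[i]) != 0)
            refs[++last] = refs[i];
    }
    return last + 1;
}

/*
 * Issue one prefetch hint per unique block.
 * Returns the number of hints the kernel accepted.
 */
size_t
prefetch_blocks(BlockRef *refs, size_t n)
{
    size_t i, accepted = 0;
    size_t uniq = sort_and_dedup(refs, n);

    for (i = 0; i < uniq; i++)
    {
        off_t offset = (off_t) refs[i].blkno * BLOCK_SIZE;

        /* posix_fadvise() returns 0 on success, an error number otherwise. */
        if (posix_fadvise(refs[i].fd, offset, BLOCK_SIZE,
                          POSIX_FADV_WILLNEED) == 0)
            accepted++;
    }
    return accepted;
}
```

Sorting first means duplicates end up adjacent, so one linear pass removes them, and the hints reach the kernel in file and block order. Whether the CPU spent on qsort() pays off against the saved system calls is exactly the open question in this thread.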