Re: Proposal of PITR performance improvement for 8.4. - Mailing list pgsql-hackers

From Koichi Suzuki
Subject Re: Proposal of PITR performance improvement for 8.4.
Date
Msg-id a778a7260810291758v76a048c6g9c83d0676de2d040@mail.gmail.com
In response to Re: Proposal of PITR performance improvement for 8.4.  (Simon Riggs <simon@2ndQuadrant.com>)
List pgsql-hackers
Hi,


2008/10/29 Simon Riggs <simon@2ndquadrant.com>:
>
> On Tue, 2008-10-28 at 14:21 +0200, Heikki Linnakangas wrote:
>
>> 1. You should avoid useless posix_fadvise() calls. In the naive
>> implementation, where you simply call posix_fadvise() for every page
>> referenced in every WAL record, you'll do 1-2 posix_fadvise() syscalls
>> per WAL record, and that's a lot of overhead. We face the same design
>> question as with Greg's patch to use posix_fadvise() to prefetch index
>> and bitmap scans: what should the interface to the buffer manager look
>> like? The simplest approach would be a new function call like
>> AdviseBuffer(Relation, BlockNumber), that calls posix_fadvise() for the
>> page if it's not in the buffer cache, but is a no-op otherwise. But that
>> means more overhead, since for every page access, we need to find the
>> page twice in the buffer cache; once for the AdviseBuffer() call, and
>> 2nd time for the actual ReadBuffer().
>
> That's a much smaller overhead than waiting for an I/O. The CPU overhead
> isn't really a problem if we're I/O bound.

As discussed last year in the threads on parallel recovery and the
random read problem, recovery is really I/O bound, especially when FPW
(full-page writes) is not available.   And it is not practical to
require every archive log to include huge FPWs.

>
>> It would be more efficient to pin
>> the buffer in the AdviseBuffer() call already, but that requires much
>> more changes to the callers.
>
> That would be hard to cleanup safely, plus we'd have difficulty with
> timing: is there enough buffer space to allow all the prefetched blocks
> live in cache at once? If not, this approach would cause problems.

I'm not positive about the AdviseBuffer() idea.   If we do this, we
need all the pages referred to from a WAL segment in the shared
buffers.   This may be several GB and will compete with the kernel
cache.   Current PostgreSQL relies heavily on the kernel cache (and the
kernel I/O scheduler), and it is not a good idea to have large shared
buffers.   The worst case is to give half of the physical memory to
the shared buffers; the performance will be very bad.   Rather, I
prefer to ask the kernel to prefetch.
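
To make "ask the kernel to prefetch" concrete, here is a minimal sketch
of a single prefetch hint. It is only an illustration, not code from
any patch: prefetch_block() and BLOCK_SIZE are names assumed for this
example, and the 8192-byte block size is PostgreSQL's default. The only
real API used is posix_fadvise() with POSIX_FADV_WILLNEED.

```c
#define _XOPEN_SOURCE 600   /* exposes posix_fadvise() in <fcntl.h> */
#include <fcntl.h>
#include <sys/types.h>

#define BLOCK_SIZE 8192     /* assumed: PostgreSQL's default block size */

/*
 * Hypothetical helper: hint the kernel that block `blkno` of the already
 * open data file `fd` will be read soon, so the read can start in the
 * background. Nothing is pinned or copied into shared buffers.
 *
 * Returns 0 on success, or an error number on failure. Note that
 * posix_fadvise() returns the error directly instead of setting errno.
 */
static int
prefetch_block(int fd, unsigned int blkno)
{
    off_t offset = (off_t) blkno * BLOCK_SIZE;

    return posix_fadvise(fd, offset, BLOCK_SIZE, POSIX_FADV_WILLNEED);
}
```

The hint is advisory only. The kernel may ignore it, and the later
read of the block is still needed to bring the page into the shared
buffers.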

>
>> 2. The format of each WAL record is different, so you need a "readahead
>> handler" for every resource manager, for every record type. It would be
>> a lot simpler if there was a standardized way to store that information
>> in the WAL records.
>
> I would prefer a new rmgr API call that returns a list of blocks. That's
> better than trying to make everything fit one pattern. If the call
> doesn't exist then that rmgr won't get prefetch.

Yes, I like this idea.   Could you let me try this API as part of a
prefetch implementation in the core (if it is agreed)?
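
One possible shape for such a call, as a rough sketch only. Every name
here (BlockRef, WalRecord, GetBlocksFn, ResourceManager, rmgr_table,
heap_get_blocks, collect_blocks) is hypothetical and is not the
existing rmgr interface. The toy "heap" record simply carries one
block reference as its payload. The point is just that a resource
manager without the callback gets no prefetch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical: identifies one data block touched by a WAL record. */
typedef struct BlockRef
{
    unsigned int rel_node;      /* which relation file */
    unsigned int block_no;      /* block number within that file */
} BlockRef;

/* Hypothetical, heavily simplified WAL record. */
typedef struct WalRecord
{
    unsigned char rmid;         /* index of the owning resource manager */
    const void   *data;         /* rmgr-specific payload */
    size_t        len;          /* payload length in bytes */
} WalRecord;

/* Fills refs[] (at most max_refs entries), returns how many were stored. */
typedef int (*GetBlocksFn) (const WalRecord *rec, BlockRef *refs, int max_refs);

typedef struct ResourceManager
{
    const char *name;
    GetBlocksFn get_blocks;     /* NULL: this rmgr gets no prefetch */
} ResourceManager;

/* Toy handler: the payload of this record type is exactly one BlockRef. */
static int
heap_get_blocks(const WalRecord *rec, BlockRef *refs, int max_refs)
{
    if (max_refs < 1 || rec->len < sizeof(BlockRef))
        return 0;
    memcpy(&refs[0], rec->data, sizeof(BlockRef));
    return 1;
}

static const ResourceManager rmgr_table[] = {
    {"heap", heap_get_blocks},
    {"xact", NULL},             /* no callback, so it is skipped */
};

#define NUM_RMGRS (sizeof(rmgr_table) / sizeof(rmgr_table[0]))

/* Dispatch to the owning rmgr; records without a callback yield 0 blocks. */
static int
collect_blocks(const WalRecord *rec, BlockRef *refs, int max_refs)
{
    GetBlocksFn fn;

    if (rec->rmid >= NUM_RMGRS)
        return 0;
    fn = rmgr_table[rec->rmid].get_blocks;
    return fn ? fn(rec, refs, max_refs) : 0;
}
```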

>
>> 3. IIRC I tried to handle just a few most important WAL records at
>> first, but it turned out that you really need to handle all WAL records
>> (that are used at all) before you see any benefit. Otherwise, every time
>> you hit a WAL record that you haven't done posix_fadvise() on, the
>> recovery "stalls", and you don't need much of those to diminish the gains.
>>
>> Not sure how these apply to your approach, it's very different. You seem
>> to handle 1. by collecting all the page references for the WAL file, and
>> sorting and removing the duplicates. I wonder how much CPU time is spent
>> on that?
>
> Removing duplicates seems like it will save CPU.

If we invoke posix_fadvise() for blocks already in the kernel
cache, the call does nothing useful but still costs some overhead in
the kernel.   I think duplicate removal saves more than it costs.
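
For illustration, the sort-and-remove-duplicates step could look like
the sketch below. It is not taken from the actual implementation:
BlockRef, blockref_cmp() and sort_and_dedup() are names assumed for
this example, and only the standard qsort() is used. After this step
each distinct block would receive a single posix_fadvise() call.

```c
#include <stdlib.h>

/* Hypothetical: identifies one data block referenced by a WAL record. */
typedef struct BlockRef
{
    unsigned int rel_node;      /* which relation file */
    unsigned int block_no;      /* block number within that file */
} BlockRef;

/* Order by relation first, then by block number. */
static int
blockref_cmp(const void *a, const void *b)
{
    const BlockRef *x = a;
    const BlockRef *y = b;

    if (x->rel_node != y->rel_node)
        return x->rel_node < y->rel_node ? -1 : 1;
    if (x->block_no != y->block_no)
        return x->block_no < y->block_no ? -1 : 1;
    return 0;
}

/*
 * Sort refs[0..n-1] in place and squeeze out duplicates.
 * Returns the number of distinct entries left at the front of the array.
 */
static size_t
sort_and_dedup(BlockRef *refs, size_t n)
{
    size_t out = 1;
    size_t i;

    if (n == 0)
        return 0;

    qsort(refs, n, sizeof(BlockRef), blockref_cmp);

    for (i = 1; i < n; i++)
    {
        /* keep an entry only if it differs from the last one kept */
        if (blockref_cmp(&refs[i], &refs[out - 1]) != 0)
            refs[out++] = refs[i];
    }
    return out;
}
```

The cost is one O(n log n) sort per WAL segment, against one system
call saved for every duplicate reference.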

>
> --
>  Simon Riggs           www.2ndQuadrant.com
>  PostgreSQL Training, Services and Support
>
>



-- 
------
Koichi Suzuki

