Re: [PATCH] Prefetch index pages for B-Tree index scans - Mailing list pgsql-hackers

From John Lumby
Subject Re: [PATCH] Prefetch index pages for B-Tree index scans
Date
Msg-id COL116-W68157B12672AC6F1A06E4A3600@phx.gbl
In response to Re: [PATCH] Prefetch index pages for B-Tree index scans  (Claudio Freire <klaussfreire@gmail.com>)
Responses Re: [PATCH] Prefetch index pages for B-Tree index scans
List pgsql-hackers
Claudio wrote:
>
> Oops - forgot to effectively attach the patch.
>

I've read through your patch and the earlier posts by you and Cédric.

This is very interesting. You chose to prefetch the index btree (key-ptr) pages,
whereas I chose to prefetch the data pages pointed to by the key-ptr pages.
Never mind why; I think they should work very well together, since both have
been demonstrated to produce improvements. I will see if I can combine them,
git permitting (the two patches of course touch overlapping sets of files).

I was surprised by this design decision:
    /* start prefetch on next page, but not if we're reading sequentially already, as it's counterproductive in those cases */
Is it really? Are you assuming that it's redundant with posix_fadvise for this case?
I think that when async_io is also in use by the postgresql prefetcher,
this decision could possibly change.
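
To make the decision under discussion concrete, here is a minimal sketch in C of "hint the next page unless the scan is already sequential". It is not the code from Claudio's patch: the names sketch_should_prefetch and sketch_prefetch_next_page and the SKETCH_BLCKSZ constant are hypothetical, and the sketch assumes that "sequential" means the next page number is exactly the current one plus one. Only posix_fadvise with POSIX_FADV_WILLNEED is a real call.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L     /* exposes posix_fadvise */
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>

/* Assumed page size for the sketch; 8192 is PostgreSQL's default BLCKSZ. */
#define SKETCH_BLCKSZ 8192

/*
 * Hypothetical helper: return 1 if a prefetch hint should be issued.
 * Assumption: a forward step of exactly one page counts as "already
 * sequential", and the hint is skipped in that case.
 */
int
sketch_should_prefetch(long cur_page, long next_page)
{
    return next_page != cur_page + 1;
}

/*
 * Hypothetical helper: hint the kernel that next_page will be read soon.
 * Returns 0 if the hint was skipped or accepted, otherwise the error
 * number reported by posix_fadvise.
 */
int
sketch_prefetch_next_page(int fd, long cur_page, long next_page)
{
    assert(next_page >= 0);

    if (!sketch_should_prefetch(cur_page, next_page))
        return 0;               /* sequential: no hint issued */

    return posix_fadvise(fd,
                         (off_t) next_page * SKETCH_BLCKSZ,
                         SKETCH_BLCKSZ,
                         POSIX_FADV_WILLNEED);
}
```

With this shape, a step from page 10 to page 11 issues no hint, while a step from page 10 to page 42 (or backward to page 9) does. Swapping the posix_fadvise call for an async read is exactly where the sequential test might deserve a second look.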

Cédric wrote:
> If the gain is visible mostly for the backward and not for other access
>
> building the latest kernel with that patch included, replicating the
>

I found an improvement for forward scans.
Actually I did not even try backward scans, but only because I did not have time.
It should help both.

>> I don't even think windows supports posix_fadvise, but if async_io is
>> used (as hinted by the link Lumby posted), it would probably also work
>> in windows.

Windows has async I/O, and I think it would not be hard to extend my implementation
to Windows (although I don't plan to do it myself). Actually, about 95% of the code I wrote
to implement async I/O in postgresql concerns not the async I/O itself, which is trivial,
but the buffer management. With async I/O, PrefetchBuffer must allocate and pin a
buffer (not too hard), but now every other part of the buf mgr must also know that
a buffer may be in the async_io_in_progress state and be prepared to
determine whether the I/O has completed (quite hard). Also, if and when the prefetch requester
comes back with ReadBuffer, the buf mgr has to remember that this buffer was already pinned by
this backend during the earlier prefetch and must not be pinned a second time
(very hard without increasing the size of the shared descriptor, which was important
since there could be a very large number of these).
It ended up as a major change to bufmgr.c plus one new file for handling
the buffer management aspects of starting, checking and terminating async I/O.
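
The bookkeeping described above can be illustrated with a toy model in C. This is only a sketch under stated assumptions, not the code in the patch: every type and function name here is hypothetical, the I/O is simulated by a state change, and the "pinned during prefetch" memory is kept in a backend-local array as one possible way to avoid growing the shared descriptor.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_NBUFFERS 8

typedef enum { SK_FREE, SK_AIO_IN_PROGRESS, SK_VALID } SketchBufState;

/* Stand-in for the shared descriptor: no per-backend fields are added. */
typedef struct
{
    long            page;
    SketchBufState  state;
    int             pin_count;
} SketchBufDesc;

/* Backend-local memory of which buffers this backend pinned via prefetch. */
typedef struct
{
    bool            prefetch_pinned[SKETCH_NBUFFERS];
} SketchBackend;

static SketchBufDesc sketch_buffers[SKETCH_NBUFFERS];

/* Prefetch: pin the buffer, mark the async read started, remember the pin. */
void
sketch_prefetch_buffer(SketchBackend *me, int buf, long page)
{
    assert(sketch_buffers[buf].state == SK_FREE);
    sketch_buffers[buf].page = page;
    sketch_buffers[buf].state = SK_AIO_IN_PROGRESS;
    sketch_buffers[buf].pin_count++;
    me->prefetch_pinned[buf] = true;
}

/* Stand-in for checking or waiting on completion of the async read. */
void
sketch_complete_io(int buf)
{
    if (sketch_buffers[buf].state == SK_AIO_IN_PROGRESS)
        sketch_buffers[buf].state = SK_VALID;
}

/* Read: any caller may find the I/O still in flight and must resolve it. */
void
sketch_read_buffer(SketchBackend *me, int buf)
{
    sketch_complete_io(buf);

    if (me->prefetch_pinned[buf])
        me->prefetch_pinned[buf] = false;   /* reuse the prefetch pin */
    else
        sketch_buffers[buf].pin_count++;    /* ordinary new pin */
}
```

In this model, a backend that prefetches buffer 3 and then reads it ends up holding a single pin, whereas a second backend reading the same buffer adds its own. The real difficulty is that every code path touching a buffer has to cope with the in-progress state, which a toy like this only hints at.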

However, I think that in some environments async I/O has significant benefits over
posix_fadvise, especially (of course!) where access is very non-sequential,
but even for sequential access if there are many concurrent, conflicting
sequential command streams from different backends
(always assuming the RAID can handle them concurrently).

I've attached a snapshot patch of just the non-bitmap-index-scan changes I've made.
You can't compile it as is, because I had to change the interface to PrefetchBuffer
and add a new DiscardBuffer, which I did not include in this snapshot to avoid confusion.

John


Attachment
