Re: [PATCH] Prefetch index pages for B-Tree index scans - Mailing list pgsql-hackers
From: John Lumby
Subject: Re: [PATCH] Prefetch index pages for B-Tree index scans
Msg-id: COL116-W162AEAA64173E77D4597EEA3670@phx.gbl
In response to: Re: [PATCH] Prefetch index pages for B-Tree index scans (Claudio Freire <klaussfreire@gmail.com>)
Responses: Re: [PATCH] Prefetch index pages for B-Tree index scans
List: pgsql-hackers
Claudio Freire wrote:
> On Thu, Nov 1, 2012 at 10:59 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> > On 11/1/12 6:13 PM, Claudio Freire wrote:
> >
> >> posix_fadvise what's the trouble there, but the fact that the kernel
> >> stops doing read-ahead when a call to posix_fadvise comes. I noticed
> >> the performance hit, and checked the kernel's code. It effectively
> >> changes the prediction mode from sequential to fadvise, negating the
> >> (assumed) kernel's prefetch logic.
> >
> > The Linux posix_fadvise implementation never seemed like it was well liked
> > by the kernel developers. Quirky stuff like this popped up all the time
> > during that period, when effective_io_concurrency was being added. I wonder
> > how far back the fadvise/read-ahead conflict goes.
>
> Well, to be precise it's not so much a problem in posix_fadvise
> itself, it's a problem in how it interacts with readahead. Since
> readahead works at the memory mapper level, and only when actually
> performing I/O (which would seem at first glance quite sensible), it
> doesn't get to see fadvise activity.
>
> FADV_WILLNEED is implemented as a forced readahead, which doesn't
> update any of the readahead context structures. Again, at first
> glance, this would seem sensible (explicit hints shouldn't interfere
> with pattern detection logic). However, since those pages are (after
> the fadvise call) under async I/O, next time the memory mapper needs
> that page, instead of requesting I/O through readahead logic, it will
> wait for async I/O to complete.
>
> IOW, what was sequential in fact became invisible to readahead,
> indistinguishable from random I/O. Whatever page fadvise failed to
> predict will be treated as random I/O, and here the trouble lies.

And this may be one other advantage of async io over posix_fadvise in the
linux environment (with the present mmap behaviour): that async io achieves
the same objective of improving disk/processing overlap without the
mentioned interference with read-ahead. Although to confirm this would
ideally require a 3-way comparison of:

  .  posix_fadvise + existing readahead behaviour
  .  posix_fadvise + existing readahead behaviour modified to not force
     waiting for the current async io (i.e. just check the aio and continue
     normal readahead if in progress)
  .  async io with no posix_fadvise

It seems in general to be preferable to have an implementation that is less
dependent on specific behaviour of the OS read-ahead mechanism.
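For reference, this is roughly what the fadvise-style prefetch hint discussed
above amounts to at the syscall level. A minimal sketch only -- the fd, block
size and block number are placeholders rather than the actual smgr/md
interfaces, and error handling is omitted:

    #include <fcntl.h>

    /* Hint the kernel to start reading one block in the background.
     * The call returns immediately and the backend keeps working; a later
     * read() of the same range should then find the page already cached --
     * unless, as described above, the hint has switched the file out of
     * the kernel's sequential readahead mode.
     */
    static void
    prefetch_block_hint(int fd, off_t blcksz, off_t blocknum)
    {
        (void) posix_fadvise(fd, blocknum * blcksz, blcksz, POSIX_FADV_WILLNEED);
    }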
> >> I've mused about the possibility to batch async_io requests, and use
> >> the scatter/gather API instead of sending tons of requests to the
> >> kernel. I think doing so would enable a zero-copy path that could very
> >> possibly imply big speed improvements when memory bandwidth is the
> >> bottleneck.
> >
> > Another possibly useful bit of history here for you. Greg Stark wrote a
> > test program that used async I/O effectively on both Linux and Solaris.
> > Unfortunately, it was hard to get that to work given how Postgres does its
> > buffer I/O, and using processes instead of threads. This looks like the
> > place he commented on why:
> >
> > http://postgresql.1045698.n5.nabble.com/Multi-CPU-Queries-Feedback-and-or-suggestions-wanted-td1993361i20.html
> >
> > The part I think was relevant there from him:
> >
> > "In the libaio view of the world you initiate io and either get a
> > callback or call another syscall to test if it's complete. Either
> > approach has problems for Postgres. If the process that initiated io
> > is in the middle of a long query it might take a long time, or even
> > never get back to complete the io. The callbacks use threads...
> >
> > And polling for completion has the problem that another process could
> > be waiting on the io and can't issue a read as long as the first
> > process has the buffer locked and io in progress. I think aio makes a
> > lot more sense if you're using threads so you can start a thread to
> > wait for the io to complete."
>
> I noticed that. I always envisioned async I/O as managed by some
> dedicated process. One that could check for completion or receive
> callbacks. Postmaster, for instance.

Thanks for mentioning this posting. Interesting. However, the OP describes
an implementation based on libaio. Today what we have (for linux) is librt,
which is quite different. It is arguably worse than libaio (well, actually I
am sure it is worse) since it is essentially just an encapsulation of using
threads to do synchronous ios - you can look at it as making it easier to do
what the application could do itself if it set up its own pthreads. The
linux kernel does not know about it, and so the CPU overhead of checking for
completion is higher.

But if async io is used *only* for prefetching, and not for the actual
ReadBuffer itself (which is what I did), then the problem mentioned by the
OP - "If the process that initiated io is in the middle of a long query it
might take a long time, or never get back to complete the io" - is not a
problem. The model is simple (a rough sketch follows at the end of this
mail):

1.  backend process P1 requests prefetch on a relation/block R/B, which
    results in initiating aio_read using a (new) shared control block (call
    it pg_buf_aiocb) which tracks the request and also contains librt's
    aiocb.

2.  any backend P2 (which may or may not be P1) issues ReadBuffer or
    similar, requesting access for read/pin to the buffer containing R/B.
    This backend discovers that the buffer is aio_in_progress, calls
    check_completion(pg_buf_aiocb), and waits (effectively on the librt
    thread) if not complete.

3.  any number of other backends may concurrently do the same as 2, i.e. if
    the pg_buf_aiocb is still aio-in-progress, they all also wait on the
    librt thread.

4.  each waiting backend receives the completion, and the last one does the
    housekeeping and returns the pg_buf_aiocb.

What complicates it is managing the associated pinned buffer in such a way
that every backend takes the correct action with the correct degree of
serialization of the buffer descriptor during critical sections, while still
allowing all backends in 3. above to concurrently wait/check. After quite a
lot of testing I think I now have this correct. ("I just found the *last*
bug!" :-)
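To make the above a bit more concrete, here is a bare-bones sketch of the
prefetch and completion-check halves as they map onto the librt calls. The
PgBufAiocb struct and the two functions are only stand-ins for the shared
control block and routines described in steps 1 and 2 -- the real thing
lives in shared memory and carries all the buffer-descriptor pinning and
locking that the last paragraph talks about, which this sketch omits
entirely:

    #include <aio.h>
    #include <errno.h>
    #include <string.h>

    typedef struct PgBufAiocb
    {
        struct aiocb    aiocb;              /* librt's control block */
        volatile int    aio_in_progress;    /* set while the read is queued */
    } PgBufAiocb;

    /* step 1: backend P1 initiates the prefetch and returns immediately */
    static int
    start_prefetch(PgBufAiocb *pgaiocb, int fd, void *buffer,
                   size_t blcksz, off_t offset)
    {
        memset(&pgaiocb->aiocb, 0, sizeof(pgaiocb->aiocb));
        pgaiocb->aiocb.aio_fildes = fd;
        pgaiocb->aiocb.aio_buf = buffer;
        pgaiocb->aiocb.aio_nbytes = blcksz;
        pgaiocb->aiocb.aio_offset = offset;
        pgaiocb->aio_in_progress = 1;
        return aio_read(&pgaiocb->aiocb);   /* librt hands it to a thread */
    }

    /* steps 2/3: any backend needing the block waits for completion before
     * touching the buffer (in reality the last waiter does the housekeeping)
     */
    static ssize_t
    check_completion(PgBufAiocb *pgaiocb)
    {
        const struct aiocb *list[1] = { &pgaiocb->aiocb };
        int         err;

        while ((err = aio_error(&pgaiocb->aiocb)) == EINPROGRESS)
            aio_suspend(list, 1, NULL);     /* wait on the librt thread */

        pgaiocb->aio_in_progress = 0;
        return (err == 0) ? aio_return(&pgaiocb->aiocb) : -1;
    }

(On older glibc this needs linking with -lrt.)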
John