Re: Extended Prefetching using Asynchronous IO - proposal and patch - Mailing list pgsql-hackers
From: Greg Stark
Subject: Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:
Msg-id: CAM-w4HNr0FOfYXa74tQ8=pdZLt7dP=S4tdGPhXu+=2vWWd23Ng@mail.gmail.com
In response to: Re: Extended Prefetching using Asynchronous IO - proposal and patch (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses: Re: Extended Prefetching using Asynchronous IO - proposal and patch
List: pgsql-hackers
On Wed, May 28, 2014 at 2:19 PM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote:
> How portable is POSIX aio nowadays? Googling around, it still seems that on
> Linux, it's implemented using threads. Does the thread-emulation
> implementation cause problems with the rest of the backend, which assumes
> that there is only a single thread? In any case, I think we'll want to
> encapsulate the AIO implementation behind some kind of an API, to allow
> other implementations to co-exist.

I think POSIX aio is pretty damn standard, and it's a pretty fiddly interface. If we abstract it behind a generic i/o interface we're going to lose a lot of the power. Abstracting it behind a set of buffer manager operations (initiate i/o on buffer, complete i/o on buffer, abort i/o on buffer) should be fine, but that's basically what we have, no?

I don't think the threaded implementation on Linux is the one to use though. I find this *super* confusing, but the kernel definitely supports aio syscalls, glibc also has a threaded implementation it uses if run on a kernel that doesn't implement the syscalls, and I think there are existing libaio and librt libraries from outside glibc that do one or the other. Which one you build against seems to make a big difference. My instinct is that anything but the kernel-native implementation will be worthless: the overhead of thread communication will completely outweigh any advantage over posix_fadvise's partial win.

The main advantage of POSIX aio is that we can actually receive the data out of order. With posix_fadvise we can get the i/o and cpu overlap, but we will never process the later blocks until the earlier requests are satisfied and processed in order. With aio you could do a sequential scan, initiating i/o on 1,000 blocks and then processing them as they arrive, initiating new requests as those blocks are handled.

When I investigated this I found the buffer manager's I/O bits seemed to already be able to represent the state we needed (i/o initiated on this buffer but not completed). The problem was ensuring that a backend would process the i/o completion promptly when it might be in the midst of handling other tasks and might even hit an elog() stack unwind.

The interface that actually fits Postgres best might be the threaded interface (orthogonal to the threaded-implementation question), where you give aio a callback that gets called on a separate thread when the i/o completes. The alternative is to give aio a list of operation control blocks and have it tell you the state of all the i/o operations, but it's not clear to me how you arrange to do that regularly, promptly, and reliably.

The other gotcha here is that the kernel implementation only does anything useful on O_DIRECT files. That means you have to do *all* the prefetching and i/o scheduling yourself. You would be doing that anyway for sequential scans and bitmap scans -- and we already do it with things like synchronised scans and posix_fadvise -- but index scans would need to get some intelligence about when it makes sense to read more than one page at a time. It might be possible to do something fairly coarse, like having our i/o operators keep track of how often i/o on a relation falls within a certain number of blocks of an earlier i/o and autotuning the number of blocks to read based on that. It might not be hard to do better than the kernel with even basic information like what level of the index we're reading or what type of pointer we're following.
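
To make the out-of-order point concrete, here is a minimal standalone sketch, not backend code, with the file name, block size, and request count all made up: it queues a batch of reads through the POSIX aio interface and consumes completions in whatever order they finish, and it shows the buffer alignment O_DIRECT insists on. Whether the requests end up in the kernel aio path or in glibc's thread pool depends on what you build against, which is exactly the confusion described above.

    /* build with something like: cc -O2 aio_batch.c (add -lrt on older glibc) */
    #define _GNU_SOURCE             /* for O_DIRECT */
    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define BLOCKSZ 8192
    #define NREQS   32

    int
    main(int argc, char **argv)
    {
        struct aiocb cbs[NREQS];
        const struct aiocb *wait_list[NREQS];
        int         pending = 0;
        int         fd;

        if (argc < 2)
            return 1;
        fd = open(argv[1], O_RDONLY | O_DIRECT);
        if (fd < 0)
        {
            perror("open");
            return 1;
        }

        memset(cbs, 0, sizeof(cbs));
        for (int i = 0; i < NREQS; i++)
        {
            void   *buf;

            /* O_DIRECT wants aligned buffers, offsets and lengths */
            if (posix_memalign(&buf, 4096, BLOCKSZ) != 0)
                return 1;

            cbs[i].aio_fildes = fd;
            cbs[i].aio_buf = buf;
            cbs[i].aio_nbytes = BLOCKSZ;
            cbs[i].aio_offset = (off_t) i * BLOCKSZ;

            if (aio_read(&cbs[i]) != 0)
            {
                perror("aio_read");
                return 1;
            }
            wait_list[i] = &cbs[i];
            pending++;
        }

        /* Consume completions as they arrive, not in submission order. */
        while (pending > 0)
        {
            if (aio_suspend(wait_list, NREQS, NULL) != 0 && errno != EINTR)
                break;

            for (int i = 0; i < NREQS; i++)
            {
                ssize_t     nread;

                if (wait_list[i] == NULL || aio_error(&cbs[i]) == EINPROGRESS)
                    continue;

                nread = aio_return(&cbs[i]);
                printf("block %d finished (%zd bytes)\n", i, nread);
                /* ... this is where the block would actually be processed ... */
                wait_list[i] = NULL;    /* NULL entries are ignored by aio_suspend */
                pending--;
            }
        }
        return 0;
    }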
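The callback flavour looks roughly like the fragment below; on_complete and start_read are invented names and this is only an illustration. The sigevent in the control block asks the implementation to run a function on a separate thread when the request finishes, which is precisely the part that sits uneasily with a single-threaded backend.

    #include <aio.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>

    /* Runs on a thread created by the aio implementation, not the caller. */
    static void
    on_complete(union sigval sv)
    {
        struct aiocb *cb = sv.sival_ptr;

        printf("read at offset %lld finished: %zd bytes\n",
               (long long) cb->aio_offset, aio_return(cb));
    }

    static int
    start_read(struct aiocb *cb, int fd, void *buf, size_t len, off_t off)
    {
        memset(cb, 0, sizeof(*cb));
        cb->aio_fildes = fd;
        cb->aio_buf = buf;
        cb->aio_nbytes = len;
        cb->aio_offset = off;

        /* Ask for thread notification rather than a signal or polling. */
        cb->aio_sigevent.sigev_notify = SIGEV_THREAD;
        cb->aio_sigevent.sigev_notify_function = on_complete;
        cb->aio_sigevent.sigev_value.sival_ptr = cb;

        return aio_read(cb);
    }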
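The coarse autotuning idea could be as crude as the following toy sketch, in which every name and constant is invented and nothing corresponds to existing code: remember the last few block numbers requested, widen the read-ahead window when a new request lands near a recent one, and shrink it when it doesn't.

    #define HISTORY        16   /* recent block numbers remembered */
    #define NEAR_BLOCKS    32   /* "close enough" to count as clustered */
    #define MAX_READAHEAD  64   /* cap on blocks requested per i/o */

    typedef struct PrefetchState
    {
        unsigned    recent[HISTORY];
        int         next;           /* next history slot to overwrite */
        int         readahead;      /* current window, in blocks */
    } PrefetchState;

    static int
    is_near_recent(const PrefetchState *st, unsigned blkno)
    {
        for (int i = 0; i < HISTORY; i++)
        {
            long    d;

            if (st->recent[i] == 0)     /* unused slot (toy simplification) */
                continue;
            d = (long) st->recent[i] - (long) blkno;
            if (d < 0)
                d = -d;
            if (d <= NEAR_BLOCKS)
                return 1;
        }
        return 0;
    }

    /* Record this request and return how many blocks to read for it. */
    static int
    advise_readahead(PrefetchState *st, unsigned blkno)
    {
        if (st->readahead < 1)
            st->readahead = 1;

        if (is_near_recent(st, blkno))
            st->readahead = st->readahead * 2 > MAX_READAHEAD
                ? MAX_READAHEAD : st->readahead * 2;
        else
            st->readahead = st->readahead > 1 ? st->readahead / 2 : 1;

        st->recent[st->next] = blkno;
        st->next = (st->next + 1) % HISTORY;
        return st->readahead;
    }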
Finally, when I did the posix_fadvise work I wrote a synthetic benchmark for testing the equivalent i/o pattern of a bitmap scan. It let me simulate bitmap scans of varying densities with varying parameters, notably how many i/os to keep in flight at once, and it supported either posix_fadvise or aio. You should look it up in the archives; it made for some nice-looking graphs. IIRC I could not find any build environment where aio offered any performance boost at all. I think this means I just didn't know how to build it against the right libraries, or wasn't using the right kernel, or there was some skew between them at the time.

-- greg
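
For anyone who wants the shape of it without digging, the posix_fadvise side of that kind of synthetic benchmark looks roughly like the sketch below; block size, prefetch depth, and the block list are simulation parameters here, and the original tool in the archives is more elaborate. Advice is issued a fixed number of blocks ahead, but the reads themselves still complete strictly in order, which is the limitation aio was supposed to remove.

    #define _GNU_SOURCE             /* for posix_fadvise on older glibc */
    #include <fcntl.h>
    #include <unistd.h>

    #define BLOCKSZ 8192

    /*
     * Simulate a bitmap-scan-like pattern: `blocks` is the sorted list of
     * block numbers the scan wants, `depth` is how many i/os worth of
     * advice to keep in flight ahead of the synchronous reads.
     */
    static void
    scan_with_fadvise(int fd, const long *blocks, int nblocks, int depth)
    {
        char        buf[BLOCKSZ];

        /* Prime the window before the first read. */
        for (int i = 0; i < depth && i < nblocks; i++)
            (void) posix_fadvise(fd, (off_t) blocks[i] * BLOCKSZ, BLOCKSZ,
                                 POSIX_FADV_WILLNEED);

        for (int i = 0; i < nblocks; i++)
        {
            /* Keep advice `depth` blocks ahead of the read pointer. */
            if (i + depth < nblocks)
                (void) posix_fadvise(fd, (off_t) blocks[i + depth] * BLOCKSZ,
                                     BLOCKSZ, POSIX_FADV_WILLNEED);

            /* The read itself; these complete strictly in order. */
            (void) pread(fd, buf, BLOCKSZ, (off_t) blocks[i] * BLOCKSZ);
        }
    }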