Re: Extended Prefetching using Asynchronous IO - proposal and patch - Mailing list pgsql-hackers
From:           Claudio Freire
Subject:        Re: Extended Prefetching using Asynchronous IO - proposal and patch
Date:
Msg-id:         CAGTBQpbw56ygQbuZ0tWGbECWwxsmsEq-kZxNvfmvYsC_6u3wPg@mail.gmail.com
In response to: Re: Extended Prefetching using Asynchronous IO - proposal and patch (Greg Stark <stark@mit.edu>)
Responses:      Re: Extended Prefetching using Asynchronous IO - proposal and patch
List:           pgsql-hackers
On Thu, Jun 19, 2014 at 2:49 PM, Greg Stark <stark@mit.edu> wrote:
> I don't think the threaded implementation on Linux is the one to use
> though. I find this *super* confusing but the kernel definitely
> supports aio syscalls, glibc also has a threaded implementation it
> uses if run on a kernel that doesn't implement the syscalls, and I
> think there are existing libaio and librt libraries from outside glibc
> that do one or the other. Which you build against seems to make a big
> difference. My instinct is that anything but the kernel native
> implementation will be worthless. The overhead of thread communication
> will completely outweigh any advantage over posix_fadvise's partial
> win.

What overhead? The only communication is through a "done" bit and an
associated synchronization structure, touched when *and only when* you
want to wait on it.

Furthermore, posix_fadvise is braindead for this use case; been there,
done that. What you win with threads is better postgres-kernel
interaction: even if you lose some CPU performance, it's still going to
beat posix_fadvise by a large margin.

> The main advantage of posix aio is that we can actually receive the
> data out of order. With posix_fadvise we can get the i/o and cpu
> overlap but we will never process the later blocks until the earlier
> requests are satisfied and processed in order. With aio you could do a
> sequential scan, initiating i/o on 1,000 blocks and then processing
> them as they arrive, initiating new requests as those blocks are
> handled.

And each and every I/O you issue with it counts as such on the kernel
side. That's not the case with posix_fadvise, mind you, and that's
terribly damaging for performance.

> When I investigated this I found the buffer manager's I/O bits seemed
> to already be able to represent the state we needed (i/o initiated on
> this buffer but not completed). The problem was in ensuring that a
> backend would process the i/o completion promptly when it might be in
> the midst of handling other tasks and might even get an elog() stack
> unwinding. The interface that actually fits Postgres best might be the
> threaded interface (orthogonal to the threaded implementation
> question) which is you give aio a callback which gets called on a
> separate thread when the i/o completes. The alternative is you give
> aio a list of operation control blocks and it tells you the state of
> all the i/o operations. But it's not clear to me how you arrange to do
> that regularly, promptly, and reliably.

Indeed, we've been musing about using librt's support for completion
callbacks for exactly that.
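For what it's worth, a minimal sketch of what that callback interface
looks like with glibc/librt is below (untested; the file name and block
size are placeholders, and you link with -lrt): aio_read() is issued
with SIGEV_THREAD notification, and the callback, which librt runs on a
separate thread when the request completes, is where we'd flip the
buffer's "done" bit and wake anyone waiting on it.

#define _XOPEN_SOURCE 600

#include <aio.h>
#include <fcntl.h>
#include <semaphore.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ 8192            /* placeholder block size */

static sem_t io_done;          /* stand-in for the buffer's "done" bit */

/* Runs on a thread librt manages, once the request completes. */
static void
completion_cb(union sigval sv)
{
    struct aiocb *cb = sv.sival_ptr;

    if (aio_error(cb) == 0)
    {
        ssize_t nread = aio_return(cb);

        printf("read %zd bytes at offset %lld\n",
               nread, (long long) cb->aio_offset);
    }
    sem_post(&io_done);        /* wake whoever chose to wait on this I/O */
}

int
main(void)
{
    static char   buf[BLCKSZ];
    struct aiocb  cb;
    int           fd = open("testfile", O_RDONLY);   /* placeholder file */

    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    sem_init(&io_done, 0, 0);

    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = BLCKSZ;
    cb.aio_offset = 0;

    /* Ask for a threaded completion callback instead of a signal. */
    cb.aio_sigevent.sigev_notify = SIGEV_THREAD;
    cb.aio_sigevent.sigev_notify_function = completion_cb;
    cb.aio_sigevent.sigev_value.sival_ptr = &cb;

    if (aio_read(&cb) != 0)
    {
        perror("aio_read");
        return 1;
    }

    /* ... do other work; synchronize only when we actually need the block */
    sem_wait(&io_done);

    close(fd);
    return 0;
}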
> The other gotcha here is that the kernel implementation only does
> anything useful on DIRECT_IO files. That means you have to do *all*
> the prefetching and i/o scheduling yourself.

If that's the case, we should discard kernel-based implementations and
stick to thread-based ones. Postgres cannot do scheduling as
efficiently as the kernel, and it shouldn't try.

> You would be doing that
> anyways for sequential scans and bitmap scans -- and we already do it
> with things like synchronised scans and posix_fadvise

That only works because those patterns are semi-sequential. If you try
to schedule random access, it becomes effectively impossible without
hardware info, and the kernel is the one with the hardware info.

> Finally, when I did the posix_fadvise work I wrote a synthetic
> benchmark for testing the equivalent i/o pattern of a bitmap scan. It
> let me simulate bitmap scans of varying densities with varying
> parameters, notably how many i/o to keep in flight at once. It
> supported posix_fadvise or aio. You should look it up in the archives,
> it made for some nice looking graphs. IIRC I could not find any build
> environment where aio offered any performance boost at all. I think
> this means I just didn't know how to build it against the right
> libraries or wasn't using the right kernel or there was some skew
> between them at the time.

If it's old, it's probable you didn't hit the kernel's braindeadness,
since it was introduced somewhat recently (before 3.x, I believe).

Even if you did hit it, bitmap heap scans are blessed with sequential
ordering. The technique doesn't work nearly as well with random I/O
that is only sorted from time to time: when traversing an index, the
access pattern is mostly sequential due to physical correlation, but
not completely so.

Not only that, but under the assumption of random I/O, and with the
uncertainty of when the scan will be aborted, you don't read ahead as
much as you could if you knew it was sequential or a full scan. That
kills performance: you don't fetch enough ahead of time to avoid
stalls, and the kernel doesn't do read-ahead either, because
posix_fadvise effectively disables it, so you end up with the
equivalent of direct I/O with bad scheduling.

Solving this for index scans isn't just a little more complex. It's
insanely more complex, because you need hardware information to do it
right: how many spindles, how many sectors per cylinder if it's
rotational, how big the segments if it's flash, etc., etc. ... all
stuff hidden away inside the kernel. It's not a good idea to try to do
the kernel's job. AIO, even threaded, lets you avoid that.

If you still have the benchmark around, I suggest you shuffle the
sectors a little bit (but not fully) and try it with semi-random I/O.
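To make that concrete, the access pattern I'm suggesting is roughly the
sketch below (illustrative only, not your benchmark; the file name,
block count, shuffle window and prefetch depth are all made up): keep a
fixed number of posix_fadvise() hints in flight over a mostly-but-not-
fully ordered block list, while still consuming blocks strictly in
order, which is exactly the part AIO would let you relax.

#define _XOPEN_SOURCE 600

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ 8192                     /* placeholder block size */

int
main(void)
{
    long    nblocks = 100000;           /* placeholder scan size */
    int     depth = 16;                 /* how many hints to keep in flight */
    long   *blocks;
    char    buf[BLCKSZ];
    int     fd = open("testfile", O_RDONLY);    /* placeholder test file */

    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* Mostly-ordered block list: shuffled "a little bit, but not fully". */
    blocks = malloc(nblocks * sizeof(long));
    for (long i = 0; i < nblocks; i++)
        blocks[i] = i;
    for (long i = 0; i + 8 < nblocks; i++)
    {
        long j = i + rand() % 8;        /* displace only within a small window */
        long tmp = blocks[i];

        blocks[i] = blocks[j];
        blocks[j] = tmp;
    }

    for (long i = 0; i < nblocks; i++)
    {
        /* Hint the kernel about the block we'll need "depth" iterations from now. */
        if (i + depth < nblocks)
            posix_fadvise(fd, (off_t) blocks[i + depth] * BLCKSZ, BLCKSZ,
                          POSIX_FADV_WILLNEED);

        /*
         * Blocks are still consumed strictly in list order; if the hinted
         * read hasn't finished, we stall right here.  With aio_read() the
         * completed blocks could be processed as they arrive instead.
         */
        if (pread(fd, buf, BLCKSZ, (off_t) blocks[i] * BLCKSZ) != BLCKSZ)
        {
            perror("pread");
            return 1;
        }

        /* ... process the block ... */
    }

    free(blocks);
    close(fd);
    return 0;
}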