Re: Extended Prefetching using Asynchronous IO - proposal and patch - Mailing list pgsql-hackers

From Claudio Freire
Subject Re: Extended Prefetching using Asynchronous IO - proposal and patch
Msg-id CAGTBQpbw56ygQbuZ0tWGbECWwxsmsEq-kZxNvfmvYsC_6u3wPg@mail.gmail.com
In response to Re: Extended Prefetching using Asynchronous IO - proposal and patch  (Greg Stark <stark@mit.edu>)
Responses Re: Extended Prefetching using Asynchronous IO - proposal and patch  (John Lumby <johnlumby@hotmail.com>)
List pgsql-hackers
On Thu, Jun 19, 2014 at 2:49 PM, Greg Stark <stark@mit.edu> wrote:
> I don't think the threaded implementation on Linux is the one to use
> though. I find this *super* confusing but the kernel definitely
> supports aio syscalls, glibc also has a threaded implementation it
> uses if run on a kernel that doesn't implement the syscalls, and I
> think there are existing libaio and librt libraries from outside glibc
> that do one or the other. Which you build against seems to make a big
> difference. My instinct is that anything but the kernel native
> implementation will be worthless. The overhead of thread communication
> will completely outweigh any advantage over posix_fadvise's partial
> win.

What overhead?

The only communication is through a "done" bit and associated
synchronization structure when *and only when* you want to wait on it.

Furthermore, posix_fadvise is braindead for this use case; been there,
done that. What you win with threads is better postgres-kernel
interaction: even if you lose some CPU performance, it's going to beat
posix_fadvise by a large margin.


> The main advantage of posix aio is that we can actually receive the
> data out of order. With posix_fadvise we can get the i/o and cpu
> overlap but we will never process the later blocks until the earlier
> requests are satisfied and processed in order. With aio you could do a
> sequential scan, initiating i/o on 1,000 blocks and then processing
> them as they arrive, initiating new requests as those blocks are
> handled.

And each and every I/O you issue with it counts as such on the kernel side.

That's not the case with posix_fadvise, mind you, and that's terribly
damaging for performance.
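The out-of-order processing Greg describes maps fairly directly onto the POSIX AIO list interface. A rough sketch, assuming plain librt calls (the function name and the fixed request count are illustrative, not from the patch):

```c
#include <aio.h>
#include <errno.h>
#include <string.h>

enum { NREQ = 4, BLCKSZ = 8192 };   /* illustrative sizes */

/* Issue NREQ reads at once, then handle completions in whatever
   order they arrive: wait with aio_suspend, then reap any request
   whose aio_error is no longer EINPROGRESS. */
int read_out_of_order(int fd, const off_t *offsets,
                      char bufs[NREQ][BLCKSZ])
{
    struct aiocb cbs[NREQ];
    struct aiocb *list[NREQ];
    const struct aiocb *wait_list[NREQ];
    int pending = NREQ;

    for (int i = 0; i < NREQ; i++) {
        memset(&cbs[i], 0, sizeof cbs[i]);
        cbs[i].aio_fildes = fd;
        cbs[i].aio_buf = bufs[i];
        cbs[i].aio_nbytes = BLCKSZ;
        cbs[i].aio_offset = offsets[i];
        cbs[i].aio_lio_opcode = LIO_READ;
        list[i] = &cbs[i];
        wait_list[i] = &cbs[i];
    }
    if (lio_listio(LIO_NOWAIT, list, NREQ, NULL) != 0)
        return -1;

    while (pending > 0) {
        /* NULL entries in the wait list are ignored by aio_suspend. */
        if (aio_suspend(wait_list, NREQ, NULL) != 0 && errno != EINTR)
            return -1;
        for (int i = 0; i < NREQ; i++) {
            if (wait_list[i] != NULL && aio_error(&cbs[i]) != EINPROGRESS) {
                if (aio_return(&cbs[i]) < 0)
                    return -1;
                /* process bufs[i] here, regardless of request order */
                wait_list[i] = NULL;
                pending--;
            }
        }
    }
    return 0;
}
```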

> When I investigated this I found the buffer manager's I/O bits seemed
> to already be able to represent the state we needed (i/o initiated on
> this buffer but not completed). The problem was in ensuring that a
> backend would process the i/o completion promptly when it might be in
> the midst of handling other tasks and might even get an elog() stack
> unwinding. The interface that actually fits Postgres best might be the
> threaded interface (orthogonal to the threaded implementation
> question) which is you give aio a callback which gets called on a
> separate thread when the i/o completes. The alternative is you give
> aio a list of operation control blocks and it tells you the state of
> all the i/o operations. But it's not clear to me how you arrange to do
> that regularly, promptly, and reliably.

Indeed, we've been musing about using librt's support for completion
callbacks for that.

> The other gotcha here is that the kernel implementation only does
> anything useful on DIRECT_IO files. That means you have to do *all*
> the prefetching and i/o scheduling yourself.

If that's the case, we should discard kernel-based implementations and
stick to thread-based ones. Postgres cannot do scheduling as
efficiently as the kernel, and it shouldn't try.

> You would be doing that
> anyways for sequential scans and bitmap scans -- and we already do it
> with things like synchronised scans and posix_fadvise

That only works because the patterns are semi-sequential. If you try
to schedule random access, it becomes effectively impossible without
hardware info.

The kernel is the one with hardware info.

> Finally, when I did the posix_fadvise work I wrote a synthetic
> benchmark for testing the equivalent i/o pattern of a bitmap scan. It
> let me simulate bitmap scans of varying densities with varying
> parameters, notably how many i/o to keep in flight at once. It
> supported posix_fadvise or aio. You should look it up in the archives,
> it made for some nice looking graphs. IIRC I could not find any build
> environment where aio offered any performance boost at all. I think
> this means I just didn't know how to build it against the right
> libraries or wasn't using the right kernel or there was some skew
> between them at the time.

If it's old, it's probable you didn't hit the kernel's braindeadness,
since it was introduced somewhat recently (before 3.x, I believe).

Even if you did hit it, bitmap heap scans are blessed with sequential
ordering. The technique doesn't work nearly as well with random I/O
that might be sorted from time to time.

When traversing an index, the access pattern is mostly sequential due
to physical correlation, but not completely sequential. Not only that:
because you have to assume random I/O, and because the scan might be
aborted at any point, you don't read as far ahead as you could if you
knew it was sequential or a full scan. That kills performance. You
don't fetch enough ahead of time to avoid stalls, and the kernel
doesn't do read-ahead either, because posix_fadvise effectively
disables it; the result is the equivalent of direct I/O with bad
scheduling.

Solving this for index scans isn't just a little more complex. It's
insanely more complex, because you need hardware information to do it
right. How many spindles, how many sectors per cylinder if it's
rotational, how big the segments if it's flash, etc, etc... all stuff
hidden away inside the kernel.

It's not a good idea to try to do the kernel's job. AIO, even
threaded, lets you avoid that.

If you still have the benchmark around, I suggest you shuffle the
sectors a little bit (but not fully) and try them with semi-random
I/O.
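By "shuffle a little bit" I mean something like a windowed shuffle, where each block can move only a bounded distance from its original position, so the order is semi-random rather than fully random. A sketch (the function and its parameters are made up for illustration):

```c
#include <stdlib.h>

/* Partially shuffle a block list: at each position, swap with a
   random element at most `window` slots ahead.  Small windows keep
   the order semi-sequential; large windows approach a full shuffle. */
void semi_shuffle(long *blocks, int n, int window, unsigned seed)
{
    for (int i = 0; i < n; i++) {
        int span = (n - i - 1 < window) ? (n - i - 1) : window;
        if (span > 0) {
            int j = i + (int) (rand_r(&seed) % (span + 1));
            long tmp = blocks[i];
            blocks[i] = blocks[j];
            blocks[j] = tmp;
        }
    }
}
```

Feeding the benchmark a list permuted this way should expose the difference between fadvise-driven readahead and real AIO on the access patterns index scans actually produce.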


