Memory prefetching while sequentially fetching from SortTuple array, tuplestore - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Memory prefetching while sequentially fetching from SortTuple array, tuplestore |
Date | |
Msg-id | CAM3SWZRR-9XX3dk+G_B0U9qLgV9jory1YFWr7YmsAeF3feL_Fg@mail.gmail.com Whole thread Raw |
Responses |
Re: Memory prefetching while sequentially fetching from SortTuple
array, tuplestore
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore |
List | pgsql-hackers |
I've always thought that GCC intrinsics for software-based memory prefetching are a very sharp tool. While it's really hard to use GCC's __builtin_prefetch() effectively (I've tried enough times to know), I always thought that there'd probably end up being a handful of cases where using it presented a clear and worthwhile win. Hash joins seemed to me to be a likely candidate, because memory latency is a real killer there. However, I stumbled upon another case where prefetching pays clear dividends while actually being quite simple: sequentially reading an already-sorted memtuples array (an array of SortTuple structs), during some kind of post-processing. This comes up for many common and important cases, including regular sort nodes (heaptuple sorts), datumsorts with pass-by-reference types (e.g. ordered set aggregates), and B-Tree index builds, where the sequential fetching occurs as actual B-Tree leaf pages are built from the array. Patch ===== Attached patch adds a portable Postgres wrapper on the GCC intrinsic. It also adds a client within tuplesort.c -- a common function that seems like a good generic choke point. I can get a speed-up of 6% - 9% for all of these cases by prefetching a few SortTuples ahead, for the "tuple proper". (In-memory sorts only.) ISTM that this alone makes the patch worthwhile, but we should still consider the break-down of costs in each of these operations, and that the cost of the sequential reading and processing of SortTuples is all that has been touched by the patch. There are only about 3 real lines of code added to tuplesort. Obviously, the dominant costs within each of these queries/utility commands are the sort proper, and to a lesser extent the heap scan preceding it. I am not targeting those at all -- I'm only targeting the tertiary cost of sequentially scanning the memtuples array by hiding memory latency, and yet I end up with a notable overall speed-up. It's only really fair to look at this final sequential fetching cost in isolation. The reduction in that cost in isolation seems to be huge. I tried out the proprietary Intel vTune Amplifier utility with this case, and it indicated that CPU time spent in calls to _bt_buildadd was less than 1/8 the previous time in a comparable test of the master branch, with no other changes that I can attribute to this patch (i.e. there is a bit of noise). You can easily get some hint of the size of the improvement by reviewing "trace_sort = on" log output, and the reduction of time spent after "performsort" is done but before the sort is shut down. Tuplestores ---------------- tuplestore.c has a similar optimization added, on the theory that it should match tuplesort.c wherever possible, and because it seems to help a lot there too. For example, queries that perform a big CTE Scan seem to benefit to about the same degree (although, again, only when the array is fully in memory). Portability concerns =============== I started writing this patch with an intuition that this would help. I arrived at the fixed-up version posted here through frobbing the code based on the feedback of Postgres running on my laptop. I iteratively tweaked the number of tuples ahead to fetch from, and the __builtin_prefetch() temporal locality argument's value, which helped a little for a number of test cases. If that doesn't inspire much confidence in this patch, then good -- that's the idea. Obviously if we are going to commit this, there needs to be testing of a reasonably representative selection of platforms. On the other hand, I couldn't find a case that was regressed, and I am encouraged by the fact that this helps the post-processing of SortTuples so effectively. I imagine that some platform would have very different performance characteristics in order for this to be the wrong thing. However, it seems possible that there is a case that has been regressed, since the prefetch is added at a very generic point within tuplesort.c and tuplestore.c. -- Peter Geoghegan
Attachment
pgsql-hackers by date: