Thread: Memory prefetching while sequentially fetching from SortTuple array, tuplestore
From
Peter Geoghegan
Date:
I've always thought that GCC intrinsics for software-based memory prefetching are a very sharp tool. While it's really hard to use GCC's __builtin_prefetch() effectively (I've tried enough times to know), I always thought that there'd probably end up being a handful of cases where using it presented a clear and worthwhile win. Hash joins seemed to me to be a likely candidate, because memory latency is a real killer there. However, I stumbled upon another case where prefetching pays clear dividends while actually being quite simple: sequentially reading an already-sorted memtuples array (an array of SortTuple structs) during some kind of post-processing. This comes up for many common and important cases, including regular sort nodes (heaptuple sorts), datum sorts with pass-by-reference types (e.g. ordered set aggregates), and B-Tree index builds, where the sequential fetching occurs as actual B-Tree leaf pages are built from the array.

Patch
=====

Attached patch adds a portable Postgres wrapper on the GCC intrinsic. It also adds a client within tuplesort.c -- a common function that seems like a good generic choke point. I can get a speed-up of 6% - 9% for all of these cases by prefetching a few SortTuples ahead, for the "tuple proper". (In-memory sorts only.) ISTM that this alone makes the patch worthwhile, but we should still consider the break-down of costs in each of these operations, given that the cost of sequentially reading and processing SortTuples is all that the patch touches. There are only about 3 real lines of code added to tuplesort.c. Obviously, the dominant costs within each of these queries/utility commands are the sort proper, and to a lesser extent the heap scan preceding it. I am not targeting those at all -- I'm only targeting the tertiary cost of sequentially scanning the memtuples array, by hiding memory latency, and yet I end up with a notable overall speed-up.

It's only really fair to look at this final sequential fetching cost in isolation, and the reduction in that cost in isolation seems to be huge. I tried out the proprietary Intel VTune Amplifier utility with this case, and it indicated that CPU time spent in calls to _bt_buildadd() was less than 1/8 the previous time in a comparable test of the master branch, with no other changes that I can attribute to this patch (i.e. there is a bit of noise). You can easily get some hint of the size of the improvement by reviewing "trace_sort = on" log output, and the reduction in time spent after "performsort" is done but before the sort is shut down.

Tuplestores
----------------

tuplestore.c has a similar optimization added, on the theory that it should match tuplesort.c wherever possible, and because it seems to help a lot there too. For example, queries that perform a big CTE Scan seem to benefit to about the same degree (although, again, only when the array is fully in memory).

Portability concerns
===============

I started writing this patch with an intuition that this would help, and arrived at the fixed-up version posted here by frobbing the code based on the feedback of Postgres running on my laptop. I iteratively tweaked the number of tuples to fetch ahead, and the __builtin_prefetch() temporal locality argument's value, which helped a little for a number of test cases. If that doesn't inspire much confidence in this patch, then good -- that's the idea. Obviously, if we are going to commit this, there needs to be testing of a reasonably representative selection of platforms. On the other hand, I couldn't find a case that was regressed, and I am encouraged by how effectively this helps the post-processing of SortTuples. A platform would have to have very different performance characteristics for this to be the wrong thing.

However, it seems possible that there is a case that has been regressed, since the prefetch is added at a very generic point within tuplesort.c and tuplestore.c.

--
Peter Geoghegan
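[Editor's note: the actual patch is in the attachment below. As an illustration only, here is a minimal, self-contained sketch of the technique described above -- names follow the thread, but the SortTuple layout is simplified and the loop is hypothetical:]

```c
#include <stddef.h>

/*
 * Sketch of the portable wrapper: expands to the GCC/Clang intrinsic
 * where available, and to nothing elsewhere.  Not the actual patch.
 */
#if defined(__GNUC__) || defined(__clang__)
#define USE_MEM_PREFETCH
#define pg_rfetch(addr) __builtin_prefetch((addr), 0, 0)
#else
#define pg_rfetch(addr) ((void) 0)
#endif

typedef struct SortTuple
{
    void   *tuple;              /* the "tuple proper"; NULL for datum sorts */
} SortTuple;

/*
 * Sequentially fetch tuples, prefetching the tuple proper three elements
 * ahead.  The bounds check matters: reading memtuples[i + 3].tuple past
 * the end of the array would be undefined behavior, even though the
 * prefetch itself may safely target an invalid address.
 */
static long
fetch_all(SortTuple *memtuples, int memtupcount)
{
    long        total = 0;

    for (int i = 0; i < memtupcount; i++)
    {
#ifdef USE_MEM_PREFETCH
        if (i + 3 < memtupcount)
            pg_rfetch(memtuples[i + 3].tuple);
#endif
        total += *(int *) memtuples[i].tuple;   /* "process" the tuple */
    }
    return total;
}
```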
Attachment
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore
From
Peter Geoghegan
Date:
On Thu, Jul 16, 2015 at 4:01 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Attached patch adds a portable Postgres wrapper on the GCC intrinsic.
> It also adds a client within tuplesort.c -- a common function that
> seems like a good generic choke point. I can get a speed-up of 6% - 9%
> for all of these cases by prefetching a few SortTuples ahead, for the
> "tuple proper". (In-memory sorts only.)

I added a silly bug during last-minute clean-up. I attach a V2.

--
Peter Geoghegan
Attachment
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore
From
Tom Lane
Date:
Peter Geoghegan <pg@heroku.com> writes:
> Attached patch adds a portable Postgres wrapper on the GCC intrinsic.

Meh. I don't like the assumption that non-GCC compilers will be smart enough to optimize away the useless-to-them if() tests this adds. Please refactor that so that there is exactly 0 new code when the intrinsic doesn't exist.

regards, tom lane
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore
From
Peter Geoghegan
Date:
On Thu, Jul 16, 2015 at 8:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Meh. I don't like the assumption that non-GCC compilers will be smart
> enough to optimize away the useless-to-them if() tests this adds.
> Please refactor that so that there is exactly 0 new code when the
> intrinsic doesn't exist.

I imagined that there was some value in copying the GCC intrinsic's behavior, and actually evaluating the "addr" expression even in the event of no platform support. On reflection, I suppose that that isn't actually a particularly useful property for Postgres; there will only ever be a handful of callers.

The attached revision does not rely on such optimization occurring on platforms that lack __builtin_prefetch(). This allowed me to decouple availability from actual use, in the style of posix_fadvise(), so that one can manually disable memory prefetching within pg_config_manual.h.

Clang is compatible with the __builtin_prefetch() intrinsic, FWIW. I'm not sure if it's worth trying to make the wrapper portable across a variety of supported compilers. If we were to attempt it, we would not be the first. I note that ICC's memref_control has an identical interface to __builtin_prefetch().

--
Peter Geoghegan
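[Editor's note: the availability/use split described above might look roughly like the following sketch -- the symbol names follow the thread, but the exact placement in the patch's headers is an assumption:]

```c
/*
 * pg_config_manual.h (sketch): actual use of prefetching, which can be
 * disabled manually, in the style of posix_fadvise()/USE_POSIX_FADVISE.
 * HAVE__BUILTIN_PREFETCH is assumed to come from configure.
 */
#ifdef HAVE__BUILTIN_PREFETCH
#define USE_MEM_PREFETCH
#endif

/*
 * c.h or similar (sketch): the wrapper only exists when prefetching is
 * both available and enabled; callers guard uses with USE_MEM_PREFETCH.
 */
#ifdef USE_MEM_PREFETCH
#define pg_rfetch(addr) __builtin_prefetch((addr), 0, 0)
#endif
```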
Attachment
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore
From
Andres Freund
Date:
On 2015-07-19 16:34:52 -0700, Peter Geoghegan wrote:
> +# PGAC_C_BUILTIN_PREFETCH
> +# -------------------------
> +# Check if the C compiler understands __builtin_prefetch(),
> +# and define HAVE__BUILTIN_PREFETCH if so.
> +AC_DEFUN([PGAC_C_BUILTIN_PREFETCH],
> +[AC_CACHE_CHECK(for __builtin_prefetch, pgac_cv__builtin_prefetch,
> +[AC_COMPILE_IFELSE([AC_LANG_PROGRAM([],
> +[int i = 0;__builtin_prefetch(&i, 0, 3);])],
> +[pgac_cv__builtin_prefetch=yes],
> +[pgac_cv__builtin_prefetch=no])])
> +if test x"$pgac_cv__builtin_prefetch" = xyes ; then
> +AC_DEFINE(HAVE__BUILTIN_PREFETCH, 1,
> +          [Define to 1 if your compiler understands __builtin_prefetch.])
> +fi])# PGAC_C_BUILTIN_PREFETCH

Hm. Does a compiler test actually test anything reliably here? Won't this just throw a warning during compile time about an unknown function?

> +/*
> + * Prefetch support -- Support memory prefetching hints on some platforms.
> + *
> + * pg_rfetch() is specialized for the case where an array is accessed
> + * sequentially, and we can prefetch a pointer within the next element (or an
> + * even later element) in order to hide memory latency. This case involves
> + * prefetching addresses with low temporal locality. Note that it's rather
> + * difficult to get any kind of speedup using pg_rfetch(); any use of the
> + * intrinsic should be carefully tested. Also note that it's okay to pass it
> + * an invalid or NULL address, although it's best avoided.
> + */
> +#if defined(USE_MEM_PREFETCH)
> +#define pg_rfetch(addr) __builtin_prefetch((addr), 0, 0)
> +#endif

What is that name actually supposed to mean? 'rfetch' doesn't seem to be particularly clear -- or is it meant to be a wordplay combined with the p?

I don't think you should specify the read/write and locality parameters if we don't hand-pick them -- right now you're saying the memory will only be read and that it has no temporal locality.

I think there should be an empty fallback definition even if the current only caller ends up not needing it -- not all callers will require it.

> +        /*
> +         * Perform memory prefetch of tuple that's three places
> +         * ahead of current (which is returned to caller).
> +         * Testing shows that this significantly boosts the
> +         * performance for TSS_INMEM "forward" callers by
> +         * hiding memory latency behind their processing of
> +         * returned tuples.
> +         */
> +#ifdef USE_MEM_PREFETCH
> +        if (readptr->current + 3 < state->memtupcount)
> +            pg_rfetch(state->memtuples[readptr->current + 3]);
> +#endif

Hm. Why do we need a branch here? The targets of prefetches don't need to be valid addresses, and adding a bunch of arithmetic and a branch for the corner case doesn't sound worthwhile to me.

What worries me about adding explicit prefetching is that it's *highly* platform and even micro-architecture dependent. Why is looking three elements ahead the right value?

Greetings,

Andres Freund
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore
From
Peter Geoghegan
Date:
On Wed, Sep 2, 2015 at 7:12 AM, Andres Freund <andres@anarazel.de> wrote:
> Hm. Does a compiler test actually test anything reliably here? Won't this
> just throw a warning during compile time about an unknown function?

I'll need to look into that.

>> +/*
>> + * Prefetch support -- Support memory prefetching hints on some platforms.
>> + *
>> + * pg_rfetch() is specialized for the case where an array is accessed
>> + * sequentially, and we can prefetch a pointer within the next element (or an
>> + * even later element) in order to hide memory latency. This case involves
>> + * prefetching addresses with low temporal locality. Note that it's rather
>> + * difficult to get any kind of speedup using pg_rfetch(); any use of the
>> + * intrinsic should be carefully tested. Also note that it's okay to pass it
>> + * an invalid or NULL address, although it's best avoided.
>> + */
>> +#if defined(USE_MEM_PREFETCH)
>> +#define pg_rfetch(addr) __builtin_prefetch((addr), 0, 0)
>> +#endif
>
> What is that name actually supposed to mean? 'rfetch' doesn't seem to be
> particularly clear -- or is it meant to be a wordplay combined with the
> p?

"Read fetch". One argument passed to the intrinsic here specifies that the variable will be read only. I did things this way because I imagined that there would be very limited uses for the macro; I probably cover almost all interesting cases for explicit memory prefetching already.

> I don't think you should specify the read/write and locality parameters
> if we don't hand-pick them -- right now you're saying the memory will
> only be read and that it has no temporal locality.
>
> I think there should be an empty fallback definition even if the current
> only caller ends up not needing it -- not all callers will require it.

It started out that way, but Tom felt that it was better to have a USE_MEM_PREFETCH because of the branch below...

>> +#ifdef USE_MEM_PREFETCH
>> +        if (readptr->current + 3 < state->memtupcount)
>> +            pg_rfetch(state->memtuples[readptr->current + 3]);
>> +#endif
>
> Hm. Why do we need a branch here? The targets of prefetches don't need to
> be valid addresses, and adding a bunch of arithmetic and a branch for the
> corner case doesn't sound worthwhile to me.

...which is needed, because I'm still required not to dereference wild pointers. In other words, although pg_rfetch()/__builtin_prefetch() does not require that a valid pointer be passed, it is not okay to read past an array's bounds to obtain that pointer. The GCC docs are clear on this -- "Data prefetch does not generate faults if addr is invalid, but the address expression itself must be valid". Also, for cases that don't benefit (like datum sorts, which never have a "tuple proper" -- the above code is for tuplestore, not tuplesort, so you can't see this), it seemed faster to have the branch than to rely on __builtin_prefetch() being forgiving about invalid/wild pointers. I saw a regression without the branch for that case.

> What worries me about adding explicit prefetching is that it's *highly*
> platform and even micro-architecture dependent. Why is looking three
> elements ahead the right value?

Because that was the fastest value following testing on my laptop. You are absolutely right to point out that this isn't a good reason to go with the patch -- I share your concern. All I can say in defense of that is that other major system software does the same, without any regard for the underlying microarchitecture AFAICT.

So, yes, certainly, more testing is required across a reasonable cross-section of platforms to justify the patch. But right now, time spent in _bt_buildadd() is massively reduced with the patch, which makes it worthwhile in my mind. It's really very effective if you look at the only part of (say) a CREATE INDEX that is affected one way or another -- the final fetching of tuples.

BTW, I am very close to posting a patch that makes (say) CREATE INDEX very fast for external sorts. That uses a memory pool to make final on-the-fly merge memory access sequential per tape, which is what I'll do rather than explicitly prefetching for external sorts. It makes a huge difference on top of the external sort work already posted, because the final merge step was a big remaining bottleneck, which I've now fixed. An unlogged CREATE INDEX with 6 runs to merge is often only 15% slower than an equivalent CREATE INDEX that is all internal (against the master branch).

--
Peter Geoghegan
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore
From
Andres Freund
Date:
On 2015-09-02 12:24:33 -0700, Peter Geoghegan wrote:
> "Read fetch". One argument passed to the intrinsic here specifies that
> the variable will be read only. I did things this way because I
> imagined that there would be very limited uses for the macro; I
> probably cover almost all interesting cases for explicit memory
> prefetching already.

I'd be less brief in this case then; no need to be super short here.

>> I don't think you should specify the read/write and locality parameters
>> if we don't hand-pick them -- right now you're saying the memory will
>> only be read and that it has no temporal locality.
>>
>> I think there should be an empty fallback definition even if the current
>> only caller ends up not needing it -- not all callers will require it.
>
> It started out that way, but Tom felt that it was better to have a
> USE_MEM_PREFETCH because of the branch below...

That doesn't mean we shouldn't still provide an empty definition.

>> Hm. Why do we need a branch here? The targets of prefetches don't need to
>> be valid addresses, and adding a bunch of arithmetic and a branch for the
>> corner case doesn't sound worthwhile to me.
>
> ...which is needed, because I'm still required not to dereference wild
> pointers. In other words, although pg_rfetch()/__builtin_prefetch()
> does not require that a valid pointer be passed, it is not okay to
> read past an array's bounds to obtain that pointer. The GCC docs are
> clear on this -- "Data prefetch does not generate faults if addr is
> invalid, but the address expression itself must be valid".

That's just a question of how to formulate this, though?

pg_rfetch(((char *) state->memtuples) + 3 * sizeof(SortTuple) + offsetof(SortTuple, tuple))?

For something heavily platform dependent like this that seems ok.

> Because that was the fastest value following testing on my laptop. You
> are absolutely right to point out that this isn't a good reason to go
> with the patch -- I share your concern. All I can say in defense of
> that is that other major system software does the same, without any
> regard for the underlying microarchitecture AFAICT.

I know Linux stripped out most prefetches at some point, even from x86-specific code, because it was shown that they aged very badly. I.e. they removed a bunch of them and stuff got faster, whereas they were beneficial on earlier architectures.

Greetings,

Andres Freund
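[Editor's note: the branchless formulation suggested here can be sketched as follows. The names are hypothetical; note that it computes the address of the array element's 'tuple' field with pointer arithmetic only -- prefetching the tuple that the field *points at* would require reading the pointer, and hence the bounds check discussed above:]

```c
#include <stddef.h>

typedef struct SortTuple
{
    void   *tuple;              /* simplified, hypothetical layout */
} SortTuple;

/*
 * Branchless prefetch sketch: the address is computed by arithmetic and
 * never dereferenced, so the address expression stays valid even when
 * it lands past the end of the array -- and __builtin_prefetch()
 * tolerates an invalid prefetch target without faulting.
 */
static void
prefetch_ahead(SortTuple *memtuples, int current)
{
#if defined(__GNUC__) || defined(__clang__)
    char   *addr = (char *) memtuples
        + (size_t) (current + 3) * sizeof(SortTuple)
        + offsetof(SortTuple, tuple);

    __builtin_prefetch(addr, 0, 0);
#endif
}
```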
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore
From
Peter Geoghegan
Date:
On Wed, Sep 2, 2015 at 3:13 PM, Andres Freund <andres@anarazel.de> wrote:
> I'd be less brief in this case then, no need to be super short here.

Okay.

>> It started out that way, but Tom felt that it was better to have a
>> USE_MEM_PREFETCH because of the branch below...
>
> That doesn't mean we shouldn't still provide an empty definition.

Okay.

> That's just a question of how to formulate this though?
>
> pg_rfetch(((char *) state->memtuples) + 3 * sizeof(SortTuple) + offsetof(SortTuple, tuple))?
>
> For something heavily platform dependent like this that seems ok.

Well, it still needs to work for tuplestore, which does not have a SortTuple.

>> Because that was the fastest value following testing on my laptop. You
>> are absolutely right to point out that this isn't a good reason to go
>> with the patch -- I share your concern. All I can say in defense of
>> that is that other major system software does the same, without any
>> regard for the underlying microarchitecture AFAICT.
>
> I know Linux stripped out most prefetches at some point, even from x86-specific
> code, because it was shown that they aged very badly. I.e. they
> removed a bunch of them and stuff got faster, whereas they were
> beneficial on earlier architectures.

That is true, but IIRC that was specifically in relation to a commonly used list data structure that had prefetches all over the place, which was a pretty bad idea. I think that explicit prefetching has extremely limited uses, too. The only cases that I can imagine being helped are cases where there is extremely predictable sequential access, but some pointer indirection. Because of the way tuples are fetched across translation unit boundaries in the cases addressed by the patch, it isn't hard to see why the compiler does not do this automatically (prefetch instructions added by the compiler are not common anyway, IIRC).

The compiler has no way of knowing that gettuple_common() is ultimately called from an important inner loop, which could make all the difference, I suppose.

--
Peter Geoghegan
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore
From
Andres Freund
Date:
On 2015-09-02 16:02:00 -0700, Peter Geoghegan wrote:
> On Wed, Sep 2, 2015 at 3:13 PM, Andres Freund <andres@anarazel.de> wrote:
>> That's just a question of how to formulate this though?
>>
>> pg_rfetch(((char *) state->memtuples) + 3 * sizeof(SortTuple) + offsetof(SortTuple, tuple))?
>>
>> For something heavily platform dependent like this that seems ok.
>
> Well, it still needs to work for tuplestore, which does not have a SortTuple.

Isn't it even more trivial there? It's just an array of void*'s? So prefetch(state->memtuples + 3 + readptr->current)?

> Because of the way tuples are fetched across translation unit
> boundaries in the cases addressed by the patch, it isn't hard to see
> why the compiler does not do this automatically (prefetch instructions
> added by the compiler are not common anyway, IIRC).

Hardware prefetchers have just gotten rather good, and have obliterated most of the cases where explicit prefetching is beneficial.

I'd be interested to see a "perf stat -ddd" comparison of the patch with and without prefetches. It'll be interesting to see how the number of cache hits/misses and prefetches changes.

Which microarchitecture did you test this on?
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore
From
Peter Geoghegan
Date:
On Wed, Sep 2, 2015 at 4:12 PM, Andres Freund <andres@anarazel.de> wrote:
>> Well, it still needs to work for tuplestore, which does not have a SortTuple.
>
> Isn't it even more trivial there? It's just an array of void*'s? So
> prefetch(state->memtuples + 3 + readptr->current)?

All I meant is that there couldn't be one centralized definition for both. I don't mind if you want to bake it into a macro for tuplestore.c and tuplesort.c.

>> Because of the way tuples are fetched across translation unit
>> boundaries in the cases addressed by the patch, it isn't hard to see
>> why the compiler does not do this automatically (prefetch instructions
>> added by the compiler are not common anyway, IIRC).
>
> Hardware prefetchers just have gotten to be rather good and obliterated
> most of the cases where it's beneficial.

Well, if hardware prefetchers are bright enough to do this perfectly nowadays, then that implies a cost for useless prefetch instructions. It might still be worth it even then, if the cost is very low for those platforms. It's not as if they're required to respect the prefetch hint in any way.

> I'd be interested to see a perf stat -ddd comparison to the patch
> with/without prefetches. It'll be interesting to see how the number of
> cache hits/misses and prefetches changes.
>
> Which microarchitecture did you test this on?

My laptop has an Intel Core i7-3520M, which is a somewhat old mobile processor. So, Ivy Bridge.

--
Peter Geoghegan
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore
From
Andres Freund
Date:
On 2015-09-02 16:23:13 -0700, Peter Geoghegan wrote:
> On Wed, Sep 2, 2015 at 4:12 PM, Andres Freund <andres@anarazel.de> wrote:
>>> Well, it still needs to work for tuplestore, which does not have a SortTuple.
>>
>> Isn't it even more trivial there? It's just an array of void*'s? So
>> prefetch(state->memtuples + 3 + readptr->current)?
>
> All I meant is that there couldn't be one centralized definition for
> both. I don't mind if you want to bake it into a macro for
> tuplestore.c and tuplesort.c.

I'm not following? Just write pg_read_prefetch(state->memtuples + 3 + readptr->current) and the corresponding version for tuplesort in the call sites?
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore
From
Peter Geoghegan
Date:
On Wed, Sep 2, 2015 at 4:26 PM, Andres Freund <andres@anarazel.de> wrote:
> I'm not following? Just write pg_read_prefetch(state->memtuples + 3 +
> readptr->current) and the corresponding version for tuplesort in the
> call sites?

Oh, I see. Maybe I'll do it that way when I pick this up in a little while. I need to consider the "no tuple proper" (pass-by-value datum sort) case within tuplesort.c, where stup.tuple is always NULL. Currently, we discriminate against such SortTuples for various other reasons (e.g. for memory accounting) by testing if the pointer is set, so I suppose I copied that pattern here.

--
Peter Geoghegan
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore
From
Peter Geoghegan
Date:
On Wed, Sep 2, 2015 at 12:24 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Sep 2, 2015 at 7:12 AM, Andres Freund <andres@anarazel.de> wrote:
>> Hm. Does a compiler test actually test anything reliably here? Won't this
>> just throw a warning during compile time about an unknown function?
>
> I'll need to look into that.

My error here was not considering why the existing __builtin_bswap32() test worked alright with AC_COMPILE_IFELSE(), while my test needed AC_LINK_IFELSE(). I'll be sure to make a point of manually testing failure (not just success) when writing compiler tests in the future.

AC_LINK_IFELSE() appears to work fine following a straight substitution. When I get around to revising this patch in a while, that fix will be included.

Thanks

--
Peter Geoghegan
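[Editor's note: the corrected configure test would presumably look like the fragment Andres quoted earlier, with AC_LINK_IFELSE substituted for AC_COMPILE_IFELSE -- a sketch, not the revised patch itself:]

```m4
# PGAC_C_BUILTIN_PREFETCH
# -------------------------
# Check if the C compiler understands __builtin_prefetch(),
# and define HAVE__BUILTIN_PREFETCH if so.  AC_LINK_IFELSE is used
# rather than AC_COMPILE_IFELSE, so that a compiler which merely warns
# about an implicitly declared __builtin_prefetch (and then fails at
# link time) is not mistaken for one that supports the intrinsic.
AC_DEFUN([PGAC_C_BUILTIN_PREFETCH],
[AC_CACHE_CHECK(for __builtin_prefetch, pgac_cv__builtin_prefetch,
[AC_LINK_IFELSE([AC_LANG_PROGRAM([],
[int i = 0;__builtin_prefetch(&i, 0, 3);])],
[pgac_cv__builtin_prefetch=yes],
[pgac_cv__builtin_prefetch=no])])
if test x"$pgac_cv__builtin_prefetch" = xyes ; then
AC_DEFINE(HAVE__BUILTIN_PREFETCH, 1,
          [Define to 1 if your compiler understands __builtin_prefetch.])
fi])# PGAC_C_BUILTIN_PREFETCH
```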
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore
From
Andres Freund
Date:
On 2015-09-02 17:14:12 -0700, Peter Geoghegan wrote:
> On Wed, Sep 2, 2015 at 12:24 PM, Peter Geoghegan <pg@heroku.com> wrote:
>> On Wed, Sep 2, 2015 at 7:12 AM, Andres Freund <andres@anarazel.de> wrote:
>>> Hm. Does a compiler test actually test anything reliably here? Won't this
>>> just throw a warning during compile time about an unknown function?
>>
>> I'll need to look into that.
>
> My error here was not considering why the existing __builtin_bswap32()
> test worked alright with AC_COMPILE_IFELSE(), while my test needed
> AC_LINK_IFELSE(). I'll be sure to make a point of manually testing
> failure (not just success) when writing compiler tests in the future.
>
> AC_LINK_IFELSE() appears to work fine following a straight
> substitution. When I get around to revising this patch in a while,
> that fix will be included.

Be sure to also test with optimizations enabled -- often such tests get optimized away if you don't force the result to be used (or made volatile) :/
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore
From
David Rowley
Date:
On 3 September 2015 at 07:24, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Sep 2, 2015 at 7:12 AM, Andres Freund <andres@anarazel.de> wrote:
>> What worries me about adding explicit prefetching is that it's *highly*
>> platform and even micro-architecture dependent. Why is looking three
>> elements ahead the right value?
>
> Because that was the fastest value following testing on my laptop. You
> are absolutely right to point out that this isn't a good reason to go
> with the patch -- I share your concern. All I can say in defense of
> that is that other major system software does the same, without any
> regard for the underlying microarchitecture AFAICT. So, yes,
> certainly, more testing is required across a reasonable cross-section
> of platforms to justify the patch.

FWIW someone else found 3 to be good on the platform they tested on:

Peter, would you be able to share the test case which you saw the speedup on? So far I've been unable to see much of an improvement.
Regards
David Rowley
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore
From
Peter Geoghegan
Date:
On Wed, Sep 2, 2015 at 9:43 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> Peter, would you be able to share the test case which you saw the speedup
> on? So far I've been unable to see much of an improvement.

The case I tested was an internal sort CREATE INDEX. I don't recall the exact details, but testing showed it to be a fairly robust speedup. It was not a very brief CREATE INDEX operation, or a very lengthy one. trace_sort output made it quite visible that there was a significant saving after the sort is "performed", but before it is "done". It wasn't hard to see an improvement on a variety of other cases, although the Intel VTune tool made the difference particularly obvious.

The only thing that definitely won't be helped is pass-by-value datum sort cases. In case it matters, I used GCC 4.8.

--
Peter Geoghegan
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore
From
David Rowley
Date:
On 3 September 2015 at 16:50, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Sep 2, 2015 at 9:43 PM, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> Peter, would you be able to share the test case which you saw the speedup
>> on? So far I've been unable to see much of an improvement.
>
> The case I tested was an internal sort CREATE INDEX. I don't recall
> the exact details, but testing showed it to be a fairly robust
> speedup. It was not a very brief CREATE INDEX operation, or a very
> lengthy one. trace_sort output made it quite visible that there was
> a significant saving after the sort is "performed", but before it is
> "done". It wasn't hard to see an improvement on a variety of other
> cases, although the Intel VTune tool made the difference particularly
> obvious.
>
> The only thing that definitely won't be helped is pass-by-value datum
> sort cases. In case it matters, I used GCC 4.8.
My test cases are:
set work_mem ='1GB';
create table t1 as select md5(random()::text) from generate_series(1,10000000);
Times are in milliseconds. Median and average over 10 runs.
Test 1
select count(distinct md5) from t1;
Master Patched
Median 10,965.77 10,986.30 (99.81%)
Average 10,983.63 11,013.55 (99.73%)
Test 2
select sum(rn) from (select row_number() over (order by md5) rn from t1) a;
Master Patched
Median 12,499.03 12,465.21 (100.27%)
Average 12,504.87 12,468.91 (100.29%)
Test 3
create index t1_md5_idx on t1(md5);
Master Patched
Median 12,981.47 12,888.11 (100.72%)
Average 13,416.23 13,249.32 (101.26%)
gcc version 4.8.3
Intel(R) Xeon(R) CPU E5-2630 v3
Are you seeing any speedup from any of these on your hardware?
Regards
David Rowley
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore
From
Peter Geoghegan
Date:
On Thu, Sep 3, 2015 at 5:35 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> My test cases are:

Note that my text caching and unsigned integer comparison patches have moved the baseline down quite noticeably. I think that my mobile processor out-performs the Xeon you used for this, which seems a little odd even taking the change in baseline performance into account.

> set work_mem ='1GB';
> create table t1 as select md5(random()::text) from
> generate_series(1,10000000);
>
> Times are in milliseconds. Median and average over 10 runs.
>
> Test 1
> select count(distinct md5) from t1;
>
> Master Patched
> Median 10,965.77 10,986.30 (99.81%)
> Average 10,983.63 11,013.55 (99.73%)

> Are you seeing any speedup from any of these on your hardware?

I gather that 10,965.77 here means 10,965 milliseconds, since that roughly matches what I get. For the sake of simplicity, I will focus on your test 1 as a baseline. Note that I ran VACUUM FREEZE before any testing was performed, just in case.

On my laptop:

postgres=# \dt+ t1
                    List of relations
 Schema | Name | Type  | Owner |  Size  | Description
--------+------+-------+-------+--------+-------------
 public | t1   | table | pg    | 651 MB |
(1 row)

Times for "Test 1", "select count(distinct md5) from t1":

Patch:

Time: 10076.870 ms
Time: 10094.873 ms
Time: 10125.253 ms  <-- median
Time: 10222.042 ms
Time: 10269.247 ms

Master:

Time: 10641.142 ms
Time: 10706.181 ms
Time: 10708.860 ms  <-- median
Time: 10725.426 ms
Time: 10781.398 ms

So, to answer your question: yes, I can see a benefit for this query on my test hardware (laptop), although it is not spectacular. It may still be quite worthwhile. I attach a revised version of the patch tested here, following feedback from Andres. This should not make any difference to the performance.

It's worth considering that, for some (possibly legitimate) reason, the built-in function call is ignored by your compiler, since GCC has license to do that. You might try this on both master and patched builds:

~/postgresql/src/backend/utils/sort$ gdb -batch -ex 'file tuplesort.o' -ex 'disassemble tuplesort_gettuple_common' > prefetch_disassembly.txt

Then diff the prefetch_disassembly.txt files from each build to see what the differences are in practice. Consider an extract of the output on my system:

...
   0x00000000000028ee <+926>:  callq  0x28f3 <tuplesort_gettuple_common+931>
   0x00000000000028f3 <+931>:  nopl   0x0(%rax,%rax,1)
   0x00000000000028f8 <+936>:  sub    $0x1,%eax
   0x00000000000028fb <+939>:  test   %eax,%eax
   0x00000000000028fd <+941>:  mov    %eax,0xd0(%rdi)
   0x0000000000002903 <+947>:  jne    0x25ce <tuplesort_gettuple_common+126>
   0x0000000000002909 <+953>:  jmpq   0x2710 <tuplesort_gettuple_common+448>
   0x000000000000290e <+958>:  xchg   %ax,%ax
   0x0000000000002910 <+960>:  mov    0x58(%rdi),%rsi
   0x0000000000002914 <+964>:  lea    (%rax,%rax,2),%rax
   0x0000000000002918 <+968>:  lea    (%rsi,%rax,8),%rax
   0x000000000000291c <+972>:  mov    0x30(%rax),%rax
   0x0000000000002920 <+976>:  prefetchnta (%rax)
   0x0000000000002923 <+979>:  mov    $0x1,%eax
   0x0000000000002928 <+984>:  jmpq   0x2712 <tuplesort_gettuple_common+450>
   0x000000000000292d <+989>:  nopl   (%rax)
...

Notably, there is a prefetchnta instruction here.

Note that I'm going away on vacation in about a week. I wanted to give people feedback on various things before then, since it was overdue. FYI, after Thursday I will be very unlikely to answer e-mail for a couple of weeks.

--
Peter Geoghegan
Attachment
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore
From
Amir Rohan
Date:
On 10/11/2015 03:20 AM, Peter Geoghegan wrote:
> On Thu, Sep 3, 2015 at 5:35 PM, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> My test cases are:
>
> Note that my text caching and unsigned integer comparison patches have
> moved the baseline down quite noticeably. I think that my mobile
> processor out-performs the Xeon you used for this, which seems a
> little odd even taking the change in baseline performance into account.

To add a caveat not yet mentioned: the idea behind prefetching is to sacrifice spare memory bandwidth for performance. That can be a winning bet on a quiet box (the one we benchmark on), but can backfire on a production db, where the extra memory pressure can degrade all running queries. Something to test for, or at least keep in mind.

>> set work_mem ='1GB';
>> create table t1 as select md5(random()::text) from
>> generate_series(1,10000000);
>>
>> Times are in milliseconds. Median and average over 10 runs.
>>
>> Test 1

I am the reluctant owner of outmoded hardware, namely a Core 2 from around 2007 on plain spinning metal. My results (Linux, 64-bit):

------ Test 1 ------
set work_mem ='1GB';
select count(distinct md5) from t1;

== Master ==
42771.040 ms <- outlier?
41704.570 ms
41631.660 ms
41421.877 ms

== Patch ==
42469.911 ms <- outlier?
41378.556 ms
41375.870 ms
41118.105 ms
41096.283 ms
41095.705 ms

------ Test 2 ------
select sum(rn) from (select row_number() over (order by md5) rn from t1) a;

== Master ==
44186.775 ms
44137.154 ms
44111.271 ms
44061.896 ms
44109.122 ms

== Patch ==
44592.590 ms
44403.076 ms
44276.170 ms

A very slight difference, in an ambiguous direction, but also no perf catastrophe.

> It's worth considering that for some (possibly legitimate) reason, the
> built-in function call is ignored by your compiler, since GCC has
> license to do that.
> You might try this on both master and patched builds:
>
> ~/postgresql/src/backend/utils/sort$ gdb -batch -ex 'file tuplesort.o'
> -ex 'disassemble tuplesort_gettuple_common' > prefetch_disassembly.txt
>
> ...
>
> Notably, there is a prefetchnta instruction here.

I have verified that the prefetch is emitted in the disassembly.

An added benefit of owning outmoded hardware is that the MSR layout for this generation is public, so I can disable individual prefetcher units by twiddling a bit. Disabling either the "HW prefetch" or the "DCU prefetch" unit on a patched build gave results that look relatively unchanged, which seems promising. Disabling them both at once on an unpatched build shows a slowdown of 5-6% in test 1 (43347.181, 43898.705, 43399.428). That gives an indication of the maximum potential gain in this direction, for this box at least.

Finally, I notice my results are 4x slower than everyone else's. That can be very tough on a man's pride, let me tell you.

Amir
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore
From
David Rowley
Date:
On 11 October 2015 at 13:20, Peter Geoghegan <pg@heroku.com> wrote:
It's worth considering that for some (possibly legitimate) reason, the
built-in function call is ignored by your compiler, since GCC has
license to do that. You might try this on both master and patched
builds:
You're right, gcc did not include the prefetch instructions.
I've tested again on the same machine but with clang 3.7 instead of gcc 4.8.3
I've conducted the same tests again. All times are in milliseconds. Results are the average and median over 10 runs.
set work_mem ='1GB';
create table t1 as select md5(random()::text) from generate_series(1,10000000);
vacuum freeze t1;
Running one query at a time, the results are as follows:
Test1: select count(distinct md5) from t1;
          Master       Patched      Gain
Average   10853.679    10132.544    107.12%
Median    10754.193    10005.001    107.49%
Test2: select sum(rn) from (select row_number() over (order by md5) rn from t1) a;
          Master        Patched       Gain
Average   11495.8703    11475.0081    100.18%
Median    11495.6015    11455.944     100.35%
Test3: create index t1_md5_idx on t1(md5);
          Master        Patched       Gain
Average   36464.4632    37830.3879    96.39%
Median    35946.608     36765.0055    97.77%
I also decided to run multiple queries at once, to see if there were any cache pollution problems with the prefetching.
Test 1 pgbench -T 600 -c 16 -j 16 -f test1.sql -n
Test 2 pgbench -T 600 -c 16 -j 16 -f test2.sql -n
(tps output from pgbench was converted to milliseconds with 1/TPS*1000)
         Master      Patched     Gain
Test 1   1375.413    1358.494    101.25%
Test 2   1594.753    1588.340    100.40%
CPU: 1 x Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore
From
Peter Geoghegan
Date:
On Sun, Nov 29, 2015 at 8:52 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> You're right, gcc did not include the prefetch instructions.
> I've tested again on the same machine but with clang 3.7 instead of
> gcc 4.8.3

Thanks for going to the trouble of investigating this.

These results are mediocre -- in general, it seems like the prefetch instructions are almost as likely to hurt as to help. Naturally, I don't want to go down the road of considering every microarchitecture, and that seems to be what it would take to get this to work well, if that's possible at all.

I'm currently running some benchmarks on my external sorting patch on the POWER7 machine that Robert Haas and a few other people have been using for some time now [1]. So far, the benchmarks look very good across a variety of scales.

I'll run a round of tests without the prefetching enabled (which the patch series makes further use of -- the prefetches are also used when writing tuples out). If there is no significant impact, I'll completely abandon this patch, and we can move on.

[1] http://rhaas.blogspot.com/2012/03/performance-and-scalability-on-ibm.html
--
Peter Geoghegan
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore
From
Peter Geoghegan
Date:
On Sun, Nov 29, 2015 at 10:14 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I'm currently running some benchmarks on my external sorting patch on
> the POWER7 machine that Robert Haas and a few other people have been
> using for some time now [1]. So far, the benchmarks look very good
> across a variety of scales.
>
> I'll run a round of tests without the prefetching enabled (which the
> patch series makes further use of -- they're also used when writing
> tuples out). If there is no significant impact, I'll completely
> abandon this patch, and we can move on.

I took a look at this. It turns out prefetching significantly helps on the POWER7 system when sorting gensort tables of 50 million, 100 million, 250 million, and 500 million tuples (3 CREATE INDEX tests for each case, 1GB maintenance_work_mem):

[pg@hydra gensort]$ cat test_output_patch_1gb.txt | grep "sort ended"
LOG: external sort ended, 171063 disk blocks used: CPU 4.33s/71.28u sec elapsed 75.75 sec
LOG: external sort ended, 171063 disk blocks used: CPU 4.30s/71.32u sec elapsed 75.91 sec
LOG: external sort ended, 171063 disk blocks used: CPU 4.29s/71.34u sec elapsed 75.69 sec
LOG: external sort ended, 342124 disk blocks used: CPU 8.10s/165.56u sec elapsed 174.35 sec
LOG: external sort ended, 342124 disk blocks used: CPU 8.07s/165.15u sec elapsed 173.70 sec
LOG: external sort ended, 342124 disk blocks used: CPU 8.01s/164.73u sec elapsed 174.84 sec
LOG: external sort ended, 855306 disk blocks used: CPU 23.65s/491.37u sec elapsed 522.44 sec
LOG: external sort ended, 855306 disk blocks used: CPU 21.13s/508.02u sec elapsed 530.48 sec
LOG: external sort ended, 855306 disk blocks used: CPU 22.63s/475.33u sec elapsed 499.09 sec
LOG: external sort ended, 1710613 disk blocks used: CPU 47.99s/1016.78u sec elapsed 1074.55 sec
LOG: external sort ended, 1710613 disk blocks used: CPU 46.52s/1015.25u sec elapsed 1078.23 sec
LOG: external sort ended, 1710613 disk blocks used: CPU 44.34s/1013.26u sec elapsed 1067.16 sec
[pg@hydra gensort]$ cat test_output_patch_noprefetch_1gb.txt | grep "sort ended"
LOG: external sort ended, 171063 disk blocks used: CPU 4.79s/78.14u sec elapsed 83.03 sec
LOG: external sort ended, 171063 disk blocks used: CPU 3.85s/77.71u sec elapsed 81.64 sec
LOG: external sort ended, 171063 disk blocks used: CPU 3.94s/77.71u sec elapsed 81.71 sec
LOG: external sort ended, 342124 disk blocks used: CPU 8.88s/180.15u sec elapsed 189.69 sec
LOG: external sort ended, 342124 disk blocks used: CPU 8.30s/179.07u sec elapsed 187.92 sec
LOG: external sort ended, 342124 disk blocks used: CPU 8.29s/179.06u sec elapsed 188.02 sec
LOG: external sort ended, 855306 disk blocks used: CPU 22.16s/516.86u sec elapsed 541.35 sec
LOG: external sort ended, 855306 disk blocks used: CPU 21.66s/513.59u sec elapsed 538.00 sec
LOG: external sort ended, 855306 disk blocks used: CPU 22.56s/499.63u sec elapsed 525.53 sec
LOG: external sort ended, 1710613 disk blocks used: CPU 45.00s/1062.26u sec elapsed 1118.52 sec
LOG: external sort ended, 1710613 disk blocks used: CPU 44.42s/1061.33u sec elapsed 1117.27 sec
LOG: external sort ended, 1710613 disk blocks used: CPU 44.47s/1064.93u sec elapsed 1118.79 sec

For example, the 50 million tuple test has over 8% of its runtime shaved off. This seems to be a consistent pattern. Note that only the writing of tuples uses prefetching here, because that happens to be the only affected codepath for prefetching (note also that this is the slightly different, external-sort-specific version of the patch). I hesitate to give that up, although it is noticeable that the prefetching matters less at higher scales, where we're bottlenecked on quicksorting itself more so than on writing. Those costs grow at different rates, of course.

Perhaps we can consider more selectively applying prefetching in the context of writing out tuples.
After all, the amount of useful work that we can do while a fetch from memory is pending ought to be more consistent and manageable there, which could make it a consistent win. I will need to think about this some more.

--
Peter Geoghegan
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore
From
Peter Geoghegan
Date:
On Mon, Nov 30, 2015 at 1:04 PM, Peter Geoghegan <pg@heroku.com> wrote:
> For example, the 50 million tuple test has over 8% of its runtime
> shaved off. This seems to be a consistent pattern.

I included the nitty-gritty details of this case in something attached to a mail I just sent, over in the "Quicksort for every external sort run" thread, FWIW.

--
Peter Geoghegan
Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore
From
Peter Geoghegan
Date:
On Mon, Nov 30, 2015 at 1:04 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Perhaps we can consider more selectively applying prefetching in the
> context of writing out tuples.

It may still be worth selectively applying these techniques to writing out tuples, per my previous remarks (and the latest "Using quicksort for every external sort run" revision, which has a specialized version of this patch). I see real benefits there across a variety of situations on different hardware, and I'm hesitant to give that up without further analysis.

However, I think that this patch is over as an independent piece of work -- there is too much of a mixed picture. I'm going to mark this patch "returned with feedback".

--
Peter Geoghegan