Memory prefetching while sequentially fetching from SortTuple array, tuplestore

From
Peter Geoghegan
Date:
I've always thought that GCC intrinsics for software-based memory
prefetching are a very sharp tool. While it's really hard to use GCC's
__builtin_prefetch() effectively (I've tried enough times to know), I
suspected that there'd probably end up being a handful of cases
where using it presented a clear and worthwhile win. Hash joins seemed
to me to be a likely candidate, because memory latency is a real
killer there. However, I stumbled upon another case where prefetching
pays clear dividends while actually being quite simple: sequentially
reading an already-sorted memtuples array (an array of SortTuple
structs), during some kind of post-processing.

This comes up for many common and important cases, including regular
sort nodes (heaptuple sorts), datumsorts with pass-by-reference types
(e.g. ordered set aggregates), and B-Tree index builds, where the
sequential fetching occurs as actual B-Tree leaf pages are built from
the array.

Patch
=====

Attached patch adds a portable Postgres wrapper on the GCC intrinsic.
It also adds a client within tuplesort.c -- a common function that
seems like a good generic choke point. I can get a speed-up of 6% - 9%
for all of these cases by prefetching a few SortTuples ahead, for the
"tuple proper". (In-memory sorts only.)
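As a sketch of the pattern (using hypothetical stand-in types rather than the patch's actual SortTuple and tuplesort.c code), the idea is simply to request the out-of-line tuple of a later element while the current one is still being processed:

```c
#include <assert.h>

/* Hypothetical stand-in for a SortTuple: an inline field plus a pointer
 * to the out-of-line "tuple proper". */
typedef struct
{
    int     key;
    int    *tuple;
} DemoSortTuple;

/*
 * Sequentially post-process an already-sorted array.  While processing
 * element i, prefetch the out-of-line tuple of element i + 3, so the
 * memory latency of the pointer indirection is hidden behind useful work.
 */
static long
consume_sorted(const DemoSortTuple *tuples, int ntuples)
{
    long    sum = 0;
    int     i;

    for (i = 0; i < ntuples; i++)
    {
#ifdef __GNUC__
        /* read-only hint (0), no temporal locality (0), as in the patch */
        if (i + 3 < ntuples)
            __builtin_prefetch(tuples[i + 3].tuple, 0, 0);
#endif
        sum += *tuples[i].tuple;    /* the cache-missing dereference */
    }
    return sum;
}
```

The look-ahead distance of three is the empirically tuned value discussed below, not a universal constant.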

ISTM that this alone makes the patch worthwhile, but we should still
consider the break-down of costs in each of these operations, and that
the cost of the sequential reading and processing of SortTuples is all
that has been touched by the patch. There are only about 3 real lines
of code added to tuplesort. Obviously, the dominant costs within each
of these queries/utility commands are the sort proper, and to a lesser
extent the heap scan preceding it. I am not targeting those at all --
I'm only targeting the tertiary cost of sequentially scanning the
memtuples array by hiding memory latency, and yet I end up with a
notable overall speed-up.

It's only really fair to look at this final sequential fetching cost
in isolation. The reduction in that cost in isolation seems to be
huge. I tried out the proprietary Intel vTune Amplifier utility with
this case, and it indicated that CPU time spent in calls to
_bt_buildadd was less than 1/8 the previous time in a comparable test
of the master branch, with no other changes that I can attribute to
this patch (i.e. there is a bit of noise). You can easily get some
hint of the size of the improvement by reviewing "trace_sort = on" log
output, and the reduction of time spent after "performsort" is done
but before the sort is shut down.

Tuplestores
----------------

tuplestore.c has a similar optimization added, on the theory that it
should match tuplesort.c wherever possible, and because it seems to
help a lot there too. For example, queries that perform a big CTE Scan
seem to benefit to about the same degree (although, again, only when
the array is fully in memory).

Portability concerns
===============

I started writing this patch with an intuition that this would help. I
arrived at the fixed-up version posted here by frobbing the code
based on feedback from Postgres running on my laptop. I iteratively
tweaked the number of tuples to prefetch ahead, and the value of
__builtin_prefetch()'s temporal locality argument, which helped
a little for a number of test cases. If that doesn't inspire much
confidence in this patch, then good -- that's the idea. Obviously if
we are going to commit this, there needs to be testing of a reasonably
representative selection of platforms.

On the other hand, I couldn't find a case that was regressed, and I am
encouraged by the fact that this helps the post-processing of
SortTuples so effectively. I imagine that a platform would have to
have very different performance characteristics for this to be the
wrong thing. However, it seems possible that there is a case that has
been regressed, since the prefetch is added at a very generic point
within tuplesort.c and tuplestore.c.
--
Peter Geoghegan

Attachment
On Thu, Jul 16, 2015 at 4:01 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Attached patch adds a portable Postgres wrapper on the GCC intrinsic.
> It also adds a client within tuplesort.c -- a common function that
> seems like a good generic choke point. I can get a speed-up of 6% - 9%
> for all of these cases by prefetching a few SortTuples ahead, for the
> "tuple proper". (In-memory sorts only.)

I added a silly bug during last minute clean-up. I attach a V2.

--
Peter Geoghegan

Attachment
Peter Geoghegan <pg@heroku.com> writes:
> Attached patch adds a portable Postgres wrapper on the GCC intrinsic.

Meh.  I don't like the assumption that non-GCC compilers will be smart
enough to optimize away the useless-to-them if() tests this adds.
Please refactor that so that there is exactly 0 new code when the
intrinsic doesn't exist.
        regards, tom lane



On Thu, Jul 16, 2015 at 8:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Meh.  I don't like the assumption that non-GCC compilers will be smart
> enough to optimize away the useless-to-them if() tests this adds.
> Please refactor that so that there is exactly 0 new code when the
> intrinsic doesn't exist.

I imagined that there was some value in copying the GCC intrinsic's
behavior, and actually evaluating the "addr" expression even in the
event of no platform support. On reflection, I suppose that that isn't
actually a particularly useful property for Postgres. There will only
ever be a handful of callers.

Attached revision does not rely on such optimization occurring on
platforms that lack __builtin_prefetch(). This allowed me to decouple
availability from actual use, in the style of posix_fadvise(), so that
one can manually disable memory prefetching within pg_config_manual.h.
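A minimal sketch of that arrangement, assuming illustrative macro names modeled on the description above (the revised patch's exact identifiers may differ): configure detects the intrinsic, pg_config_manual.h can veto its use, and callers compile to exactly zero extra code when USE_MEM_PREFETCH is unset:

```c
#include <assert.h>

/* Normally defined by configure when the compiler accepts the intrinsic. */
#if defined(__GNUC__) || defined(__clang__)
#define HAVE__BUILTIN_PREFETCH 1
#endif

/*
 * Availability is decoupled from use: defining DISABLE_MEM_PREFETCH (e.g.
 * in pg_config_manual.h) turns memory prefetching off even where the
 * intrinsic exists, in the style of posix_fadvise/USE_POSIX_FADVISE.
 */
#if defined(HAVE__BUILTIN_PREFETCH) && !defined(DISABLE_MEM_PREFETCH)
#define USE_MEM_PREFETCH
/* second argument 0 = read-only access; third argument 0 = no temporal locality */
#define pg_rfetch(addr) __builtin_prefetch((addr), 0, 0)
#endif

/* A caller: the #ifdef means non-supporting builds get no new code at all. */
static int
read_element(const int *arr, int n, int i)
{
#ifdef USE_MEM_PREFETCH
    if (i + 3 < n)
        pg_rfetch(&arr[i + 3]);
#endif
    return arr[i];
}
```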

Clang is compatible with the __builtin_prefetch() intrinsic, FWIW. I'm
not sure if it's worth trying to make the wrapper portable across a
variety of supported compilers. If we were to attempt it, we would not
be the first. I note that ICC's memref_control has an identical
interface to __builtin_prefetch().

--
Peter Geoghegan

Attachment

Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

From
Andres Freund
Date:
On 2015-07-19 16:34:52 -0700, Peter Geoghegan wrote:
> +# PGAC_C_BUILTIN_PREFETCH
> +# -------------------------
> +# Check if the C compiler understands __builtin_prefetch(),
> +# and define HAVE__BUILTIN_PREFETCH if so.
> +AC_DEFUN([PGAC_C_BUILTIN_PREFETCH],
> +[AC_CACHE_CHECK(for __builtin_prefetch, pgac_cv__builtin_prefetch,
> +[AC_COMPILE_IFELSE([AC_LANG_PROGRAM([],
> +[int i = 0;__builtin_prefetch(&i, 0, 3);])],
> +[pgac_cv__builtin_prefetch=yes],
> +[pgac_cv__builtin_prefetch=no])])
> +if test x"$pgac_cv__builtin_prefetch" = xyes ; then
> +AC_DEFINE(HAVE__BUILTIN_PREFETCH, 1,
> +          [Define to 1 if your compiler understands __builtin_prefetch.])
> +fi])# PGAC_C_BUILTIN_PREFETCH

Hm. Does a compiler test actually test anything reliably here? Won't this
just throw a warning at compile time about an unknown function?

> +/*
> + * Prefetch support -- Support memory prefetching hints on some platforms.
> + *
> + * pg_rfetch() is specialized for the case where an array is accessed
> + * sequentially, and we can prefetch a pointer within the next element (or an
> + * even later element) in order to hide memory latency.  This case involves
> + * prefetching addresses with low temporal locality.  Note that it's rather
> + * difficult to get any kind of speedup using pg_rfetch();  any use of the
> + * intrinsic should be carefully tested.  Also note that it's okay to pass it
> + * an invalid or NULL address, although it's best avoided.
> + */
> +#if defined(USE_MEM_PREFETCH)
> +#define pg_rfetch(addr)        __builtin_prefetch((addr), 0, 0)
> +#endif

What is that name actually supposed to mean? 'rfetch' doesn't seem to be
particularly clear - or is it meant to be wordplay combined with the
p?

I don't think you should specify the read/write and locality parameters
if we don't hand-pick them - right now you're saying the memory will
only be read and that it has no temporal locality.

I think there should be an empty fallback definition even if the current
only caller ends up not needing it - not all callers will require it.

> +                    /*
> +                     * Perform memory prefetch of tuple that's three places
> +                     * ahead of current (which is returned to caller).
> +                     * Testing shows that this significantly boosts the
> +                     * performance for TSS_INMEM "forward" callers by
> +                     * hiding memory latency behind their processing of
> +                     * returned tuples.
> +                     */
> +#ifdef USE_MEM_PREFETCH
> +                    if (readptr->current + 3 < state->memtupcount)
> +                        pg_rfetch(state->memtuples[readptr->current + 3]);
> +#endif

Hm. Why do we need a branch here? The targets of prefetches don't need to
be valid addresses and adding a bunch of arithmetic and a branch for the
corner case doesn't sound worthwhile to me.


What worries me about adding explicit prefetching is that it's *highly*
platform and even micro-architecture dependent. Why is looking three
elements ahead the right value?

Greetings,

Andres Freund



Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

From
Peter Geoghegan
Date:
On Wed, Sep 2, 2015 at 7:12 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-07-19 16:34:52 -0700, Peter Geoghegan wrote:
> Hm. Does a compiler test actually test anything reliably here? Won't this
> just throw a warning at compile time about an unknown function?

I'll need to look into that.

>> +/*
>> + * Prefetch support -- Support memory prefetching hints on some platforms.
>> + *
>> + * pg_rfetch() is specialized for the case where an array is accessed
>> + * sequentially, and we can prefetch a pointer within the next element (or an
>> + * even later element) in order to hide memory latency.  This case involves
>> + * prefetching addresses with low temporal locality.  Note that it's rather
>> + * difficult to get any kind of speedup using pg_rfetch();  any use of the
>> + * intrinsic should be carefully tested.  Also note that it's okay to pass it
>> + * an invalid or NULL address, although it's best avoided.
>> + */
>> +#if defined(USE_MEM_PREFETCH)
>> +#define pg_rfetch(addr)              __builtin_prefetch((addr), 0, 0)
>> +#endif
>
> What is that name actually supposed to mean? 'rfetch' doesn't seem to be
> particularly clear - or is it meant to be wordplay combined with the
> p?

"Read fetch". One argument passed to the intrinsic here specifies that
the variable will be read only. I did things this way because I
imagined that there would be very limited uses for the macro only. I
probably cover almost all interesting cases for explicit memory
prefetching already.

> I don't think you should specify the read/write and locality parameters
> if we don't hand-pick them - right now you're saying the memory will
> only be read and that it has no temporal locality.
>
> I think there should be an empty fallback definition even if the current
> only caller ends up not needing it - not all callers will require it.

It started out that way, but Tom felt that it was better to have a
USE_MEM_PREFETCH because of the branch below...

>> +                                     /*
>> +                                      * Perform memory prefetch of tuple that's three places
>> +                                      * ahead of current (which is returned to caller).
>> +                                      * Testing shows that this significantly boosts the
>> +                                      * performance for TSS_INMEM "forward" callers by
>> +                                      * hiding memory latency behind their processing of
>> +                                      * returned tuples.
>> +                                      */
>> +#ifdef USE_MEM_PREFETCH
>> +                                     if (readptr->current + 3 < state->memtupcount)
>> +                                             pg_rfetch(state->memtuples[readptr->current + 3]);
>> +#endif
>
> Hm. Why do we need a branch here? The targets of prefetches don't need to
> be valid addresses and adding a bunch of arithmetic and a branch for the
> corner case doesn't sound worthwhile to me.

...which is needed, because I'm still required to not dereference wild
pointers. In other words, although pg_rfetch()/__builtin_prefetch()
does not require that a valid pointer be passed, it is not okay to
read past an array's bounds to read that pointer. The GCC docs are
clear on this -- "Data prefetch does not generate faults if addr is
invalid, but the address expression itself must be valid". Also, for
cases that don't benefit (like datum sorts, which never have a "tuple
proper" -- the above code is for tuplestore, not tuplesort, so you
can't see this), it seemed faster to have the branch than to rely on
__builtin_prefetch() being forgiving about invalid/wild pointers. I
saw a regression without the branch for that case.
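To make the distinction concrete (the types here are hypothetical stand-ins for tuplestore's state, not its real structs): the prefetch *target* may be a wild or NULL pointer, but the array load that produces it is an ordinary memory access and must stay in bounds, hence the branch:

```c
#include <assert.h>

typedef struct
{
    void  **memtuples;      /* array of tuple pointers, as in TSS_INMEM */
    int     memtupcount;    /* number of valid entries */
} DemoStore;

static void *
fetch_forward(const DemoStore *state, int current)
{
#ifdef __GNUC__
    /*
     * memtuples[current + 3] is a real load from the array: evaluating it
     * out of bounds would read past the array, which is not allowed.
     * Only the loaded value -- the address handed to the prefetch -- is
     * permitted to be invalid.
     */
    if (current + 3 < state->memtupcount)
        __builtin_prefetch(state->memtuples[current + 3], 0, 0);
#endif
    return state->memtuples[current];
}
```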

> What worries me about adding explicit prefetching is that it's *highly*
> platform and even micro-architecture dependent. Why is looking three
> elements ahead the right value?

Because that was the fastest value following testing on my laptop. You
are absolutely right to point out that this isn't a good reason to go
with the patch -- I share your concern. All I can say in defense of
that is that other major system software does the same, without any
regard for the underlying microarchitecture AFAICT. So, yes,
certainly, more testing is required across a reasonable cross-section
of platforms to justify the patch. But right now, time spent in
_bt_buildadd() is massively reduced with the patch, which makes it
worthwhile in my mind. It's really very effective if you look at the
only part of (say) a CREATE INDEX that is affected one way or another
-- the final fetching of tuples.

BTW, I am very close to posting a patch that makes (say) CREATE INDEX
very fast for external sorts. That uses a memory pool to make final
on-the-fly merge memory access sequential per tape. That's what I'll
do rather than explicitly prefetching for external sorts. It makes a
huge difference on top of the external sort work already posted,
because the final merge step was a big remaining bottleneck which I've
now fixed. An unlogged CREATE INDEX with 6 runs to merge is often only
15% slower than an equivalent CREATE INDEX that is all internal
(against the master branch).

-- 
Peter Geoghegan



Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

From
Andres Freund
Date:
On 2015-09-02 12:24:33 -0700, Peter Geoghegan wrote:
> "Read fetch". One argument passed to the intrinsic here specifies that
> the variable will be read only. I did things this way because I
> imagined that there would be very limited uses for the macro only. I
> probably cover almost all interesting cases for explicit memory
> prefetching already.

I'd be less brief in this case then, no need to be super short here.

> > I don't think you should specify the read/write and locality parameters
> > if we don't hand-pick them - right now you're saying the memory will
> > only be read and that it has no temporal locality.
> >
> > I think there should be an empty fallback definition even if the current
> > only caller ends up not needing it - not all callers will require it.
> 
> It started out that way, but Tom felt that it was better to have a
> USE_MEM_PREFETCH because of the branch below...

That doesn't mean we shouldn't still provide an empty definition.

> > Hm. Why do we need a branch here? The targets of prefetches don't need to
> > be valid addresses and adding a bunch of arithmetic and a branch for the
> > corner case doesn't sound worthwhile to me.
> 
> ...which is needed, because I'm still required to not dereference wild
> pointers. In other words, although pg_rfetch()/__builtin_prefetch()
> does not require that a valid pointer be passed, it is not okay to
> read past an array's bounds to read that pointer. The GCC docs are
> clear on this -- "Data prefetch does not generate faults if addr is
> invalid, but the address expression itself must be valid".

That's just a question of how to formulate this though?

pg_rfetch(((char *) state->memtuples ) + 3 * sizeof(SortTuple) + offsetof(SortTuple, tuple))?

For something heavily platform dependent like this that seems ok.

> Because that was the fastest value following testing on my laptop. You
> are absolutely right to point out that this isn't a good reason to go
> with the patch -- I share your concern. All I can say in defense of
> that is that other major system software does the same, without any
> regard for the underlying microarchitecture AFAICT.

I know linux stripped out most prefetches at some point, even from x86
specific code, because it showed that they aged very badly. I.e. they
removed a bunch of them and stuff got faster, whereas they were
beneficial on earlier architectures.

Greetings,

Andres Freund



Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

From
Peter Geoghegan
Date:
On Wed, Sep 2, 2015 at 3:13 PM, Andres Freund <andres@anarazel.de> wrote:
> I'd be less brief in this case then, no need to be super short here.

Okay.

>> It started out that way, but Tom felt that it was better to have a
>> USE_MEM_PREFETCH because of the branch below...
>
> That doesn't mean we shouldn't still provide an empty definition.

Okay.

> That's just a question of how to formulate this though?
>
> pg_rfetch(((char *) state->memtuples ) + 3 * sizeof(SortTuple) + offsetof(SortTuple, tuple))?
>
> For something heavily platform dependent like this that seems ok.

Well, still needs to work for tuplestore, which does not have a SortTuple.

>> Because that was the fastest value following testing on my laptop. You
>> are absolutely right to point out that this isn't a good reason to go
>> with the patch -- I share your concern. All I can say in defense of
>> that is that other major system software does the same, without any
>> regard for the underlying microarchitecture AFAICT.
>
> I know linux stripped out most prefetches at some point, even from x86
> specific code, because it showed that they aged very badly. I.e. they
> removed a bunch of them and stuff got faster, whereas they were
> beneficial on earlier architectures.

That is true, but IIRC that was specifically in relation to a commonly
used list data structure that had prefetches all over the place. That
was a pretty bad idea.

I think that explicit prefetching has extremely limited uses, too. The
only cases that I can imagine being helped are cases where there is
extremely predictable sequential access, but some pointer indirection.

Because of the way tuples are fetched across translation unit
boundaries in the cases addressed by the patch, it isn't hard to see
why the compiler does not do this automatically (prefetch instructions
added by the compiler are not common anyway, IIRC). The compiler has
no way of knowing that gettuple_common() is ultimately called from an
important inner loop, which could make all the difference, I suppose.

-- 
Peter Geoghegan



Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

From
Andres Freund
Date:
On 2015-09-02 16:02:00 -0700, Peter Geoghegan wrote:
> On Wed, Sep 2, 2015 at 3:13 PM, Andres Freund <andres@anarazel.de> wrote:
> > That's just a question of how to formulate this though?
> >
> > pg_rfetch(((char *) state->memtuples ) + 3 * sizeof(SortTuple) + offsetof(SortTuple, tuple))?
> >
> > For something heavily platform dependent like this that seems ok.
> 
> Well, still needs to work for tuplestore, which does not have a SortTuple.

Isn't it even more trivial there? It's just an array of void*'s? So
prefetch(state->memtuples + 3 + readptr->current)?
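Sketched out, the point is that this form involves only address arithmetic, not a load through the array, so (assuming we accept computing an address slightly past the valid entries, as suggested here) no bounds check is needed before the prefetch itself:

```c
#include <assert.h>

/*
 * Branch-free look-ahead over a plain pointer array: the address
 * memtuples + current + 3 is computed but never dereferenced, and
 * __builtin_prefetch does not fault on an address that happens to fall
 * past the valid entries -- the prefetch is simply useless in that case.
 */
static void *
read_next(void **memtuples, int current)
{
#ifdef __GNUC__
    __builtin_prefetch(memtuples + current + 3, 0, 0);
#endif
    return memtuples[current];
}
```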

> Because of the way tuples are fetched across translation unit
> boundaries in the cases addressed by the patch, it isn't hard to see
> why the compiler does not do this automatically (prefetch instructions
> added by the compiler are not common anyway, IIRC).

Hardware prefetchers have just gotten rather good and obliterated
most of the cases where it's beneficial.

I'd be interested to see a perf stat -ddd comparison to the patch
with/without prefetches. It'll be interesting to see how the number of
cache hits/misses and prefetches changes.

Which microarchitecture did you test this on?



Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

From
Peter Geoghegan
Date:
On Wed, Sep 2, 2015 at 4:12 PM, Andres Freund <andres@anarazel.de> wrote:
>> Well, still needs to work for tuplestore, which does not have a SortTuple.
>
> Isn't it even more trivial there? It's just an array of void*'s? So
> prefetch(state->memtuples + 3 + readptr->current)?

All I meant is that there couldn't be one centralized definition for
both. I don't mind if you want to bake it into a macro for
tuplestore.c and tuplesort.c.

>> Because of the way tuples are fetched across translation unit
>> boundaries in the cases addressed by the patch, it isn't hard to see
>> why the compiler does not do this automatically (prefetch instructions
>> added by the compiler are not common anyway, IIRC).
>
> Hardware prefetchers have just gotten rather good and obliterated
> most of the cases where it's beneficial.

Well, if hardware prefetchers are bright enough to do this perfectly
nowadays, then that implies a cost for useless prefetch instructions.
It might still be worth it even then, if the cost is very low for
these platforms. It's not as if they're required to respect the
prefetch hint in any way.

> I'd be interested to see a perf stat -ddd comparison to the patch
> with/without prefetches. It'll be interesting to see how the number of
> cache hits/misses and prefetches changes.
>
> Which microarchitecture did you test this on?

My laptop has an Intel Core i7-3520M, which is a mobile processor that
is a bit old. So, Ivy Bridge.

-- 
Peter Geoghegan



Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

From
Andres Freund
Date:
On 2015-09-02 16:23:13 -0700, Peter Geoghegan wrote:
> On Wed, Sep 2, 2015 at 4:12 PM, Andres Freund <andres@anarazel.de> wrote:
> >> Well, still needs to work for tuplestore, which does not have a SortTuple.
> >
> > Isn't it even more trivial there? It's just an array of void*'s? So
> > prefetch(state->memtuples + 3 + readptr->current)?
> 
> All I meant is that there couldn't be one centralized definition for
> both. I don't mind if you want to bake it into a macro for
> tuplestore.c and tuplesort.c.

I'm not following? Just write pg_read_prefetch(state->memtuples + 3 +
readptr->current) and the corresponding version for tuplesort in the
callsites?



Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

From
Peter Geoghegan
Date:
On Wed, Sep 2, 2015 at 4:26 PM, Andres Freund <andres@anarazel.de> wrote:
> I'm not following? Just write pg_read_prefetch(state->memtuples + 3 +
> readptr->current) and the corresponding version for tuplesort in the
> callsites?

Oh, I see. Maybe I'll do it that way when I pick this up in a little
while. I need to consider the "no tuple proper" (pass-by-value datum
sort) case within tuplesort.c, where stup.tuple is always NULL.
Currently, we discriminate against such SortTuples for various other
reasons (e.g. for memory accounting) by testing if the pointer is set,
so I suppose I copied that pattern here.
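A hedged sketch of what that discrimination might look like on the tuplesort side (field names abbreviated; the real SortTuple carries more state): when the look-ahead element's tuple pointer is NULL, as in a pass-by-value datum sort, there is simply nothing worth prefetching:

```c
#include <assert.h>
#include <stddef.h>

typedef struct
{
    long    datum1;     /* inline value; all there is for by-value datums */
    void   *tuple;      /* out-of-line "tuple proper", or NULL */
} DemoSortTuple;

static long
read_current(const DemoSortTuple *memtuples, int memtupcount, int current)
{
#ifdef __GNUC__
    /*
     * Test the pointer before prefetching, in the same style as the
     * existing memory-accounting checks: a NULL stup.tuple means a
     * pass-by-value datum sort, where a prefetch would be pure overhead.
     */
    if (current + 3 < memtupcount && memtuples[current + 3].tuple != NULL)
        __builtin_prefetch(memtuples[current + 3].tuple, 0, 0);
#endif
    return memtuples[current].datum1;
}
```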

-- 
Peter Geoghegan



Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

From
Peter Geoghegan
Date:
On Wed, Sep 2, 2015 at 12:24 PM, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Sep 2, 2015 at 7:12 AM, Andres Freund <andres@anarazel.de> wrote:
>> On 2015-07-19 16:34:52 -0700, Peter Geoghegan wrote:
>> Hm. Does a compiler test actually test anything reliably here? Won't this
>> just throw a warning at compile time about an unknown function?
>
> I'll need to look into that.

My error here was not considering why the existing __builtin_bswap32()
test worked alright with AC_COMPILE_IFELSE(), while my test needed
AC_LINK_IFELSE(). I'll be sure to make a point of manually testing
failure (not just success) when writing compiler tests in the future.

AC_LINK_IFELSE() appears to work fine following a straight
substitution. When I get around to revising this patch in a while,
that fix will be included.

Thanks
-- 
Peter Geoghegan



Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

From
Andres Freund
Date:
On 2015-09-02 17:14:12 -0700, Peter Geoghegan wrote:
> On Wed, Sep 2, 2015 at 12:24 PM, Peter Geoghegan <pg@heroku.com> wrote:
> > On Wed, Sep 2, 2015 at 7:12 AM, Andres Freund <andres@anarazel.de> wrote:
> >> On 2015-07-19 16:34:52 -0700, Peter Geoghegan wrote:
> >> Hm. Does a compiler test actually test anything reliably here? Won't this
> >> just throw a warning at compile time about an unknown function?
> >
> > I'll need to look into that.
> 
> My error here was not considering why the existing __builtin_bswap32()
> test worked alright with AC_COMPILE_IFELSE(), while my test needed
> AC_LINK_IFELSE(). I'll be sure to make a point of manually testing
> failure (not just success) when writing compiler tests in the future.
> 
> AC_LINK_IFELSE() appears to work fine following a straight
> substitution. When I get around to revising this patch in a while,
> that fix will be included.

Be sure to also test with optimizations enabled - often tests can get
optimized away if you don't force the result to be used (or volatile) :/



On 3 September 2015 at 07:24, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Sep 2, 2015 at 7:12 AM, Andres Freund <andres@anarazel.de> wrote:
> > What worries me about adding explicit prefetching is that it's *highly*
> > platform and even micro-architecture dependent. Why is looking three
> > elements ahead the right value?
>
> Because that was the fastest value following testing on my laptop. You
> are absolutely right to point out that this isn't a good reason to go
> with the patch -- I share your concern. All I can say in defense of
> that is that other major system software does the same, without any
> regard for the underlying microarchitecture AFAICT. So, yes,
> certainly, more testing is required across a reasonable cross-section
> of platforms to justify the patch.

FWIW someone else found 3 to be good on the platform they tested on:

Peter, would you be able to share the test case which you saw the speedup on. So far I've been unable to see much of an improvement.

Regards

David Rowley

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
 

Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

From
Peter Geoghegan
Date:
On Wed, Sep 2, 2015 at 9:43 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> Peter, would you be able to share the test case which you saw the speedup
> on. So far I've been unable to see much of an improvement.

The case I tested was an internal sort CREATE INDEX. I don't recall
the exact details, but testing showed it to be a fairly robust
speedup. It was not a very brief CREATE INDEX operation, or a very
lengthy one. trace_sort output made it quite visible that there was
a significant saving after the sort is "performed", but before it is
"done". It wasn't hard to see an improvement on a variety of other
cases, although the Intel vTune tool made the difference particularly
obvious.

The only thing that definitely won't be helped is pass-by-value datum
sort cases. In case it matters, I used GCC 4.8.

-- 
Peter Geoghegan



On 3 September 2015 at 16:50, Peter Geoghegan <pg@heroku.com> wrote:
> On Wed, Sep 2, 2015 at 9:43 PM, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
> > Peter, would you be able to share the test case which you saw the speedup
> > on. So far I've been unable to see much of an improvement.
>
> The case I tested was an internal sort CREATE INDEX. I don't recall
> the exact details, but testing showed it to be a fairly robust
> speedup. It was not a very brief CREATE INDEX operation, or a very
> lengthy one. trace_sort output made it quite visible that there was
> a significant saving after the sort is "performed", but before it is
> "done". It wasn't hard to see an improvement on a variety of other
> cases, although the Intel vTune tool made the difference particularly
> obvious.
>
> The only thing that definitely won't be helped is pass-by-value datum
> sort cases. In case it matters, I used GCC 4.8.

My test cases are:

set work_mem ='1GB';
create table t1 as select md5(random()::text) from generate_series(1,10000000);

Times are in milliseconds. Median and average over 10 runs.

Test 1
select count(distinct md5) from t1;

Master Patched
Median 10,965.77 10,986.30 (99.81%)
Average 10,983.63 11,013.55 (99.73%)

Test 2
select sum(rn) from (select row_number() over (order by md5) rn from t1) a;

Master Patched
Median 12,499.03 12,465.21 (100.27%)
Average 12,504.87 12,468.91 (100.29%)

Test 3
create index t1_md5_idx on t1(md5);
Master Patched
Median 12,981.47 12,888.11 (100.72%)
Average 13,416.23 13,249.32 (101.26%)

gcc version 4.8.3
Intel(R) Xeon(R) CPU E5-2630 v3 

Are you seeing any speedup from any of these on your hardware?

Regards

David Rowley

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
 

Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

From
Peter Geoghegan
Date:
On Thu, Sep 3, 2015 at 5:35 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> My test cases are:

Note that my text caching and unsigned integer comparison patches have
moved the baseline down quite noticeably. I think that my mobile
processor out-performs the Xeon you used for this, which seems a
little odd even taking the change in baseline performance into account.

> set work_mem ='1GB';
> create table t1 as select md5(random()::text) from
> generate_series(1,10000000);
>
> Times are in milliseconds. Median and average over 10 runs.
>
> Test 1
> select count(distinct md5) from t1;
>
> Master Patched
> Median 10,965.77 10,986.30 (99.81%)
> Average 10,983.63 11,013.55 (99.73%)

> Are you seeing any speedup from any of these on your hardware?

I gather that 10,965.77 here means 10,965 milliseconds, since that
roughly matches what I get.

For the sake of simplicity, I will focus on your test 1 as a baseline.
Note that I ran VACUUM FREEZE before any testing was performed, just
in case.

On my laptop:

postgres=# \dt+ t1
                  List of relations
 Schema | Name | Type  | Owner |  Size  | Description
--------+------+-------+-------+--------+-------------
 public | t1   | table | pg    | 651 MB |
(1 row)

Times for "Test 1", "select count(distinct md5) from t1":

Patch:

Time: 10076.870 ms
Time: 10094.873 ms
Time: 10125.253 ms  <-- median
Time: 10222.042 ms
Time: 10269.247 ms

Master:

Time: 10641.142 ms
Time: 10706.181 ms
Time: 10708.860 ms < -- median
Time: 10725.426 ms
Time: 10781.398 ms

So, to answer your question: Yes, I can see a benefit for this query
on my test hardware (laptop), although it is not spectacular. It may
still be quite worthwhile.

I attach a revised version of the patch tested here, following
feedback from Andres. This should not make any difference to the
performance.

It's worth considering that for some (possibly legitimate) reason, the
built-in function call is ignored by your compiler, since GCC has
license to do that. You might try this on both master and patched
builds:

~/postgresql/src/backend/utils/sort$ gdb -batch -ex 'file tuplesort.o'
-ex 'disassemble tuplesort_gettuple_common' > prefetch_disassembly.txt

Then diff the file prefetch_disassembly.txt for each build to see what
the differences are in practice. Consider an extract of the output on
my system:

...
   0x00000000000028ee <+926>: callq  0x28f3 <tuplesort_gettuple_common+931>
   0x00000000000028f3 <+931>: nopl   0x0(%rax,%rax,1)
   0x00000000000028f8 <+936>: sub    $0x1,%eax
   0x00000000000028fb <+939>: test   %eax,%eax
   0x00000000000028fd <+941>: mov    %eax,0xd0(%rdi)
   0x0000000000002903 <+947>: jne    0x25ce <tuplesort_gettuple_common+126>
   0x0000000000002909 <+953>: jmpq   0x2710 <tuplesort_gettuple_common+448>
   0x000000000000290e <+958>: xchg   %ax,%ax
   0x0000000000002910 <+960>: mov    0x58(%rdi),%rsi
   0x0000000000002914 <+964>: lea    (%rax,%rax,2),%rax
   0x0000000000002918 <+968>: lea    (%rsi,%rax,8),%rax
   0x000000000000291c <+972>: mov    0x30(%rax),%rax
   0x0000000000002920 <+976>: prefetchnta (%rax)
   0x0000000000002923 <+979>: mov    $0x1,%eax
   0x0000000000002928 <+984>: jmpq   0x2712 <tuplesort_gettuple_common+450>
   0x000000000000292d <+989>: nopl   (%rax)
...

Notably, there is a prefetchnta instruction here.

Note that I'm going away on vacation in about a week. I wanted to give
people feedback on various things before then, since it was overdue.
FYI, after Thursday I will be very unlikely to answer e-mail for a
couple of weeks.

--
Peter Geoghegan

On 10/11/2015 03:20 AM, Peter Geoghegan wrote:
> On Thu, Sep 3, 2015 at 5:35 PM, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
>> My test cases are:
> 
> Note that my text caching and unsigned integer comparison patches have
> moved the baseline down quite noticeably. I think that my mobile
> processor out-performs the Xeon you used for this, which seems a
> little odd even taking the change in baseline performance into account.
> 

To add a caveat not yet mentioned: the idea behind prefetching is to
sacrifice spare memory bandwidth for performance. That can be a winning
bet on a quiet box (the one we benchmark on), but it can backfire on a
production database, where the extra memory pressure can degrade all
running queries. Something to test for, or at least keep in mind.

>> set work_mem ='1GB';
>> create table t1 as select md5(random()::text) from
>> generate_series(1,10000000);
>>
>> Times are in milliseconds. Median and average over 10 runs.
>>
>> Test 1
>

I am the reluctant owner of outmoded hardware: a Core 2 from around
2007 with plain spinning metal. My results (Linux, 64-bit):

------
Test 1
------
set work_mem ='1GB';
select count(distinct md5) from t1;

== Master ==
42771.040 ms <- outlier?
41704.570 ms
41631.660 ms
41421.877 ms

== Patch ==
42469.911 ms  <- outlier?
41378.556 ms
41375.870 ms
41118.105 ms
41096.283 ms
41095.705 ms

------
Test 2
------
select sum(rn) from (select row_number() over (order by md5) rn from t1) a;

== Master ==
44186.775 ms
44137.154 ms
44111.271 ms
44061.896 ms
44109.122 ms

== Patch ==
44592.590 ms
44403.076 ms
44276.170 ms


A very slight difference, in an ambiguous direction, but also no
performance catastrophe.

> It's worth considering that for some (possibly legitimate) reason, the
> built-in function call is ignored by your compiler, since GCC has
> license to do that. You might try this on both master and patched
> builds:
> 
> ~/postgresql/src/backend/utils/sort$ gdb -batch -ex 'file tuplesort.o'
> -ex 'disassemble tuplesort_gettuple_common' > prefetch_disassembly.txt
> 
> ...
> 
> Notably, there is a prefetchnta instruction here.
> 

I have verified that the prefetch instruction is emitted in the disassembly.

An added benefit of owning outmoded hardware is that the MSR layout for
this generation is public, and I can disable individual prefetcher units
by twiddling a bit.

Disabling the "HW prefetch" or the "DCU prefetch" units on a patched
version gave results that look relatively unchanged, which seems promising.
Disabling them both at once on an unpatched version shows a slowdown of
5-6% in test 1 (43347.181, 43898.705, 43399.428 ms). That gives an
indication of the maximum potential gain in this direction, for this box
at least.

Finally, I notice my results are 4x slower than everyone else's.
That can be very tough on a man's pride, let me tell you.

Amir




On 11 October 2015 at 13:20, Peter Geoghegan <pg@heroku.com> wrote:
It's worth considering that for some (possibly legitimate) reason, the
built-in function call is ignored by your compiler, since GCC has
license to do that. You might try this on both master and patched
builds:


You're right: gcc did not include the prefetch instructions.
I've tested again on the same machine, but with clang 3.7 instead of gcc 4.8.3.

I've conducted the same tests again. All times are in milliseconds. Results are the average and median over 10 runs.

set work_mem ='1GB';
create table t1 as select md5(random()::text) from generate_series(1,10000000);
vacuum freeze t1;

Running 1 query at a time the results are as follows:

Test1: select count(distinct md5) from t1;

            Master      Patched     Gain
Average     10853.679   10132.544   107.12%
Median      10754.193   10005.001   107.49%

Test2: select sum(rn) from (select row_number() over (order by md5) rn from t1) a;
            Master       Patched      Gain
Average     11495.8703   11475.0081   100.18%
Median      11495.6015   11455.944    100.35%

Test3: create index t1_md5_idx on t1(md5);
            Master       Patched      Gain
Average     36464.4632   37830.3879   96.39%
Median      35946.608    36765.0055   97.77%

I also decided to run multiple queries at once, to see if there was any cache pollution problems with the prefetching.

Test 1 pgbench -T 600 -c 16 -j 16 -f test1.sql -n
Test 2 pgbench -T 600 -c 16 -j 16 -f test2.sql -n


(tps output from pgbench was converted to milliseconds with 1/TPS*1000)
            Master     Patched    Gain
Test 1      1375.413   1358.494   101.25%
Test 2      1594.753   1588.340   100.40%


CPU: 1 x  Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz

I've attached a spreadsheet with all of the results.


--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

From
Peter Geoghegan
Date:
On Sun, Nov 29, 2015 at 8:52 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
> You're right, gcc did not include the prefetch instructions.
> I've tested again on the same machine but with clang 3.7 instead of gcc
> 4.8.3

Thanks for going to the trouble of investigating this.

These results are mediocre -- in general, it seems like the prefetch
instructions are almost as likely to hurt as to help. Naturally, I
don't want to go down the road of considering every microarchitecture,
and that seems to be what it would take to get this to work well, if
that's possible at all.

I'm currently running some benchmarks on my external sorting patch on
the POWER7 machine that Robert Haas and a few other people have been
using for some time now [1]. So far, the benchmarks look very good
across a variety of scales.

I'll run a round of tests without the prefetching enabled (which the
patch series makes further use of -- they're also used when writing
tuples out). If there is no significant impact, I'll completely
abandon this patch, and we can move on.

[1] http://rhaas.blogspot.com/2012/03/performance-and-scalability-on-ibm.html
-- 
Peter Geoghegan



Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

From
Peter Geoghegan
Date:
On Sun, Nov 29, 2015 at 10:14 PM, Peter Geoghegan <pg@heroku.com> wrote:
> I'm currently running some benchmarks on my external sorting patch on
> the POWER7 machine that Robert Haas and a few other people have been
> using for some time now [1]. So far, the benchmarks look very good
> across a variety of scales.
>
> I'll run a round of tests without the prefetching enabled (which the
> patch series makes further use of -- they're also used when writing
> tuples out). If there is no significant impact, I'll completely
> abandon this patch, and we can move on.

I took a look at this. It turns out prefetching significantly helps on
the POWER7 system, when sorting gensort tables of 50 million, 100
million, 250 million, and 500 million tuples (3 CREATE INDEX tests for
each case, 1GB maintenance_work_mem):

[pg@hydra gensort]$ cat test_output_patch_1gb.txt | grep "sort ended"
LOG:  external sort ended, 171063 disk blocks used: CPU 4.33s/71.28u
sec elapsed 75.75 sec
LOG:  external sort ended, 171063 disk blocks used: CPU 4.30s/71.32u
sec elapsed 75.91 sec
LOG:  external sort ended, 171063 disk blocks used: CPU 4.29s/71.34u
sec elapsed 75.69 sec
LOG:  external sort ended, 342124 disk blocks used: CPU 8.10s/165.56u
sec elapsed 174.35 sec
LOG:  external sort ended, 342124 disk blocks used: CPU 8.07s/165.15u
sec elapsed 173.70 sec
LOG:  external sort ended, 342124 disk blocks used: CPU 8.01s/164.73u
sec elapsed 174.84 sec
LOG:  external sort ended, 855306 disk blocks used: CPU 23.65s/491.37u
sec elapsed 522.44 sec
LOG:  external sort ended, 855306 disk blocks used: CPU 21.13s/508.02u
sec elapsed 530.48 sec
LOG:  external sort ended, 855306 disk blocks used: CPU 22.63s/475.33u
sec elapsed 499.09 sec
LOG:  external sort ended, 1710613 disk blocks used: CPU
47.99s/1016.78u sec elapsed 1074.55 sec
LOG:  external sort ended, 1710613 disk blocks used: CPU
46.52s/1015.25u sec elapsed 1078.23 sec
LOG:  external sort ended, 1710613 disk blocks used: CPU
44.34s/1013.26u sec elapsed 1067.16 sec
[pg@hydra gensort]$ cat test_output_patch_noprefetch_1gb.txt | grep "sort ended"
LOG:  external sort ended, 171063 disk blocks used: CPU 4.79s/78.14u
sec elapsed 83.03 sec
LOG:  external sort ended, 171063 disk blocks used: CPU 3.85s/77.71u
sec elapsed 81.64 sec
LOG:  external sort ended, 171063 disk blocks used: CPU 3.94s/77.71u
sec elapsed 81.71 sec
LOG:  external sort ended, 342124 disk blocks used: CPU 8.88s/180.15u
sec elapsed 189.69 sec
LOG:  external sort ended, 342124 disk blocks used: CPU 8.30s/179.07u
sec elapsed 187.92 sec
LOG:  external sort ended, 342124 disk blocks used: CPU 8.29s/179.06u
sec elapsed 188.02 sec
LOG:  external sort ended, 855306 disk blocks used: CPU 22.16s/516.86u
sec elapsed 541.35 sec
LOG:  external sort ended, 855306 disk blocks used: CPU 21.66s/513.59u
sec elapsed 538.00 sec
LOG:  external sort ended, 855306 disk blocks used: CPU 22.56s/499.63u
sec elapsed 525.53 sec
LOG:  external sort ended, 1710613 disk blocks used: CPU
45.00s/1062.26u sec elapsed 1118.52 sec
LOG:  external sort ended, 1710613 disk blocks used: CPU
44.42s/1061.33u sec elapsed 1117.27 sec
LOG:  external sort ended, 1710613 disk blocks used: CPU
44.47s/1064.93u sec elapsed 1118.79 sec

For example, the 50 million tuple test has over 8% of its runtime
shaved off. This seems to be a consistent pattern.

Note that only the writing of tuples uses prefetching here, because
that happens to be the only affected codepath for prefetching (note
also that this is the slightly different, external-specific version of
the patch). I hesitate to give that up, although it is noticeable that
it matters less at higher scales, where we're bottlenecked on
quicksorting itself, more so than writing. Those costs grow at
different rates, of course.

Perhaps we can consider applying prefetching more selectively, in the
context of writing out tuples. After all, the amount of useful work
that we can do while a fetch from memory is pending ought to be more
consistent and manageable there, which could make it a consistent win.
I will need to think about this some more.

-- 
Peter Geoghegan



Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

From
Peter Geoghegan
Date:
On Mon, Nov 30, 2015 at 1:04 PM, Peter Geoghegan <pg@heroku.com> wrote:
> For example, the 50 million tuple test has over 8% of its runtime
> shaved off. This seems to be a consistent pattern.

I included the nitty-gritty details of this case in something attached
to a mail I just sent, over in the "Quicksort for every external sort
run" thread, FWIW.

-- 
Peter Geoghegan



Re: Memory prefetching while sequentially fetching from SortTuple array, tuplestore

From
Peter Geoghegan
Date:
On Mon, Nov 30, 2015 at 1:04 PM, Peter Geoghegan <pg@heroku.com> wrote:
> Perhaps we can consider more selectively applying prefetching in the
> context of writing out tuples.

It may still be worth selectively applying these techniques to writing
out tuples, per my previous remarks (and the latest "Using quicksort
for every external sort run" revision, which has a specialized version
of this patch). I see real benefits there across a variety of
situations on different hardware, and I'm hesitant to give that up
without further analysis.

However, I think that this patch is over as an independent piece of
work -- there is too much of a mixed picture. I'm going to mark this
patch "returned with feedback".

-- 
Peter Geoghegan