Memory prefetching while sequentially fetching from SortTuple array, tuplestore

From: Peter Geoghegan
Subject: Memory prefetching while sequentially fetching from SortTuple array, tuplestore
Date:
Msg-id: CAM3SWZRR-9XX3dk+G_B0U9qLgV9jory1YFWr7YmsAeF3feL_Fg@mail.gmail.com
I've always thought that GCC intrinsics for software-based memory
prefetching are a very sharp tool. It's really hard to use GCC's
__builtin_prefetch() effectively (I've tried enough times to know),
but I suspected there would end up being a handful of cases where
using it presented a clear and worthwhile win. Hash joins seemed to me
to be a likely candidate, because memory latency is a real killer
there. However, I stumbled upon another case where prefetching pays
clear dividends while being quite simple: sequentially reading an
already-sorted memtuples array (an array of SortTuple structs) during
some kind of post-processing.
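
For anyone who hasn't played with it, this is the whole interface: an
address to prefetch, plus two optional constant arguments. (The little
wrapper function here is only for illustration; it isn't anything from
the patch.)

    /*
     * GCC's __builtin_prefetch(): the second argument is 0 for a read
     * or 1 for a write, and the third is a temporal locality hint, from
     * 0 (no reuse expected, evict soon) to 3 (high reuse, keep in all
     * cache levels).  Both must be compile-time constants, and default
     * to 0 and 3 respectively when omitted.
     */
    static inline void
    prefetch_for_read(const void *addr)
    {
        __builtin_prefetch(addr, 0, 3);
    }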

This comes up in many common and important cases, including regular
sort nodes (heaptuple sorts), datum sorts with pass-by-reference types
(e.g. ordered set aggregates), and B-Tree index builds, where the
sequential fetching occurs as the actual B-Tree leaf pages are built
from the array.

Patch
=====

The attached patch adds a portable Postgres wrapper around the GCC
intrinsic. It also adds a client within tuplesort.c -- a common fetch
function that seems like a good generic choke point. By prefetching
the "tuple proper" a few SortTuples ahead, I can get a speed-up of 6% -
9% for all of these cases. (In-memory sorts only.)

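To make the shape of the change concrete, here is a rough sketch of
what I mean. It is not the patch itself: the macro name, the cut-down
struct, the distance of two entries ahead, and process_tuple() are all
simplified stand-ins.

    /* Portable wrapper: no-op on compilers without the intrinsic */
    #if defined(__GNUC__) || defined(__clang__)
    #define prefetch_read(addr)     __builtin_prefetch((addr), 0, 3)
    #else
    #define prefetch_read(addr)     ((void) (addr))
    #endif

    typedef struct
    {
        void       *tuple;  /* the "tuple proper" -- usually a cache miss */
        /* comparison Datum, flags, etc. omitted */
    } SortTupleSketch;

    static void
    process_tuple(void *tup)
    {
        (void) tup;         /* stand-in for the caller's real work */
    }

    /*
     * Sequentially fetch from an already-sorted in-memory array, hinting
     * the tuple proper a couple of entries ahead so that its cache line
     * is (hopefully) resident by the time we actually dereference it.
     */
    static void
    fetch_all(SortTupleSketch *memtuples, int memtupcount)
    {
        int         i;

        for (i = 0; i < memtupcount; i++)
        {
            if (i + 2 < memtupcount)
                prefetch_read(memtuples[i + 2].tuple);

            process_tuple(memtuples[i].tuple);
        }
    }
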
ISTM that this alone makes the patch worthwhile, but consider the
breakdown of costs in each of these operations, and the fact that the
cost of sequentially reading and processing SortTuples is all that the
patch touches -- only about three real lines of code are added to
tuplesort.c. Obviously, the dominant costs within each of these
queries/utility commands are the sort proper and, to a lesser extent,
the heap scan that precedes it. I am not targeting those at all -- I'm
only targeting the tertiary cost of sequentially scanning the
memtuples array, by hiding memory latency, and yet I end up with a
notable overall speed-up.

It's only really fair to look at this final sequential fetching cost
in isolation, and the reduction in that cost in isolation seems to be
huge. I tried out the proprietary Intel VTune Amplifier utility on
this case, and it indicated that CPU time spent in calls to
_bt_buildadd() was less than 1/8 of what it was in a comparable test
of the master branch, with no other changes that I can attribute to
the patch (i.e. there is a bit of noise). You can easily get some hint
of the size of the improvement by reviewing "trace_sort = on" log
output, and the reduction in time spent after "performsort" is done
but before the sort is shut down.

Tuplestores
----------------

tuplestore.c has a similar optimization added, on the theory that it
should match tuplesort.c wherever possible, and because it seems to
help a lot there too. For example, queries that perform a big CTE Scan
seem to benefit to about the same degree (although, again, only when
the array is fully in memory).

Portability concerns
===============

I started writing this patch with an intuition that this would help,
and arrived at the version posted here by frobbing the code based on
feedback from Postgres running on my laptop. I iteratively tweaked how
many tuples ahead to prefetch, and the value of __builtin_prefetch()'s
temporal locality argument, each of which helped a little for a number
of test cases. If that doesn't inspire much confidence in this patch,
then good -- that's the idea. Obviously, if we are going to commit
this, there needs to be testing on a reasonably representative
selection of platforms.
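
To be concrete, those two knobs amount to something like the following
(the names and values here are purely illustrative of what I frobbed,
not a recommendation, and not necessarily what the attached patch
ends up using):

    #define PREFETCH_DISTANCE   2   /* how many SortTuples ahead to hint */
    #define PREFETCH_LOCALITY   3   /* __builtin_prefetch()'s third argument, 0..3 */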

On the other hand, I couldn't find a case that regressed, and I am
encouraged by how effectively this helps the post-processing of
SortTuples. I imagine a platform would have to have very different
performance characteristics for this to be the wrong thing to do.
Still, it remains possible that some case has regressed, since the
prefetch is added at a very generic point within tuplesort.c and
tuplestore.c.
--
Peter Geoghegan
