Thread: Improving executor performance

Improving executor performance

From

Andres Freund

Date:

24 June 2016, 23:30:02

Hi,

at pgcon, in [1], and various other threads and conferences we talked
about our executor performance needing improvements to perform well in
!OLTP workloads, and how we can get there.

I've played with various techniques, on and off over the last few
weeks. Including trying to make slot_deform_tuple et al. faster, batched
execution for a few node types (SeqScan, hashed Agg), avoiding linked
lists in executor datastructures, and smaller micro optimizations.

As a motivation, here's a somewhat juicy example of the benefits that
can be gained (disabled parallelism, results vary too much):
SELECT l_linestatus, SUM(l_quantity) FROM lineitem GROUP BY 1 ORDER BY 2 DESC LIMIT 10 ;
non-optimized: 9075.835 ms optimized: 5194.024 ms

and that's far from doing all the possible work for that query;
especially the batching is far from optimal.

So far I've mostly looked at performance for tpc-h, not because it's a
particularly good workload, but because it's available. If somebody has
good suggestions for an analytics heavy workload...

My observations about the performance bottlenecks I found, and partially
addressed, are in rough order of importance (there's interdependence
between most of them):

1) Cache misses cost us a lot, doing more predictable accesses can make a huge difference. Without addressing that,
manyother bottlenecks don't matter all that much. I see *significant* performance improvements for large seqscans
(whenthe results are used) simply memcpy()'ing the current target block.

This partially is an intrinsic problem of analyzing a lot of data, and partially because our access patterns are
bad.

2) Baring 1) tuple deforming is the biggest bottleneck in nearly all queries I looked at. There's various places we
triggerdeforming, most ending in either slot_deform_tuple(), heap_getattr(), heap_deform_tuple().

This can be optimized significantly, but even after that it's still a major issue.
Part of the problem is that we sometimes don't know how many elements to access (especially in index and hashing
relatedcode), the other part that we're deforming a lot of columns that we'll never need, just because we need a
subsequentone.

The other part is that our tuple format requires a number of relatively expensive operations to access data (e.g.
alignment computations, checking the null bitmap).

3) Our 1-by-1 tuple flow in the executor has two major issues:
The biggest is that we perform a *lot* of repetitive work for each processed tuple. E.g. looking at nodeAgg.c's
agg_fill_hash_table(), for each single tuple we execute ExecProcNode(), ExecSeqScan(), heap_getnext(),
lookup_hash_entry(),LookupTupleHashEntry(), ... and many more. Each of these has a noticeable per-call state setup.
Whenexecuting on batches of tuples, a lot of that setup work can be done once per batch, instead of once per tuple.
Partof that the compiler can do automatically, part of that has to be done by hand.

Also very significantly, due to the amount of code executed, there's barely any working branch prediction, leading
tomassive amounts of pipeline stalls and instruction cache misses.

There's also the fact that some future optimizations around using SIMD are easier when looking at batches of tuples,
butI'm not planning to work on that.

4) Various missing micro optimizations have to be performed, for more architectural issues to become visible. E.g. [2]
causessuch bad slowdowns in hash-agg workloads, that other bottlenecks are hidden.

5) Per-tuple heap_getnext() makes it hard to fix 3), and has similar issues. Using a per-block heap_getmany() that
fillsa batch at once is noticeably faster (and more convenient in a batched world anyway)

6) The use of linked lists adds noticeable #instructions overhead and branch misses. Just replacing
{FuncExprState,BoolExprState}->args with an array gives a ~10%ish boost for queries with a bunch of quals. Another
examplethat appears to hurts noticeably, but which I've not experimentally fixed, is AggState->hash_needed.

To a large degree these seem fairly easily fixable; it's kinda boring work, but not really hard.

I'm planning to start individual subthreads for some of these, with
in-detail discussions and/or patches. But I wanted to present this on a
higher level once, before spending even more time.

Have I missed concrete issues others noticed? Missed significant
improvements (please, please, only realistic stuff)?

Unfortunately these items are heavily interdependent. For some queries
you'll need issue n) addressed before n+2) shows a benefit, for some
it's the other way, and such. Given the size of the overall task, the
only way I can see this being reasonably addressable, is trying to get
smaller steps in individually, even if not a huge win on their own. And
then focus on the the larger architectural changes (primarily 3))
issues.

Comments so far? Different suggestions on how to get improvements
around this merged?

Greetings,

Andres Freund

[1] http://archives.postgresql.org/message-id/CA%2BTgmobx8su_bYtAa3DgrqB%2BR7xZG6kHRj0ccMUUshKAQVftww%40mail.gmail.com
[2] https://www.postgresql.org/message-id/20160622002607.mbsfsm42cxqomi4d%40alap3.anarazel.de

Thread: Improving executor performance

Attachment

Attachment