Re: asynchronous and vectorized execution - Mailing list pgsql-hackers

From Ants Aasma
Subject Re: asynchronous and vectorized execution
Date
Msg-id CA+CSw_vXuJpevKDKdd6LhCeupXUjF18itJ7SQEtcm8Fj_DpCzQ@mail.gmail.com
In response to asynchronous and vectorized execution  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: asynchronous and vectorized execution
List pgsql-hackers
On Tue, May 10, 2016 at 7:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, May 9, 2016 at 8:34 PM, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
> I don't have any at the moment, but I'm not keen on hundreds of new
> vector functions that can all have bugs or behavior differences versus
> the unvectorized versions of the same code.  That's a substantial tax
> on future development.  I think it's important to understand what
> sorts of queries we are targeting here.  KaiGai's GPU-acceleration
> stuff does great on queries with complex WHERE clauses, but most
> people don't care not only because it's out-of-core but because who
> actually looks for the records where (a + b) % c > (d + e) * f / g?
> This seems like it has the same issue.  If we can speed up common
> queries people are actually likely to run, OK, that's interesting.

I have seen pretty complex expressions in projections and
aggregations: a couple dozen SUM(CASE WHEN a THEN b*c ELSE
LEAST(d,e)*f END) type expressions. In critical places I had to
replace them with a C-coded function that processed a row at a time
to avoid the executor dispatch overhead.
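To illustrate what that hand-fused replacement looks like, here is a
sketch of a row-at-a-time C function for such an aggregate. The column
layout and names are hypothetical, not actual executor code; the point
is that the whole expression costs one call per batch instead of
several fmgr dispatches per row.

```c
#include <stddef.h>

/* Hypothetical columns, already deformed into plain C arrays. */
typedef struct
{
    int    *a;                  /* boolean condition, 0 or 1 */
    double *b, *c, *d, *e, *f;
    size_t  n;
} Columns;

/*
 * SUM(CASE WHEN a THEN b*c ELSE LEAST(d,e)*f END), fused into a
 * single loop with no per-operator function calls.
 */
static double
sum_case_fused(const Columns *col)
{
    double      sum = 0.0;

    for (size_t i = 0; i < col->n; i++)
    {
        if (col->a[i])
            sum += col->b[i] * col->c[i];
        else
        {
            double  m = col->d[i] < col->e[i] ? col->d[i] : col->e[i];

            sum += m * col->f[i];
        }
    }
    return sum;
}
```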

> By the way, I think KaiGai's GPU-acceleration stuff points to another
> pitfall here.  There's other stuff somebody might legitimately want to
> do that requires another copy of each function. For example, run-time
> code generation likely needs that (a function to tell the code
> generator what to generate for each of our functions), and
> GPU-acceleration probably does, too.  If fixing a bug in numeric_lt
> requires changing not only the regular version and the vectorized
> version but also the GPU-accelerated version and the codegen version,
> Tom and Dean are going to kill us.  And justifiably so!  Granted,
> nobody is proposing those other features in core right now, but
> they're totally reasonable things to want to do.

My thoughts in this area have been circling around getting LLVM to do
the heavy lifting. LLVM/clang could compile the existing C functions
to IR and bundle that with the database. At query planning time, or
maybe even during execution, the functions could be inlined into the
compiled query plan; LLVM can then be coaxed to copy-propagate,
constant-fold and dead-code-eliminate the bejeezus out of the
expression tree. This way duplication of specialized code is kept to
a minimum, while at least the common cases completely avoid the fmgr
overhead.
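As a toy illustration of what that inlining buys, consider a greatly
simplified fmgr-style dispatch (hypothetical types and names, not the
real fmgr API). The generic path goes through an opaque function
pointer, so nothing can be optimized across the call; once the callee
is inlined from IR, the boxing and the indirect call fold away.

```c
#include <stddef.h>

/* Hypothetical, greatly simplified Datum and fmgr-style call. */
typedef struct { long value; int isnull; } Datum_;
typedef Datum_ (*PGFunction_) (Datum_ *args);

/* int8 addition behind the generic interface. */
static Datum_
int8pl_(Datum_ *args)
{
    Datum_      r = { args[0].value + args[1].value, 0 };

    return r;
}

/*
 * Generic path: an indirect call per operator.  The compiler cannot
 * see through the function pointer, so the Datum boxing survives.
 */
static long
eval_generic(PGFunction_ fn, long x, long y)
{
    Datum_      args[2] = { { x, 0 }, { y, 0 } };

    return fn(args).value;
}

/*
 * What is left after LLVM inlines int8pl_ into the compiled
 * expression: boxing, null flags and the indirect call all
 * constant-fold and dead-code-eliminate away.
 */
static long
eval_specialized(long x, long y)
{
    return x + y;
}
```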

This approach would also mesh fine with batching. Given suitably
regular data structures and simple functions, LLVM will be able to
vectorize the code. If not, it will still end up with a nice tight
loop that is an order of magnitude or two faster than the current
executor.
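The "suitably regular data structures" part matters: given flat,
non-aliasing arrays and a branch-free loop body, LLVM's loop
vectorizer can emit SIMD with no source changes at all. A minimal
sketch (hypothetical column layout):

```c
#include <stddef.h>

/*
 * out[i] = a[i] * b[i] over a column batch.  The restrict qualifiers
 * rule out aliasing and the body has no control flow, so the
 * auto-vectorizer can turn this into packed SIMD multiplies.
 */
static void
mul_columns(const double *restrict a, const double *restrict b,
            double *restrict out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * b[i];
}
```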

The first cut could take care of ExecQual, ExecTargetList and
friends. Later improvements could let execution nodes provide basic
blocks that would then be threaded together into the main execution
loop. If a node does not implement the basic-block interface, a
default implementation is used that calls the current interface. It
gets a bit handwavy at this point, but the main idea would be to
enable data marshaling so that values can be routed directly to the
code that needs them without being written to intermediate storage.
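A very rough sketch of what the threaded result might look like
(entirely hypothetical, no real executor interfaces): instead of each
node pulling tuples through a generic call and a slot, the nodes'
basic blocks fuse into one loop, with each value routed through a
register rather than intermediate storage.

```c
#include <stddef.h>

/*
 * Fused form of a scan -> filter (x > threshold) -> SUM(x * 2)
 * pipeline.  Each stage's basic block is threaded into a single
 * loop; x never touches a tuple slot between stages.
 */
static double
fused_pipeline(const double *x, size_t n, double threshold)
{
    double      sum = 0.0;

    for (size_t i = 0; i < n; i++)      /* scan */
    {
        if (x[i] > threshold)           /* qual */
            sum += x[i] * 2.0;          /* projection + aggregation */
    }
    return sum;
}
```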

> I suspect the number of queries that are being hurt by fmgr overhead
> is really large, and I think it would be nice to attack that problem
> more directly.  It's a bit hard to discuss what's worthwhile in the
> abstract, without performance numbers, but when you vectorize, how
> much is the benefit from using SIMD instructions and how much is the
> benefit from just not going through the fmgr every time?

My feeling is the same: fmgr overhead, data marshaling and dynamic
dispatch through the executor are the big issues. This is
corroborated by what I have seen reported from other VM
implementations. Once you get the data into a uniform format where
vectorized execution can be used, CPU execution resources are no
longer the bottleneck. Memory bandwidth gets in the way, unless each
input value is used in multiple calculations. And even then, we are
looking at a 4x speedup at best.
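To make the bandwidth ceiling concrete, here is the back-of-envelope
arithmetic (assumed round numbers, not measurements): at 20 GB/s of
memory bandwidth and 8 bytes per streamed value, no amount of SIMD
pushes you past 2.5 billion values per second.

```c
/*
 * Throughput ceiling once execution overhead is gone: rows per
 * second is bounded by bandwidth divided by bytes touched per row.
 * Inputs are assumed round numbers for illustration.
 */
static double
max_rows_per_sec(double bandwidth_gb_s, double bytes_per_row)
{
    return bandwidth_gb_s * 1e9 / bytes_per_row;
}
```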

Regards,
Ants Aasma
