Re: asynchronous and vectorized execution - Mailing list pgsql-hackers

From: Ants Aasma
Subject: Re: asynchronous and vectorized execution
Msg-id: CA+CSw_vXuJpevKDKdd6LhCeupXUjF18itJ7SQEtcm8Fj_DpCzQ@mail.gmail.com
In response to: asynchronous and vectorized execution (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On Tue, May 10, 2016 at 7:56 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, May 9, 2016 at 8:34 PM, David Rowley
> <david.rowley@2ndquadrant.com> wrote:
> I don't have any at the moment, but I'm not keen on hundreds of new
> vector functions that can all have bugs or behavior differences versus
> the unvectorized versions of the same code. That's a substantial tax
> on future development. I think it's important to understand what
> sorts of queries we are targeting here. KaiGai's GPU-acceleration
> stuff does great on queries with complex WHERE clauses, but most
> people don't care not only because it's out-of-core but because who
> actually looks for the records where (a + b) % c > (d + e) * f / g?
> This seems like it has the same issue. If we can speed up common
> queries people are actually likely to run, OK, that's interesting.

I have seen pretty complex expressions in projections and aggregations: a couple dozen SUM(CASE WHEN a THEN b*c ELSE MIN(d,e)*f END) type expressions in one query. In critical places I have had to replace them with a C-coded function that processes a row at a time to avoid the executor dispatch overhead.

> By the way, I think KaiGai's GPU-acceleration stuff points to another
> pitfall here. There's other stuff somebody might legitimately want to
> do that requires another copy of each function. For example, run-time
> code generation likely needs that (a function to tell the code
> generator what to generate for each of our functions), and
> GPU-acceleration probably does, too. If fixing a bug in numeric_lt
> requires changing not only the regular version and the vectorized
> version but also the GPU-accelerated version and the codegen version,
> Tom and Dean are going to kill us. And justifiably so! Granted,
> nobody is proposing those other features in core right now, but
> they're totally reasonable things to want to do.

My thoughts in this area have been circling around getting LLVM to do the heavy lifting.
LLVM/clang could compile existing C functions to IR and bundle those with the database. At query planning time, or maybe even during execution, the functions can be inlined into the compiled query plan; LLVM can then be coaxed to copy-propagate, constant-fold and dead-code-eliminate the bejeezus out of the expression tree. This way duplication of the specialized code is kept to a minimum, while at least the common cases can completely avoid the fmgr overhead.

This approach would also mesh fine with batching. Given suitably regular data structures and simple functions, LLVM will be able to vectorize the code. If not, it will still end up with a nice tight loop that is an order of magnitude or two faster than the current executor.

The first cut could take care of ExecQual, ExecTargetList and friends. Later improvements could let execution nodes provide basic blocks that would then be threaded together into the main execution loop. If some node does not implement the basic-block interface, a default implementation is used that calls the current interface. It gets a bit handwavy at this point, but the main idea would be to enable data marshaling so that values can be routed directly to the code that needs them without being written to intermediate storage.

> I suspect the number of queries that are being hurt by fmgr overhead
> is really large, and I think it would be nice to attack that problem
> more directly. It's a bit hard to discuss what's worthwhile in the
> abstract, without performance numbers, but when you vectorize, how
> much is the benefit from using SIMD instructions and how much is the
> benefit from just not going through the fmgr every time?

My feeling is the same: fmgr overhead, data marshaling, and dynamic dispatch through the executor are the big issue. This is corroborated by what other VM implementations have found.
Once you get the data into a uniform format where vectorized execution can be used, CPU execution resources are no longer the bottleneck. Memory bandwidth gets in the way, unless each input value is used in multiple calculations. And even then, we are looking at a 4x speedup at best.

Regards,
Ants Aasma