Thread: Improving executor performance
Hi,

At pgcon, in [1], and various other threads and conferences we talked about our executor performance needing improvements to perform well in !OLTP workloads, and how we can get there. I've played with various techniques on and off over the last few weeks, including trying to make slot_deform_tuple et al. faster, batched execution for a few node types (SeqScan, hashed Agg), avoiding linked lists in executor datastructures, and smaller micro optimizations.

As a motivation, here's a somewhat juicy example of the benefits that can be gained (disabled parallelism, results vary too much):

SELECT l_linestatus, SUM(l_quantity) FROM lineitem GROUP BY 1 ORDER BY 2 DESC LIMIT 10;

non-optimized: 9075.835 ms
optimized: 5194.024 ms

and that's far from doing all the possible work for that query; especially the batching is far from optimal.

So far I've mostly looked at performance for tpc-h, not because it's a particularly good workload, but because it's available. If somebody has good suggestions for an analytics heavy workload...

My observations about the performance bottlenecks I found, and partially addressed, are in rough order of importance (there's interdependence between most of them):

1) Cache misses cost us a lot; doing more predictable accesses can make a huge difference. Without addressing that, many other bottlenecks don't matter all that much. I see *significant* performance improvements for large seqscans (when the results are used) simply by memcpy()'ing the current target block.

   This is partially an intrinsic problem of analyzing a lot of data, and partially because our access patterns are bad.

2) Barring 1), tuple deforming is the biggest bottleneck in nearly all queries I looked at. There are various places we trigger deforming, most ending in either slot_deform_tuple(), heap_getattr(), or heap_deform_tuple().

   This can be optimized significantly, but even after that it's still a major issue.

   Part of the problem is that we sometimes don't know how many elements to access (especially in index and hashing related code); another part is that we deform a lot of columns that we'll never need, just because we need a subsequent one. The final part is that our tuple format requires a number of relatively expensive operations to access data (e.g. alignment computations, checking the null bitmap).

3) Our 1-by-1 tuple flow in the executor has two major issues:

   The biggest is that we perform a *lot* of repetitive work for each processed tuple. E.g. looking at nodeAgg.c's agg_fill_hash_table(), for each single tuple we execute ExecProcNode(), ExecSeqScan(), heap_getnext(), lookup_hash_entry(), LookupTupleHashEntry(), ... and many more. Each of these has a noticeable per-call state setup. When executing on batches of tuples, a lot of that setup work can be done once per batch instead of once per tuple. Part of that the compiler can do automatically; part of it has to be done by hand.

   Also very significantly, due to the amount of code executed, there's barely any working branch prediction, leading to massive amounts of pipeline stalls and instruction cache misses.

   There's also the fact that some future optimizations around using SIMD are easier when looking at batches of tuples, but I'm not planning to work on that.

4) Various missing micro optimizations have to be performed for more architectural issues to become visible. E.g. [2] causes such bad slowdowns in hash-agg workloads that other bottlenecks are hidden.

5) Per-tuple heap_getnext() makes it hard to fix 3), and has similar issues. Using a per-block heap_getmany() that fills a batch at once is noticeably faster (and more convenient in a batched world anyway); a rough sketch of such a batched loop follows at the end of this mail.

6) The use of linked lists adds noticeable #instructions overhead and branch misses. Just replacing {FuncExprState,BoolExprState}->args with an array gives a ~10%ish boost for queries with a bunch of quals. Another example that appears to hurt noticeably, but which I've not experimentally fixed, is AggState->hash_needed.

   To a large degree these seem fairly easily fixable; it's kinda boring work, but not really hard.

I'm planning to start individual subthreads for some of these, with in-detail discussions and/or patches. But I wanted to present this on a higher level once, before spending even more time.

Have I missed concrete issues others noticed? Missed significant improvements (please, please, only realistic stuff)?

Unfortunately these items are heavily interdependent. For some queries you'll need issue n) addressed before n+2) shows a benefit, for some it's the other way around, and such. Given the size of the overall task, the only way I can see this being reasonably addressable is trying to get smaller steps in individually, even if they're not a huge win on their own, and then focusing on the larger architectural issues (primarily 3)).

Comments so far? Different suggestions on how to get improvements around this merged?

Greetings,

Andres Freund

[1] http://archives.postgresql.org/message-id/CA%2BTgmobx8su_bYtAa3DgrqB%2BR7xZG6kHRj0ccMUUshKAQVftww%40mail.gmail.com
[2] https://www.postgresql.org/message-id/20160622002607.mbsfsm42cxqomi4d%40alap3.anarazel.de
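A minimal, self-contained sketch of what such a batched loop could look like; every name in it (TupleBatch, heap_getmany(), lookup_and_advance()) is hypothetical and not an actual executor API or part of any patch:

#include <stddef.h>

#define BATCH_SIZE 64

typedef struct TupleBatch
{
    int         ntuples;
    void       *tuples[BATCH_SIZE];     /* e.g. HeapTuple pointers */
} TupleBatch;

/* hypothetical: fill 'batch' with up to BATCH_SIZE tuples from one block,
 * returning the number of tuples placed in the batch */
extern int  heap_getmany(void *scan, TupleBatch *batch);

/* hypothetical per-tuple work, e.g. hash lookup plus transition function */
extern void lookup_and_advance(void *hashtable, void *tuple);

static void
agg_fill_hash_table_batched(void *scan, void *hashtable)
{
    TupleBatch  batch;

    /*
     * Per-batch setup (slot setup, function lookups, ...) happens once per
     * batch instead of once per tuple, and the tight inner loop gives the
     * branch predictor and the compiler something to work with.
     */
    while (heap_getmany(scan, &batch) > 0)
    {
        for (int i = 0; i < batch.ntuples; i++)
            lookup_and_advance(hashtable, batch.tuples[i]);
    }
}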
Hi,

On 2016-06-24 16:29:53 -0700, Andres Freund wrote:
> Comments so far? Different suggestions on how to get improvements
> around this merged?

Oh, and if somebody is interested in collaborating on some of these... This is far too big for me to tackle alone.

Andres
On Fri, Jun 24, 2016 at 4:29 PM, Andres Freund <andres@anarazel.de> wrote:
> As a motivation, here's a somewhat juicy example of the benefits that
> can be gained (disabled parallelism, results vary too much):
> SELECT l_linestatus, SUM(l_quantity) FROM lineitem GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
> non-optimized: 9075.835 ms
> optimized: 5194.024 ms
>
> and that's far from doing all the possible work for that query;
> especially the batching is far from optimal.

That's pretty great. Let me first just say that I think you broadly have the right idea here, and that it will likely be a big area to work on in the years ahead. This may become a big, general area with many sub-projects, kind of like parallelism. Also, I think it's very much complementary to parallelism.

> So far I've mostly looked at performance for tpc-h, not because it's a
> particularly good workload, but because it's available. If somebody has
> good suggestions for an analytics heavy workload...

I think that it's the only thing available that isn't a pain to set up.

> My observations about the performance bottlenecks I found, and partially
> addressed, are in rough order of importance (there's interdependence
> between most of them):
>
> (List of various bottlenecks)

> I'm planning to start individual subthreads for some of these, with
> in-detail discussions and/or patches. But I wanted to present this on a
> higher level once, before spending even more time.

Good call.

> Have I missed concrete issues others noticed? Missed significant
> improvements (please, please, only realistic stuff)?

I'll add one: we should have a way to make routines like heap_copytuple() "allocate" into the caller's own batch memory. We may be able to come up with a reasonably generic abstraction that minimizes the special-case handling of things like exhausting the caller's batch allocation. Failure to do something like this wastes an awful lot of memory, but more importantly causes a lot of unnecessary cache misses in tuplestore.c, and even with the 9.6 work in tuplesort.c.

Someone should also work on palloc() fragmentation, but I tend to think that that's a less efficient way of improving matters. I'd leave that until later. Having an excessive number of retail palloc() calls for cases where the caller really does process tuples in a batch fashion is inherently inefficient. I'm all for the simplicity of the memory context abstraction, but ISTM that a little specialization gets you very far.

> Unfortunately these items are heavily interdependent. For some queries
> you'll need issue n) addressed before n+2) shows a benefit, for some
> it's the other way around, and such. Given the size of the overall task,
> the only way I can see this being reasonably addressable is trying to
> get smaller steps in individually, even if they're not a huge win on
> their own, and then focusing on the larger architectural issues
> (primarily 3)).

I could not agree more. This should be strongly emphasized.

I have parallel CREATE INDEX working pretty well now. I don't think I could have done that without the 9.6 work on external sorting's cache efficiency, so in some sense that earlier work has enabled parallelism in a non-obvious way. I think that abbreviated keys will prove really important for getting the best out of parallel sort (and that a cost model will have to consider stuff like leading attribute cardinality -- in other words, it must consider CPU cache efficiency).

I also suspect that solving the problems with heap_deform_tuple() first would make my pending parallel CREATE INDEX patch look a lot better relative to a serial CREATE INDEX. That's the kind of intuitive speculation that's really hard to prove. It may be that it doesn't matter there, because working on fixing that yields obvious benefits. But, as you suggest, we should consider that that might not always be true. That's really tricky, and has historically been the kind of thing we've managed very badly.

--
Peter Geoghegan
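A rough, self-contained sketch of the batch-memory idea: tuple copies are packed densely into a caller-supplied chunk instead of one retail palloc() per heap_copytuple(). The names (BatchChunk, batch_alloc, batch_copytuple) are invented for illustration and are not an existing PostgreSQL API:

#include <string.h>
#include <stddef.h>

typedef struct BatchChunk
{
    char       *base;           /* start of the caller's chunk */
    size_t      used;           /* bytes handed out so far */
    size_t      size;           /* total chunk size */
} BatchChunk;

/* Carve 'len' bytes out of the chunk, keeping 8-byte alignment; returns
 * NULL when the chunk is exhausted, so the caller can flush/spill the
 * current batch. */
static void *
batch_alloc(BatchChunk *chunk, size_t len)
{
    void       *p;

    len = (len + 7) & ~(size_t) 7;
    if (chunk->used + len > chunk->size)
        return NULL;
    p = chunk->base + chunk->used;
    chunk->used += len;
    return p;
}

/* Copy an opaque tuple of 'len' bytes into the batch, densely packed. */
static void *
batch_copytuple(BatchChunk *chunk, const void *tuple, size_t len)
{
    void       *copy = batch_alloc(chunk, len);

    if (copy != NULL)
        memcpy(copy, tuple, len);
    return copy;
}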
On Fri, Jun 24, 2016 at 05:25:27PM -0700, Peter Geoghegan wrote:
> On Fri, Jun 24, 2016 at 4:29 PM, Andres Freund <andres@anarazel.de> wrote:
> > As a motivation, here's a somewhat juicy example of the benefits that
> > can be gained (disabled parallelism, results vary too much):
> > SELECT l_linestatus, SUM(l_quantity) FROM lineitem GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
> > non-optimized: 9075.835 ms
> > optimized: 5194.024 ms
> >
> > and that's far from doing all the possible work for that query;
> > especially the batching is far from optimal.
>
> That's pretty great. Let me first just say that I think you broadly
> have the right idea here, and that it will likely be a big area to
> work on in the years ahead. This may become a big, general area with
> many sub-projects, kind of like parallelism. Also, I think it's very
> much complementary to parallelism.

Agreed. I did put out a blog entry about this in April that has links to some external projects trying to address this issue:

http://momjian.us/main/blogs/pgblog/2016.html#April_1_2016

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
On 25 June 2016 05:00, Andres Freund wrote:
>To: pgsql-hackers@postgresql.org
>Subject: [HACKERS] Improving executor performance
>
>My observations about the performance bottlenecks I found, and partially
>addressed, are in rough order of importance (there's interdependence
>between most of them):
>
>1) Cache misses cost us a lot; doing more predictable accesses can make
>   a huge difference. Without addressing that, many other bottlenecks
>   don't matter all that much. I see *significant* performance
>   improvements for large seqscans (when the results are used) simply
>   by memcpy()'ing the current target block.
>
>   This is partially an intrinsic problem of analyzing a lot of data,
>   and partially because our access patterns are bad.
>
>2) Barring 1), tuple deforming is the biggest bottleneck in nearly all
>   queries I looked at. There are various places we trigger deforming,
>   most ending in either slot_deform_tuple(), heap_getattr(), or
>   heap_deform_tuple().

Agreed, I had also observed similar behavior, specifically (2), while working on improving executor performance. I had done some prototype work to improve performance by using native compilation. Details are available on my blog:
http://rajeevrastogi.blogspot.in/2016/03/native-compilation-part-2-pgday-asia.html

>3) Our 1-by-1 tuple flow in the executor has two major issues:

Agreed. In order to tackle this, IMHO we should:
1. Make the processing data-centric instead of operator-centric.
2. Instead of pulling each tuple from the immediate operator, an operator can push the tuple to its parent. It can be allowed to push until it sees an operator which cannot be processed without results from another operator.

More details in another thread:
https://www.postgresql.org/message-id/BF2827DCCE55594C8D7A8F7FFD3AB77159A9B904@szxeml521-mbs.china.huawei.com

Thanks and Regards,
Kumar Rajeev Rastogi
Hi,

On 2016-06-28 10:01:28 +0000, Rajeev rastogi wrote:
> >3) Our 1-by-1 tuple flow in the executor has two major issues:
>
> Agreed. In order to tackle this, IMHO we should:
> 1. Make the processing data-centric instead of operator-centric.
> 2. Instead of pulling each tuple from the immediate operator, an operator can push the tuple to its parent. It can be allowed to push until it sees an operator which cannot be processed without results from another operator.
> More details in another thread:
> https://www.postgresql.org/message-id/BF2827DCCE55594C8D7A8F7FFD3AB77159A9B904@szxeml521-mbs.china.huawei.com

I doubt that that's going to be ok in the generic case (memory usage, materializing too much, "bushy plans", merge joins), and it's way more invasive than what I'm thinking of. I think that by having batches of tuples "bubble" up in a volcano-style executor, it's possible to get most of the improvements of both approaches.

Andres
On 30 June 2016 at 02:32, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2016-06-28 10:01:28 +0000, Rajeev rastogi wrote:
> > >3) Our 1-by-1 tuple flow in the executor has two major issues:
> >
> > Agreed. In order to tackle this, IMHO we should:
> > 1. Make the processing data-centric instead of operator-centric.
> > 2. Instead of pulling each tuple from the immediate operator, an operator can push the tuple to its parent. It can be allowed to push until it sees an operator which cannot be processed without results from another operator.
> > More details in another thread:
> > https://www.postgresql.org/message-id/BF2827DCCE55594C8D7A8F7FFD3AB77159A9B904@szxeml521-mbs.china.huawei.com
>
> I doubt that that's going to be ok in the generic case (memory usage,
> materializing too much, "bushy plans", merge joins)
Yeah. You'd likely start landing up with Haskell-esque predictability of memory use. Given how limited and flawed work_mem handling etc already is, that doesn't sound like an appealing direction to go in. Not without a bunch of infrastructure to manage queue sizes and force work into batches to limit memory use, anyway.
Hi,

On 2016-06-24 16:29:53 -0700, Andres Freund wrote:
> 2) Barring 1), tuple deforming is the biggest bottleneck in nearly all
> queries I looked at. There are various places we trigger deforming,
> most ending in either slot_deform_tuple(), heap_getattr(), or
> heap_deform_tuple().
>
> This can be optimized significantly, but even after that it's still a
> major issue.
>
> Part of the problem is that we sometimes don't know how many elements
> to access (especially in index and hashing related code); another part
> is that we deform a lot of columns that we'll never need, just because
> we need a subsequent one.
>
> The final part is that our tuple format requires a number of
> relatively expensive operations to access data (e.g. alignment
> computations, checking the null bitmap).

> 6) The use of linked lists adds noticeable #instructions overhead and
> branch misses. Just replacing {FuncExprState,BoolExprState}->args
> with an array gives a ~10%ish boost for queries with a bunch of
> quals. Another example that appears to hurt noticeably, but which
> I've not experimentally fixed, is AggState->hash_needed.
>
> To a large degree these seem fairly easily fixable; it's kinda boring
> work, but not really hard.

As previously discussed, many of the "architectural" changes show limited success until a few other bottlenecks are fixed. Most prominent are slot deforming and, what I'm planning to primarily discuss in this email, expression evaluation.

Even in the current state, profiles of queries evaluating a large number of tuples very commonly show expression evaluation to be one of the major costs. Due to the number of recursive calls that's not always easy to pinpoint though; the easiest thing to spot is usually ExecMakeFunctionResultNoSets showing up prominently.

While, as in 6) above, removing linked lists from the relevant structures helps, it's not that much. Staring at this for a long while made me realize that, somewhat surprisingly to me, one of the major problems is that we are bottlenecked on stack usage. Constantly entering and leaving this many functions for trivial expression evaluations hurts considerably. Some of that is the additional number of instructions, some of that is managing return jump addresses, and some of that the actual bus traffic. It's also rather expensive to check for stack limits at a very high rate.

Attached (in patch 0003) is a proof-of-concept implementing an expression evaluation framework that doesn't use recursion. Instead ExecInitExpr2 computes a number of 'steps' necessary to compute an expression. These steps are stored in a linear array and executed one after another (save boolean expressions, which can jump to later steps). E.g. to compute colname = 1 the steps are 1) fetch var, 2) evaluate const, 3) call function. Expression evaluation then is a switch statement over the opcodes of each of these steps. (A toy sketch of this step representation is at the end of this mail.)

Using the machinery of those 'steps' to compute projections, quals, and constraint evaluation then allows reducing the number of indirect function calls/jumps further.

My preliminary implementation so far handles only function calls, boolean expressions, constant evaluation and variable evaluation natively. For everything else I'm falling back to the current expression machinery. By combining expression, qual and target list processing, we can also always generate the appropriate slot_getsomeattrs() calls.

Note that the patch currently does *NOT* support targetlist SRFs, and instead just errors out. This is not a fundamental issue; I just didn't want to invest time in supporting something we want to reimplement anyway. Similarly, subplans currently don't work because of:

/*
 * Evaluate lefthand expressions and form a projection tuple. First we
 * have to set the econtext to use (hack alert!).

which doesn't work quite like that atm.

I've used a similar method (in 0004) to reduce the number of branches and pipeline stalls in slot_deform_tuple considerably. For each attribute a 'step' is generated, which contains exactly the computations required to deform that individual datum. That e.g. allows separating the cases for different alignments and null checks.

Having expression evaluation and slot deforming as a series of simple sequential steps, instead of complex recursive calls, would also make it fairly straightforward to optionally just-in-time compile those.

For motivation, here are some random performance differences:

SELECT SUM(l_quantity * l_extendedprice) FROM lineitem;
master: 5038.382 4965.310 4983.146
patches: 4194.593 4153.759 4168.986

tpch-q1
master: 21274.896
dev: 17952.678

For queries involving more complex expressions, the difference can be larger.

Both expression processing and tuple deforming can use considerable additional improvements, but I think the approaches presented here are the biggest step that I can see. The reason that I'm bringing this up before submitting actual 'batch operation' patches is that the architectural improvements are quickly hidden behind these bottlenecks.

Before spending time polishing up these approaches, I'd like to know if anybody fundamentally disagrees with either, or has a better proposal. If not, I'm hoping to first "properly" submit the slot deforming for review, and then the expression evaluation. The latter would obviously need to be a lot more complete than now; and we'd likely want to do the targetlist SRF rearchitecting beforehand.

Comments?

Regards,

Andres
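A toy, self-contained sketch of the "linear array of steps" evaluation described above, for an expression like colname = 1: one step fetches the Var, one loads the Const, one calls the comparison function, and a flat loop dispatches on the opcodes. The names (ExprOpcode, ExprStep, exec_steps) are invented for the example and are not the ones used by patch 0003:

#include <stdint.h>

typedef int64_t Datum;          /* stand-in for the real Datum */

typedef enum ExprOpcode
{
    STEP_FETCH_VAR,             /* read attribute 'attnum' from the input values */
    STEP_CONST,                 /* load a constant */
    STEP_FUNC_CALL,             /* call a 2-argument function on the top two values */
    STEP_DONE                   /* return the last computed value */
} ExprOpcode;

typedef struct ExprStep
{
    ExprOpcode  op;
    int         attnum;                 /* STEP_FETCH_VAR */
    Datum       constval;               /* STEP_CONST */
    Datum     (*fn) (Datum a, Datum b); /* STEP_FUNC_CALL */
} ExprStep;

/*
 * One flat loop with a switch over opcodes: no recursion, no per-node
 * indirect calls, and a shape that is straightforward to JIT-compile.
 *
 * For colname = 1 (with colname being attribute 2), the step array would
 * be roughly:
 *   { {STEP_FETCH_VAR, .attnum = 2}, {STEP_CONST, .constval = 1},
 *     {STEP_FUNC_CALL, .fn = int_eq}, {STEP_DONE} }
 */
static Datum
exec_steps(const ExprStep *steps, const Datum *slot_values)
{
    Datum       stack[16];
    int         sp = 0;

    for (int i = 0;; i++)
    {
        switch (steps[i].op)
        {
            case STEP_FETCH_VAR:
                stack[sp++] = slot_values[steps[i].attnum];
                break;
            case STEP_CONST:
                stack[sp++] = steps[i].constval;
                break;
            case STEP_FUNC_CALL:
                sp -= 2;
                stack[sp] = steps[i].fn(stack[sp], stack[sp + 1]);
                sp++;
                break;
            case STEP_DONE:
                return stack[sp - 1];
        }
    }
}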
Hi,

On 2016-06-24 16:29:53 -0700, Andres Freund wrote:
> 4) Various missing micro optimizations have to be performed for more
> architectural issues to become visible. E.g. [2] causes such bad
> slowdowns in hash-agg workloads that other bottlenecks are hidden.

One such issue is the usage of dynahash.c in tidbitmap.c. In many queries, e.g. tpch q7, the bitmapscan is often the bottleneck. Profiling shows that this is largely due to dynahash.c being slow. Primary issues are:
a) the two-level structure doubling the amount of indirect lookups
b) indirect function calls
c) using separate-chaining based conflict resolution
d) being too general

I've quickly hacked up an alternative linear addressing hashtable implementation, and the improvements are quite remarkable.

Example query:
EXPLAIN ANALYZE SELECT SUM(l_extendedprice) FROM lineitem WHERE (l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date);

before:
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                 QUERY PLAN                                                                │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Aggregate  (cost=942046.72..942046.73 rows=1 width=8) (actual time=4647.908..4647.909 rows=1 loops=1)                                     │
│   ->  Bitmap Heap Scan on lineitem  (cost=193514.84..919246.17 rows=9120222 width=8) (actual time=2667.821..3885.598 rows=9113219 loops=1)│
│         Recheck Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date))                                         │
│         Heap Blocks: exact=585542                                                                                                          │
│         ->  Bitmap Index Scan on i_l_shipdate  (cost=0.00..191234.78 rows=9120222 width=0) (actual time=2461.622..2461.622 rows=9113219 loops=1) │
│               Index Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date))                                     │
│ Planning time: 0.170 ms                                                                                                                    │
│ Execution time: 4648.921 ms                                                                                                                │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

timing without analyze: 4136.425 4101.873 4177.441

after:
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                 QUERY PLAN                                                                │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Aggregate  (cost=942046.72..942046.73 rows=1 width=8) (actual time=3218.422..3218.423 rows=1 loops=1)                                     │
│   ->  Bitmap Heap Scan on lineitem  (cost=193514.84..919246.17 rows=9120222 width=8) (actual time=1153.707..2430.500 rows=9113219 loops=1)│
│         Recheck Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date))                                         │
│         Heap Blocks: exact=585542                                                                                                          │
│         ->  Bitmap Index Scan on i_l_shipdate  (cost=0.00..191234.78 rows=9120222 width=0) (actual time=952.161..952.161 rows=9113219 loops=1) │
│               Index Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date))                                     │
│ Planning time: 1.075 ms                                                                                                                    │
│ Execution time: 3221.861 ms                                                                                                                │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(8 rows)

timing without analyze: 2647.364 2674.456 2680.197

As you can see, the time for the bitmap index scan goes from 2461.622 to 952.161.

I'm not proposing to apply the patch as-is, but I think it's a good starting point to discuss how to evolve our use of hash tables.

I'm wondering whether we can do 'macro based templates' or something. I.e. have something like the simplehash in the patch in simplehash.h, but the key/value widths, the function names, etc. are all determined by macros (oh, this would be easier with C++ templates...).

Does anybody have a better idea?

The major issue with the simplehash implementation in the patch is probably the deletion, which should move cells around rather than use tombstones. But that was too complex for a POC ;). Also, it'd likely need a proper iterator interface.

FWIW, the dynahash usage in nodeAgg.c is a major bottleneck in a number of other queries.

Regards,

Andres
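A toy, self-contained illustration of the "macro based templates" idea: the caller defines a prefix, key type, and hash/compare macros and gets a specialized, linear-probing hash table with no indirect function calls. This is only a sketch under those assumptions, not the simplehash implementation from the attached patch (creation, growth and deletion are omitted):

#include <stdbool.h>
#include <stdint.h>

/* parameters a user of the "template" would define */
#define SH_PREFIX       tidhash
#define SH_KEY_TYPE     uint64_t
#define SH_HASH(k)      ((uint32_t) ((k) * 0x9E3779B9u))
#define SH_EQUAL(a, b)  ((a) == (b))

/* name mangling: SH_FN(insert) expands to tidhash_insert, etc. */
#define SH_PASTE2(a, b) a##_##b
#define SH_PASTE(a, b)  SH_PASTE2(a, b)
#define SH_FN(name)     SH_PASTE(SH_PREFIX, name)

typedef struct SH_FN(entry)
{
    SH_KEY_TYPE key;
    bool        used;
} SH_FN(entry);

typedef struct SH_FN(table)
{
    SH_FN(entry) *data;         /* array of 'mask + 1' entries */
    uint32_t    mask;           /* size - 1, with size a power of two */
} SH_FN(table);

/* Insert-or-find using open addressing with linear probing. */
static SH_FN(entry) *
SH_FN(insert) (SH_FN(table) *tb, SH_KEY_TYPE key)
{
    uint32_t    pos = SH_HASH(key) & tb->mask;

    for (;;)
    {
        SH_FN(entry) *e = &tb->data[pos];

        if (!e->used)
        {
            e->used = true;
            e->key = key;
            return e;
        }
        if (SH_EQUAL(e->key, key))
            return e;
        pos = (pos + 1) & tb->mask;     /* probe the next cell */
    }
}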
Andres Freund writes:

> Having expression evaluation and slot deforming as a series of simple
> sequential steps, instead of complex recursive calls, would also make it
> fairly straightforward to optionally just-in-time compile those.

I don't think that JIT becomes easier by this change. Constructing the IR for LLVM, libFirm or any other JIT library from expression trees is straightforward already. It's probably more of a nuisance for those that already have some code/design on JIT-compiling expressions (vitessedb, ISP RAS, yours truly).

I like your patch for all the other reasons stated though!

regards
Andreas
On 2016-07-14 20:04:03 +0200, Andreas Seltenreich wrote:
> Andres Freund writes:
>
> > Having expression evaluation and slot deforming as a series of simple
> > sequential steps, instead of complex recursive calls, would also make it
> > fairly straightforward to optionally just-in-time compile those.
>
> I don't think that JIT becomes easier by this change. Constructing the
> IR for LLVM, libFirm or any other JIT library from expression trees is
> straightforward already. It's probably more of a nuisance for those
> that already have some code/design on JIT-compiling expressions
> (vitessedb, ISP RAS, yours truly)

The problem is that the previous form has a lot of ad-hoc analysis strewn in. The interesting part is getting rid of all that. That's what the new ExecInitExpr2() does. The target form can both be evaluated more efficiently in the dispatch manner in the patch, and quite simply converted to a JIT - without duplicating the analysis code. I did write a small ad-hoc x86 jit, and it was really quite straightforward this way.

What did you do with JIT and expression evaluation? You basically just replaced the toplevel ExprState node with a different evalfunc, pointing into your code?

Andres
Andres Freund writes:

> The problem is that the previous form has a lot of ad-hoc analysis
> strewn in. The interesting part is getting rid of all that. That's what
> the new ExecInitExpr2() does. The target form can both be evaluated more
> efficiently in the dispatch manner in the patch, and quite simply
> converted to a JIT - without duplicating the analysis code. I did write
> a small ad-hoc x86 jit, and it was really quite straightforward this
> way.

Ja, I see the advantage when doing ad-hoc JIT compilation.

> What did you do with JIT and expression evaluation? You basically just
> replaced the toplevel ExprState node with a different evalfunc, pointing
> into your code?

That's the plan, yes. I'm sorry there's no publishable code yet on the postgres side of things. Using libFirm[1], the plan is to:

1. Automatically generate Firm-IR for the static C code around expression evaluation as well as for the operators in the system catalog.

2. Construct IR for expression trees (essentially all the function calls the executor would do).

3. Push libFirm's optimize button. At this stage, most of the dispatching goes away by inlining the calls, including functions from the catalog implementing operators.

4. Generate code and replace the toplevel expression node with a funcall node.

I did implement this recipe with a toy Forth interpreter to see whether libFirm was up to the job (nobody's done JIT with libFirm before). The results were favorable[2].

Currently, a student at credativ is working on applying these techniques to postgres.

regards,
Andreas

Footnotes:
[1] http://libfirm.org/
[2] https://korte.credativ.com/~ase/firm-postgres-jit-forth.pdf
On 2016-07-14 23:03:10 +0200, Andreas Seltenreich wrote:
> That's the plan, yes. I'm sorry there's no publishable code yet on the
> postgres side of things. Using libFirm[1], the plan is to:

Why libfirm? It seems to only have x86 and sparc backends, and no windows support?

> 1. Automatically generate Firm-IR for the static C code around
> expression evaluation as well as for the operators in the system catalog.

> 2. Construct IR for expression trees (essentially all the function calls
> the executor would do).

But that essentially means redoing most of execQual's current code in IR - or do you want to do that via 1) above? As long as the preparation code (which is currently intermixed with the execution phase) isn't separated, that means pulling in essentially the whole backend, given that we do catalog lookups and everything during that phase.

> Currently, a student at credativ is working on applying these
> techniques to postgres.

Are you planning to submit this to postgres proper?

Regards,

Andres
Andres Freund writes:

> On 2016-07-14 23:03:10 +0200, Andreas Seltenreich wrote:
>> That's the plan, yes. I'm sorry there's no publishable code yet on the
>> postgres side of things. Using libFirm[1], the plan is to:
>
> Why libfirm?

- It has a more modern IR than LLVM (they're catching up though)
- It's very lightweight, compiles faster than LLVM's ./configure run
- I do have lots of experience with it and a committer bit

> It seems to only have x86 and sparc backends, and no windows support?

Ack, it's mostly used in research projects; that's why the number of supported ISAs is small. It's enough to answer the burning question of what speedup is to be expected by jit-compiling things in the backend, though. Also, if this thing actually takes off, adding more backends is something that is easier with libFirm than LLVM, IMHO.

>> 1. Automatically generate Firm-IR for the static C code around
>> expression evaluation as well as for the operators in the system catalog.
>
>> 2. Construct IR for expression trees (essentially all the function calls
>> the executor would do).
>
> But that essentially means redoing most of execQual's current code in IR
> - or do you want to do that via 1) above?

Manually re-doing backend logic in IR is a can of worms I do not want to open. This would guarantee bugs and result in a maintenance nightmare, so doing 1) for the code is the only option when something turns out to be a bottleneck IMHO.

> As long as the preparation code (which is currently intermixed with
> the execution phase) isn't separated, that means pulling in essentially
> the whole backend, given that we do catalog lookups and everything
> during that phase.

Right, the catalog lookups need to be done before JIT compilation to allow inlining operators.

>> Currently, a student at credativ is working on applying these
>> techniques to postgres.
>
> Are you planning to submit this to postgres proper?

The goal is to publish it as an extension that sneaks into planner_hook. I think BSD licensing is ok as long as libfirm (LGPL) is kept as an external dependency.

regards,
Andreas
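For what it's worth, here is a rough, self-contained sketch (not from the thread or the student's code) of how an extension can sneak into planner_hook as described. The jit_compile_plan() function is a hypothetical placeholder; the hook signature is the 9.6-era planner_hook:

#include "postgres.h"
#include "fmgr.h"
#include "nodes/params.h"
#include "optimizer/planner.h"

PG_MODULE_MAGIC;

static planner_hook_type prev_planner_hook = NULL;

/* hypothetical placeholder: walk the plan and JIT-compile expression trees */
static void
jit_compile_plan(PlannedStmt *stmt)
{
    (void) stmt;
}

static PlannedStmt *
jit_planner(Query *parse, int cursorOptions, ParamListInfo boundParams)
{
    PlannedStmt *stmt;

    /* run the previous hook, or the standard planner, first */
    if (prev_planner_hook)
        stmt = prev_planner_hook(parse, cursorOptions, boundParams);
    else
        stmt = standard_planner(parse, cursorOptions, boundParams);

    jit_compile_plan(stmt);
    return stmt;
}

void
_PG_init(void)
{
    prev_planner_hook = planner_hook;
    planner_hook = jit_planner;
}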
On Wed, Jul 13, 2016 at 8:06 PM, Andres Freund <andres@anarazel.de> wrote:
> I've quickly hacked up an alternative linear addressing hashtable
> implementation. And the improvements are quite remarkable.
>
> Example query:
> EXPLAIN ANALYZE SELECT SUM(l_extendedprice) FROM lineitem WHERE (l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date);

> timing without analyze: 4136.425 4101.873 4177.441
>
> after:
> timing without analyze: 2647.364 2674.456 2680.197
>
> As you can see, the time for the bitmap index scan goes from 2461.622
> to 952.161.

That's pretty great. I wonder what this would look like with a BRIN index, since l_shipdate looks like a good candidate for BRIN indexing.

--
Peter Geoghegan
On 2016-07-14 20:41:21 -0700, Peter Geoghegan wrote:
> On Wed, Jul 13, 2016 at 8:06 PM, Andres Freund <andres@anarazel.de> wrote:
> > I've quickly hacked up an alternative linear addressing hashtable
> > implementation. And the improvements are quite remarkable.
> >
> > Example query:
> > EXPLAIN ANALYZE SELECT SUM(l_extendedprice) FROM lineitem WHERE (l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date);
> >
> > timing without analyze: 4136.425 4101.873 4177.441
> >
> > after:
> >
> > timing without analyze: 2647.364 2674.456 2680.197
> >
> > As you can see, the time for the bitmap index scan goes from 2461.622
> > to 952.161.
>
> That's pretty great. I wonder what this would look like with a BRIN
> index, since l_shipdate looks like a good candidate for BRIN indexing.

Brin indexes IIRC always end up using tidbitmap.c, so the benefits should be there as well ;)
On Thu, Jul 14, 2016 at 8:45 PM, Andres Freund <andres@anarazel.de> wrote:
> Brin indexes IIRC always end up using tidbitmap.c, so the benefits
> should be there as well ;)

Right. Might the improvement be even more pronounced, though? I'm not sure how a BRIN index with a suitable physical/logical correlation performs compared to a bitmap index scan of a B-Tree (due to a less selective bitmap index scan qual, as in your example), but offhand I guess that could be faster in general, making the bottleneck you're addressing relatively greater there.

--
Peter Geoghegan
On Wed, Jul 13, 2016 at 11:06 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-06-24 16:29:53 -0700, Andres Freund wrote:
>> 4) Various missing micro optimizations have to be performed for more
>> architectural issues to become visible. E.g. [2] causes such bad
>> slowdowns in hash-agg workloads that other bottlenecks are hidden.
>
> One such issue is the usage of dynahash.c in tidbitmap.c. In many
> queries, e.g. tpch q7, the bitmapscan is often the bottleneck. Profiling
> shows that this is largely due to dynahash.c being slow. Primary issues
> are: a) the two-level structure doubling the amount of indirect lookups
> b) indirect function calls c) using separate-chaining based conflict
> resolution d) being too general.
>
> I've quickly hacked up an alternative linear addressing hashtable
> implementation, and the improvements are quite remarkable.

Nice!

> I'm wondering whether we can do 'macro based templates' or
> something. I.e. have something like the simplehash in the patch in
> simplehash.h, but the key/value widths, the function names, etc. are all
> determined by macros (oh, this would be easier with C++ templates...).
>
> Does anybody have a better idea?

No.

> The major issue with the simplehash implementation in the patch is
> probably the deletion, which should move cells around rather than use
> tombstones. But that was too complex for a POC ;). Also, it'd
> likely need a proper iterator interface.

Do we ever need to delete from a TIDBitmap? Probably not, but I'm guessing you have other uses for this in mind.

> FWIW, the dynahash usage in nodeAgg.c is a major bottleneck in a number
> of other queries.

Can we use this implementation for that as well, or are we going to need yet another one?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2016-07-17 08:32:17 -0400, Robert Haas wrote:
> On Wed, Jul 13, 2016 at 11:06 PM, Andres Freund <andres@anarazel.de> wrote:
> > The major issue with the simplehash implementation in the patch is
> > probably the deletion, which should move cells around rather than use
> > tombstones. But that was too complex for a POC ;). Also, it'd
> > likely need a proper iterator interface.
>
> Do we ever need to delete from a TIDBitmap? Probably not, but I'm
> guessing you have other uses for this in mind.

We do, via BitmapAnd.

> > FWIW, the dynahash usage in nodeAgg.c is a major bottleneck in a number
> > of other queries.
>
> Can we use this implementation for that as well, or are we going to
> need yet another one?

I've not tested it, but I'd assume that something like the simplehash should work there. It's a bit more complicated of a scenario though.

Andres
On Wed, Jul 13, 2016 at 6:18 PM, Andres Freund <andres@anarazel.de> wrote:
> While, as in 6) above, removing linked lists from the relevant
> structures helps, it's not that much. Staring at this for a long while
> made me realize that, somewhat surprisingly to me, one of the major
> problems is that we are bottlenecked on stack usage. Constantly
> entering and leaving this many functions for trivial expression
> evaluations hurts considerably. Some of that is the additional number
> of instructions, some of that is managing return jump addresses, and
> some of that the actual bus traffic. It's also rather expensive to
> check for stack limits at a very high rate.

You'll recall how I complained that parallel CREATE INDEX, while generally effective, became incredibly CPU bound on the still-serial merge in my C-collated text test case (I told you this in person, I think). I looked into addressing this bottleneck, and made an interesting discovery, which kind of reminds me of what you say here about function call overhead.

I hacked up varstrfastcmp_c() to assume 4 byte varlena headers. No function call to pg_detoast_datum_packed() is made (which otherwise happens through all that DatumGetVarStringPP() macro indirection). Of course, that assumption is dangerous in the general case, but it could be made to work in most cases with a little care, as in practice the vast majority of text comparisons are of text Datums with a 4 byte varlena header. SortSupport could reliably detect if it was safe, with a little help from the text opclass, or something along those lines.

This might not just be useful with sorting, but it does seem to be particularly useful with parallel sorting. Just on my laptop, this makes some parallel CREATE INDEX gensort test cases [1] take as much as 15% less time to execute overall. That's a big difference. I looked at the disassembly, and the number of instructions for varstrfastcmp_c() was reduced from 113 to 29. That's the kind of difference that could add up to a lot.

[1] https://github.com/petergeoghegan/gensort

--
Peter Geoghegan
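A self-contained sketch of that shortcut, using simplified stand-in definitions rather than the real varlena macros (the real 4-byte header encodes the length differently); whether the 4-byte-header assumption actually holds would have to be established by the opclass/SortSupport setup, as described above:

#include <string.h>
#include <stdint.h>

typedef struct toy_varlena
{
    uint32_t    header;         /* stand-in for a plain 4-byte varlena header */
    char        data[];         /* payload follows the header */
} toy_varlena;

#define TOY_HDRSZ   ((int) sizeof(uint32_t))
#define TOY_SIZE(v) ((int) ((v)->header))   /* total size, including header */
#define TOY_DATA(v) ((v)->data)

/* memcmp()-style comparison, as for the "C" collation: no detoasting call,
 * no short-header branches, just two length computations and a memcmp(). */
static int
toy_strfastcmp_c(const toy_varlena *a, const toy_varlena *b)
{
    int         len1 = TOY_SIZE(a) - TOY_HDRSZ;
    int         len2 = TOY_SIZE(b) - TOY_HDRSZ;
    int         cmp = memcmp(TOY_DATA(a), TOY_DATA(b),
                             len1 < len2 ? len1 : len2);

    if (cmp != 0)
        return cmp;
    return (len1 > len2) - (len1 < len2);
}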
On 07/13/2016 06:18 PM, Andres Freund wrote:
> Attached (in patch 0003) is a proof-of-concept implementing an
> expression evaluation framework that doesn't use recursion. Instead
> ExecInitExpr2 computes a number of 'steps' necessary to compute an
> expression. These steps are stored in a linear array and executed one
> after another (save boolean expressions, which can jump to later steps).
> E.g. to compute colname = 1 the steps are 1) fetch var, 2) evaluate
> const, 3) call function.

We've been having trouble with the performance of simple expressions in PL/pgSQL, so I started playing with this patch. (No sense re-inventing the wheel after all.) It was straightforward to extend to simple expressions and showed an immediate improvement (~10% faster on a simple test).

Running in our full environment highlighted a few areas that I think are worth more investigation. However, before I tackle that, is the posted proof-of-concept still the latest and greatest? If not, any chance of getting the latest?

Going forward, I'd like to collaborate on our efforts if you're interested.

- Doug Doole
Salesforce
Hi,

On 2016-11-04 15:25:58 -0700, Doug Doole wrote:
> On 07/13/2016 06:18 PM, Andres Freund wrote:
> > Attached (in patch 0003) is a proof-of-concept implementing an
> > expression evaluation framework that doesn't use recursion. Instead
> > ExecInitExpr2 computes a number of 'steps' necessary to compute an
> > expression. These steps are stored in a linear array and executed one
> > after another (save boolean expressions, which can jump to later steps).
> > E.g. to compute colname = 1 the steps are 1) fetch var, 2) evaluate
> > const, 3) call function.
>
> We've been having trouble with the performance of simple expressions in
> PL/pgSQL, so I started playing with this patch. (No sense re-inventing the
> wheel after all.) It was straightforward to extend to simple expressions
> and showed an immediate improvement (~10% faster on a simple test).

That's cool.

> Running in our full environment highlighted a few areas that I think
> are worth more investigation.

There indeed should be many of those.

> However, before I tackle that, is the posted proof-of-concept still the
> latest and greatest? If not, any chance of getting the latest?

It's definitely *not* the latest and greatest. My latest version actually passes all regression tests, has a few additional expression types handled natively (NULL tests, CASE IIRC), and more than a few bugs fixed. I'm away at the moment, but I plan to post a new version at the end of this week or the beginning of next.

> Going forward, I'd like to collaborate on our efforts if you're interested.

That sounds good!

Regards,

Andres