Thread: Improving executor performance

Improving executor performance

From
Andres Freund
Date:
Hi,

At pgcon, in [1], and in various other threads and conferences, we've
talked about our executor performance needing improvements to perform
well in non-OLTP workloads, and about how we can get there.

I've played with various techniques, on and off, over the last few
weeks: trying to make slot_deform_tuple() et al. faster, batched
execution for a few node types (SeqScan, hashed Agg), avoiding linked
lists in executor data structures, and smaller micro-optimizations.

As a motivation, here's a somewhat juicy example of the benefits that
can be gained (parallelism disabled, since results otherwise vary too much):
SELECT l_linestatus, SUM(l_quantity) FROM lineitem GROUP BY 1 ORDER BY 2 DESC LIMIT 10 ;
non-optimized: 9075.835 ms   optimized: 5194.024 ms

and that's far from doing all the possible work for that query;
especially the batching is far from optimal.


So far I've mostly looked at performance for tpc-h, not because it's a
particularly good workload, but because it's available.  If somebody has
good suggestions for an analytics heavy workload...


My observations about the performance bottlenecks I found, and partially
addressed, are in rough order of importance (there's interdependence
between most of them):

1) Cache misses cost us a lot; doing more predictable accesses can make
   a huge difference. Without addressing that, many other bottlenecks
   don't matter all that much.  I see *significant* performance
   improvements for large seqscans (when the results are used) simply by
   memcpy()'ing the current target block.

   This partially is an intrinsic problem of analyzing a lot of data,
   and partially because our access patterns are bad.


2) Barring 1), tuple deforming is the biggest bottleneck in nearly all
   queries I looked at. There are various places we trigger deforming,
   most ending in either slot_deform_tuple(), heap_getattr(), or
   heap_deform_tuple().

   This can be optimized significantly, but even after that it's still a
   major issue.

   Part of the problem is that we sometimes don't know how many elements
   to access (especially in index- and hashing-related code); the other
   part is that we're deforming a lot of columns that we'll never need,
   just because we need a subsequent one.

   On top of that, our tuple format requires a number of relatively
   expensive operations to access data (e.g. alignment computations,
   checking the null bitmap).
 


3) Our 1-by-1 tuple flow in the executor has two major issues:

   The biggest is that we perform a *lot* of repetitive work for each
   processed tuple. E.g. looking at nodeAgg.c's agg_fill_hash_table(),
   for each single tuple we execute ExecProcNode(), ExecSeqScan(),
   heap_getnext(), lookup_hash_entry(), LookupTupleHashEntry(), ... and
   many more. Each of these has a noticeable per-call state setup.
   When executing on batches of tuples, a lot of that setup work can be
   done once per batch instead of once per tuple. Part of that the
   compiler can do automatically; part of it has to be done by hand.

   Also very significantly, due to the amount of code executed, there's
   barely any working branch prediction, leading to massive amounts of
   pipeline stalls and instruction cache misses.

   There's also the fact that some future optimizations around using
   SIMD are easier when looking at batches of tuples, but I'm not
   planning to work on that.
 


4) Various missing micro-optimizations have to be performed for more
   architectural issues to become visible. E.g. [2] causes such bad
   slowdowns in hash-agg workloads that other bottlenecks are hidden.
 


5) Per-tuple heap_getnext() makes it hard to fix 3), and has similar
   issues. Using a per-block heap_getmany() that fills a batch at once
   is noticeably faster (and more convenient in a batched world anyway).
 


6) The use of linked lists adds noticeable instruction-count overhead
   and branch misses. Just replacing {FuncExprState,BoolExprState}->args
   with an array gives a ~10%ish boost for queries with a bunch of
   quals (see the sketch below). Another example that appears to hurt
   noticeably, but which I've not experimentally fixed, is
   AggState->hash_needed.

   To a large degree these seem fairly easily fixable; it's kinda boring
   work, but not really hard.
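
To make 6) a bit more concrete, here's a rough sketch of the kind of
change I mean. The "before" part is essentially what ExecEvalFuncArgs()
does today; the "after" part is illustrative only, the fcache->nargs /
fcache->argarray fields are made up for this example and don't exist
under those names:

/* today: walk a List of argument ExprStates, one pointer chase per element */
i = 0;
foreach(arg, fcache->args)
{
    ExprState  *argstate = (ExprState *) lfirst(arg);

    fcinfo->arg[i] = ExecEvalExpr(argstate, econtext, &fcinfo->argnull[i], NULL);
    i++;
}

/* with an array: length known up front, predictable sequential accesses */
for (i = 0; i < fcache->nargs; i++)
{
    ExprState  *argstate = fcache->argarray[i];

    fcinfo->arg[i] = ExecEvalExpr(argstate, econtext, &fcinfo->argnull[i], NULL);
}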


I'm planning to start individual subthreads for some of these, with
in-detail discussions and/or patches. But I wanted to present this on a
higher level once, before spending even more time.


Have I missed concrete issues others noticed? Missed significant
improvements (please, please, only realistic stuff)?


Unfortunately these items are heavily interdependent. For some queries
you'll need issue n) addressed before n+2) shows a benefit; for some
it's the other way around.  Given the size of the overall task, the
only way I can see this being reasonably addressable is to get
smaller steps in individually, even if they're not a huge win on their
own, and then focus on the larger architectural changes (primarily 3)).


Comments so far?  Different suggestions on how to get improvements
around this merged?


Greetings,

Andres Freund

[1] http://archives.postgresql.org/message-id/CA%2BTgmobx8su_bYtAa3DgrqB%2BR7xZG6kHRj0ccMUUshKAQVftww%40mail.gmail.com
[2] https://www.postgresql.org/message-id/20160622002607.mbsfsm42cxqomi4d%40alap3.anarazel.de



Re: Improving executor performance

From
Andres Freund
Date:
Hi,


On 2016-06-24 16:29:53 -0700, Andres Freund wrote:
> Comments so far?  Different suggestions on how to get improvements
> around this merged?

Oh, and if somebody is interested in collaborating on some of
these... This is far too big for me to tackle alone.

Andres



Re: Improving executor performance

From
Peter Geoghegan
Date:
On Fri, Jun 24, 2016 at 4:29 PM, Andres Freund <andres@anarazel.de> wrote:
> As a motivation, here's a somewhat juicy example of the benefits that
> can be gained (disabled parallelism, results vary too much):
> SELECT l_linestatus, SUM(l_quantity) FROM lineitem GROUP BY 1 ORDER BY 2 DESC LIMIT 10 ;
> non-optimized: 9075.835 ms
>     optimized: 5194.024 ms
>
> and that's far from doing all the possible work for that query;
> especially the batching is far from optimal.

That's pretty great. Let me first just say that I think that you
broadly have the right idea here, and that it will likely be a big
area to work on in the years ahead. This may become a big, general
area with many sub-projects, kind of like parallelism. Also, I think
it's very much complementary to parallelism.

> So far I've mostly looked at performance for tpc-h, not because it's a
> particularly good workload, but because it's available.  If somebody has
> good suggestions for an analytics heavy workload...

I think that it's the only thing available that isn't a pain to set up.

> My observations about the performance bottlenecks I found, and partially
> addressed, are in rough order of importance (there's interdependence
> between most of them):
>
> (List of various bottlenecks)

> I'm planning to start individual subthreads for some of these, with
> in-detail discussions and/or patches. But I wanted to present this on a
> higher level once, before spending even more time.

Good call.

> Have I missed concrete issues others noticed? Missed significant
> improvements (please, please, only realistic stuff)?

I'll add one: We should have a way to make routines like
heap_copytuple() "allocate" into the caller's own batch memory. We may
be able to come up with a reasonably generic abstraction that
minimizes the special-case handling of things like exhausting the
caller's batch allocation. Failure to do something like this wastes an
awful lot of memory, but more importantly it causes a lot of unnecessary
cache misses in tuplestore.c, and, even with the 9.6 work, in
tuplesort.c.
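
To sketch the sort of interface I'm imagining (purely hypothetical,
nothing like this exists today; the struct and function names are made
up, and growing/flushing the batch is hand-waved away):

typedef struct TupleBatchBuffer
{
    char       *base;           /* caller-provided, MAXALIGN'd batch memory */
    Size        used;           /* bytes already handed out */
    Size        capacity;       /* total bytes available */
} TupleBatchBuffer;

static HeapTuple
heap_copytuple_batched(HeapTuple tuple, TupleBatchBuffer *buf)
{
    Size        len = HEAPTUPLESIZE + tuple->t_len;
    HeapTuple   copy;

    if (buf->used + MAXALIGN(len) > buf->capacity)
        return NULL;            /* caller must grow or flush the batch */

    /* bump-allocate from the batch instead of a retail palloc() */
    copy = (HeapTuple) (buf->base + buf->used);
    buf->used += MAXALIGN(len);

    copy->t_len = tuple->t_len;
    copy->t_self = tuple->t_self;
    copy->t_tableOid = tuple->t_tableOid;
    copy->t_data = (HeapTupleHeader) ((char *) copy + HEAPTUPLESIZE);
    memcpy((char *) copy->t_data, (char *) tuple->t_data, tuple->t_len);

    return copy;
}

The caller would allocate (and eventually reset or free) buf->base in
bulk, so tuples that are processed together also end up adjacent in
memory.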

Someone should also work on palloc() fragmentation, but I tend to
think that that's a less efficient way of improving matters. I'd leave
that until later. Having an excessive number of retail palloc() calls
for cases where the caller really does process tuples in a batch
fashion is inherently inefficient. I'm all for the simplicity of the
memory context abstraction, but ISTM that a little specialization gets
you very far.

> Unfortunately these items are heavily interdependent. For some queries
> you'll need issue n) addressed before n+2) shows a benefit, for some
> it's the other way, and such.  Given the size of the overall task, the
> only way I can see this being reasonably addressable, is trying to get
> smaller steps in individually, even if not a huge win on their own. And
> then focus on the larger architectural changes (primarily 3))
> issues.

I could not agree more. This should be strongly emphasized.

I have parallel CREATE INDEX working pretty well now. I don't think I
could have done that without the 9.6 work on external sorting's cache
efficiency, so in some sense that earlier work has enabled parallelism
in a non-obvious way. I think that abbreviated keys will prove really
important for getting the best out of parallel sort (and that a cost
model will have to consider stuff like leading attribute cardinality
-- in other words, it must consider CPU cache efficiency). I also
suspect that solving the problems with heap_deform_tuple() first would
make my pending parallel CREATE INDEX patch look a lot better relative
to a serial CREATE INDEX. That's the kind of intuitive speculation
that's really hard to prove. It may be that it doesn't matter there,
because working on fixing that yields obvious benefits. But, as you
suggest, we should consider that that might not always be true. That's
really tricky, and has historically been the kind of thing we've
managed very badly.

-- 
Peter Geoghegan



Re: Improving executor performance

From
Bruce Momjian
Date:
On Fri, Jun 24, 2016 at 05:25:27PM -0700, Peter Geoghegan wrote:
> On Fri, Jun 24, 2016 at 4:29 PM, Andres Freund <andres@anarazel.de> wrote:
> > As a motivation, here's a somewhat juicy example of the benefits that
> > can be gained (disabled parallelism, results vary too much):
> > SELECT l_linestatus, SUM(l_quantity) FROM lineitem GROUP BY 1 ORDER BY 2 DESC LIMIT 10 ;
> > non-optimized: 9075.835 ms
> >     optimized: 5194.024 ms
> >
> > and that's far from doing all the possible work for that query;
> > especially the batching is far from optimal.
> 
> That's pretty great. Let me first just say that I think that you
> broadly have the right idea here, and that it will likely be a big
> area to work on in the years ahead. This may become a big, general
> area with many sub-projects, kind of like parallelism. Also, I think
> it's very much complementary to parallelism.

Agreed.  I did put out a blog entry about this in April that has links
to some external projects trying to address this issue:
http://momjian.us/main/blogs/pgblog/2016.html#April_1_2016

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +



Re: Improving executor performance

From
Rajeev rastogi
Date:
On 25 June 2016 05:00, Andres Freund Wrote:
>To: pgsql-hackers@postgresql.org
>Subject: [HACKERS] Improving executor performance
>
>My observations about the performance bottlenecks I found, and partially
>addressed, are in rough order of importance (there's interdependence
>between most of them):
>
>1) Cache misses cost us a lot, doing more predictable accesses can make
>   a huge difference. Without addressing that, many other bottlenecks
>   don't matter all that much.  I see *significant* performance
>   improvements for large seqscans (when the results are used) simply
>   memcpy()'ing the current target block.
>
>   This partially is an intrinsic problem of analyzing a lot of data,
>   and partially because our access patterns are bad.
>
>
>2) Baring 1) tuple deforming is the biggest bottleneck in nearly all
>   queries I looked at. There's various places we trigger deforming,
>   most ending in either slot_deform_tuple(), heap_getattr(),
>   heap_deform_tuple().

Agreed. I had also observed similar behavior, specifically (2), while
working on improving executor performance. I had done some prototype
work on improving performance by using native compilation; details are
available on my blog:
http://rajeevrastogi.blogspot.in/2016/03/native-compilation-part-2-pgday-asia.html


>3) Our 1-by-1 tuple flow in the executor has two major issues:

Agreed. In order to tackle this, IMHO we should:
1. Make the processing data-centric instead of operator-centric.
2. Instead of pulling each tuple from its immediate child operator, an operator can push the tuple to its parent. It can be allowed to push until it reaches an operator that cannot be processed without results from another operator.
More details in another thread:
https://www.postgresql.org/message-id/BF2827DCCE55594C8D7A8F7FFD3AB77159A9B904@szxeml521-mbs.china.huawei.com

Thanks and Regards,
Kumar Rajeev Rastogi



Re: Improving executor performance

From
Andres Freund
Date:
Hi,

On 2016-06-28 10:01:28 +0000, Rajeev rastogi wrote:
> >3) Our 1-by-1 tuple flow in the executor has two major issues:
> 
> Agreed, In order to tackle this IMHO, we should
> 1. Makes the processing data-centric instead of operator centric.
> 2. Instead of pulling each tuple from immediate operator, operator can push the tuple to its parent. It can be
> allowed to push until it sees any operator, which cannot be processed without result from other operator.
 
> More details from another thread:
> https://www.postgresql.org/message-id/BF2827DCCE55594C8D7A8F7FFD3AB77159A9B904@szxeml521-mbs.china.huawei.com 

I doubt that that's going to be ok in the generic case (memory usage,
materializing too much, "bushy plans", merge joins), and it's way more
invasive than what I'm thinking of.  I think that by having batches of tuples
"bubble" up in a Volcano-style executor, it's possible to get most of
the improvements of both approaches.

Andres



Re: Improving executor performance

From
Craig Ringer
Date:
On 30 June 2016 at 02:32, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2016-06-28 10:01:28 +0000, Rajeev rastogi wrote:
> >3) Our 1-by-1 tuple flow in the executor has two major issues:
>
> Agreed, In order to tackle this IMHO, we should
> 1. Makes the processing data-centric instead of operator centric.
> 2. Instead of pulling each tuple from immediate operator, operator can push the tuple to its parent. It can be allowed to push until it sees any operator, which cannot be processed without result from other operator.
> More details from another thread:
> https://www.postgresql.org/message-id/BF2827DCCE55594C8D7A8F7FFD3AB77159A9B904@szxeml521-mbs.china.huawei.com

I doubt that that's going to be ok in the generic case (memory usage,
materializing too much, "bushy plans", merge joins)

Yeah. You'd likely start landing up with Haskell-esque predictability of memory use. Given how limited and flawed work_mem handling etc already is, that doesn't sound like an appealing direction to go in. Not without a bunch of infrastructure to manage queue sizes and force work into batches to limit memory use, anyway.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: Improving executor performance

From
Andres Freund
Date:
Hi,

On 2016-06-24 16:29:53 -0700, Andres Freund wrote:
> 2) Baring 1) tuple deforming is the biggest bottleneck in nearly all
>    queries I looked at. There's various places we trigger deforming,
>    most ending in either slot_deform_tuple(), heap_getattr(),
>    heap_deform_tuple().
>
>    This can be optimized significantly, but even after that it's still a
>    major issue.
>
>    Part of the problem is that we sometimes don't know how many elements
>    to access (especially in index and hashing related code), the other
>    part that we're deforming a lot of columns that we'll never need,
>    just because we need a subsequent one.
>
>    The other part is that our tuple format requires a number of
>    relatively expensive operations to access data (e.g. alignment
>    computations, checking the null bitmap).

> 6) The use of linked lists adds noticeable #instructions overhead and
>    branch misses. Just replacing {FuncExprState,BoolExprState}->args
>    with an array gives a ~10%ish boost for queries with a bunch of
>    quals.  Another example that appears to hurts noticeably, but which
>    I've not experimentally fixed, is AggState->hash_needed.
>
>    To a large degree these seem fairly easily fixable; it's kinda boring
>    work, but not really hard.

As previously discussed, many of the "architectural" changes show limited
success until a few other bottlenecks are fixed: most prominently slot
deforming and, what I'm planning to primarily discuss in this email,
expression evaluation.

Even in the current state, profiles of queries evaluating a large number
of tuples very commonly show expression evaluation to be one of the
major costs. Due to the number of recursive calls, that's not always easy
to pinpoint; the easiest thing to spot is usually
MakeFunctionResultNoSet showing up prominently.

While, as in 6) above, removing linked lists from the relevant
structures helps, it doesn't help all that much. Staring at this for a
long while made me realize, somewhat surprisingly to me, that one of the
major problems is that we are bottlenecked on stack usage. Constantly
entering and leaving this many functions for trivial expression
evaluations hurts considerably. Some of that is the additional number
of instructions, some of it is managing return jump addresses, and
some of it the actual bus traffic. It's also rather expensive to check
for stack limits at a very high rate.


Attached (in patch 0003) is a proof-of-concept implementing an
expression evaluation framework that doesn't use recursion. Instead,
ExecInitExpr2 computes the series of 'steps' necessary to compute an
expression. These steps are stored in a linear array and executed one
after another (save boolean expressions, which can jump to later steps).
E.g. to compute colname = 1 the steps are 1) fetch var, 2) evaluate
const, 3) call function.

Expression evaluation then is a switch statement over the opcodes of
each of these steps.
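
To illustrate the shape of this, here's a heavily simplified sketch; the
actual structs, opcodes and the handling of strict functions / NULLs in
the patch look different:

typedef enum ExprStepOp
{
    STEP_FETCH_VAR,             /* read a column from the scan slot into a temp */
    STEP_CONST,                 /* load a constant into a temp */
    STEP_FUNCCALL_2ARGS,        /* call a 2-argument function over temps */
    STEP_DONE                   /* return the referenced temp as the result */
} ExprStepOp;

typedef struct ExprStep
{
    ExprStepOp  op;
    int         resno;          /* index of the temp this step writes/returns */
    int         attnum;         /* STEP_FETCH_VAR: attribute number */
    Datum       constval;       /* STEP_CONST */
    bool        constisnull;
    FmgrInfo   *finfo;          /* STEP_FUNCCALL_2ARGS */
    int         arg1;           /* temps holding the function arguments */
    int         arg2;
} ExprStep;

static Datum
ExecEvalStepsSketch(ExprStep *steps, TupleTableSlot *slot, bool *isnull)
{
    Datum       temps[16];      /* fixed-size temp area, for brevity only */
    bool        tempnulls[16];
    ExprStep   *step;

    for (step = steps;; step++)
    {
        switch (step->op)
        {
            case STEP_FETCH_VAR:
                /* assumes slot_getsomeattrs() already deformed enough columns */
                temps[step->resno] = slot->tts_values[step->attnum - 1];
                tempnulls[step->resno] = slot->tts_isnull[step->attnum - 1];
                break;
            case STEP_CONST:
                temps[step->resno] = step->constval;
                tempnulls[step->resno] = step->constisnull;
                break;
            case STEP_FUNCCALL_2ARGS:
                /* strictness / NULL handling omitted for brevity */
                temps[step->resno] = FunctionCall2(step->finfo,
                                                   temps[step->arg1],
                                                   temps[step->arg2]);
                tempnulls[step->resno] = false;
                break;
            case STEP_DONE:
                *isnull = tempnulls[step->resno];
                return temps[step->resno];
        }
    }
}

For colname = 1 the step array would be STEP_FETCH_VAR, STEP_CONST,
STEP_FUNCCALL_2ARGS, STEP_DONE; boolean expressions additionally get
steps that can jump forward over later steps.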

Using the machinery of those 'steps' to compute projections, quals, and
constraint evaluation then allows us to reduce the number of indirect
function calls/jumps further.

My preliminary implementation so far implements only function calls,
boolean expressions, constant evaluation and variable evaluation. For
everything else I'm falling back to the current expression machinery.

By combining expression, qual and target list processing, we can also
always generate the appropriate slot_getsomeattrs() calls.

Note that the patch currently does *NOT* support targetlist SRFs, and
instead just errors out. This is not a fundamental issue; I just didn't
want to invest time in supporting something we want to reimplement
anyway.   Similarly, subplans currently don't work because of:
    /*
     * Evaluate lefthand expressions and form a projection tuple. First we
     * have to set the econtext to use (hack alert!).
which doesn't work quite like that at the moment.


I've used a similar method (in patch 0004) to considerably reduce the
number of branches and pipeline stalls in slot_deform_tuple(). For each
attribute a 'step' is generated which contains exactly the computations
required to deform that individual datum. That, e.g., allows separating
the cases for different alignments and null checks.


Having expression evaluation and slot deforming as a series of simple
sequential steps, instead of complex recursive calls, would also make it
fairly straightforward to optionally just-in-time compile those.


For motivation, here's some random performance differences:
SELECT SUM(l_quantity * l_extendedprice) FROM lineitem;
master: 5038.382 4965.310 4983.146
patches: 4194.593 4153.759 4168.986
tpch-q1
master: 21274.896
dev: 17952.678

For queries involving more complex expressions, the difference can be
larger.


Both expression processing and tuple deforming can use considerable
additional improvement, but I think the approaches presented here are
the biggest steps that I can see.


The reason that I'm bringing this up before submitting actual 'batch
operation' patches is that the architectural improvements are quickly
hidden behind these bottlenecks.


Before spending time polishing up these approaches, I'd like to know if
anybody fundamentally disagrees with either, or has a better proposal. If
not, I'm hoping to first "properly" submit the slot deforming for review,
and then the expression evaluation. The latter would obviously need to be
a lot more complete than it is now, and we'd likely want to do the
targetlist SRF rearchitecting beforehand.

Comments?

Regards,

Andres

Attachment

Re: Improving executor performance - tidbitmap

From
Andres Freund
Date:
Hi,

On 2016-06-24 16:29:53 -0700, Andres Freund wrote:
> 4) Various missing micro optimizations have to be performed, for more
>    architectural issues to become visible. E.g. [2] causes such bad
>    slowdowns in hash-agg workloads, that other bottlenecks are hidden.

One such issue is the usage of dynahash.c in tidbitmap.c. In many
queries, e.g. tpch q7, the bitmapscan is often the bottleneck. Profiling
shows that this is largely due to dynahash.c being slow. The primary
issues are: a) the two-level structure doubling the number of indirect
lookups, b) indirect function calls, c) using separate-chaining-based
conflict resolution, and d) being too general.

I've quickly hacked up an alternative linear addressing hashtable
implementation. And the improvements are quite remarkable.
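
The core of it looks roughly like the following standalone sketch (not
the attached patch; growing the table and deletion are left out, and it
assumes the fill factor is kept sane so a probe always terminates):

#include <stdint.h>
#include <stdlib.h>

typedef struct HashBucket
{
    uint32_t    key;            /* e.g. a BlockNumber */
    void       *value;          /* e.g. a PagetableEntry */
    char        used;
} HashBucket;

typedef struct SimpleHash
{
    HashBucket *buckets;        /* one flat, contiguous allocation */
    uint32_t    mask;           /* size is a power of two, mask = size - 1 */
} SimpleHash;

static uint32_t
hash_u32(uint32_t k)
{
    /* cheap integer hash; the backend would use hash_uint32() */
    k ^= k >> 16;
    k *= 0x85ebca6b;
    k ^= k >> 13;
    k *= 0xc2b2ae35;
    k ^= k >> 16;
    return k;
}

static SimpleHash *
sh_create(uint32_t size)        /* size must be a power of two */
{
    SimpleHash *h = malloc(sizeof(SimpleHash));

    h->mask = size - 1;
    h->buckets = calloc(size, sizeof(HashBucket));
    return h;
}

/* find-or-insert; no indirect function calls, no per-entry chain nodes */
static HashBucket *
sh_insert(SimpleHash *h, uint32_t key)
{
    uint32_t    i = hash_u32(key) & h->mask;

    for (;;)
    {
        HashBucket *b = &h->buckets[i];

        if (!b->used)
        {
            b->used = 1;
            b->key = key;
            b->value = NULL;
            return b;
        }
        if (b->key == key)
            return b;
        i = (i + 1) & h->mask;  /* linear probing: just try the next bucket */
    }
}

All entries live in one flat array, so a probe is a predictable
sequential scan from the hashed position, which is exactly what the
cache and prefetcher like.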

Example Query:
EXPLAIN ANALYZE SELECT SUM(l_extendedprice) FROM lineitem WHERE (l_shipdate >= '1995-01-01'::date) AND (l_shipdate <=
'1996-12-31'::date);

before:

                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=942046.72..942046.73 rows=1 width=8) (actual time=4647.908..4647.909 rows=1 loops=1)
   ->  Bitmap Heap Scan on lineitem  (cost=193514.84..919246.17 rows=9120222 width=8) (actual time=2667.821..3885.598 rows=9113219 loops=1)
         Recheck Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date))
         Heap Blocks: exact=585542
         ->  Bitmap Index Scan on i_l_shipdate  (cost=0.00..191234.78 rows=9120222 width=0) (actual time=2461.622..2461.622 rows=9113219 loops=1)
               Index Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date))
 Planning time: 0.170 ms
 Execution time: 4648.921 ms

timing without analyze: 4136.425 4101.873 4177.441

after:

                                                             QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=942046.72..942046.73 rows=1 width=8) (actual time=3218.422..3218.423 rows=1 loops=1)
   ->  Bitmap Heap Scan on lineitem  (cost=193514.84..919246.17 rows=9120222 width=8) (actual time=1153.707..2430.500 rows=9113219 loops=1)
         Recheck Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date))
         Heap Blocks: exact=585542
         ->  Bitmap Index Scan on i_l_shipdate  (cost=0.00..191234.78 rows=9120222 width=0) (actual time=952.161..952.161 rows=9113219 loops=1)
               Index Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date))
 Planning time: 1.075 ms
 Execution time: 3221.861 ms
(8 rows)

timing without analyze: 2647.364 2674.456 2680.197

As you can see, the time for the bitmap index scan goes from 2461.622 ms
to 952.161 ms.

I'm not proposing to apply the patch as-is, but I think it's a good
starting point to discuss how to evolve our use of hash tables.

I'm wondering whether we can do 'macro-based templates' or something,
i.e. have something like the simplehash in the patch live in
simplehash.h, but with the key/value widths, the function names, and so
on all determined by macros (oh, this would be easier with C++ templates...).
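
Something along these lines is what I have in mind; a very rough
two-file sketch, with all the SH_* names invented for illustration
rather than being a worked-out API:

/* ---- hypothetical simplehash_template.h ---- */
#define SH_PASTE_(a,b) a##_##b
#define SH_PASTE(a,b) SH_PASTE_(a,b)
#define SH_NAME(sym) SH_PASTE(SH_PREFIX, sym)

typedef struct SH_NAME(entry)
{
    SH_KEY_TYPE key;
    SH_VALUE_TYPE value;
    char        used;
} SH_NAME(entry);

/* linear-probing lookup, fully inlinable, no void * and no function pointers */
static inline SH_NAME(entry) *
SH_NAME(lookup)(SH_NAME(entry) *buckets, uint32 mask, SH_KEY_TYPE key)
{
    uint32      i = SH_HASH_KEY(key) & mask;

    while (buckets[i].used)
    {
        if (SH_EQUAL(buckets[i].key, key))
            return &buckets[i];
        i = (i + 1) & mask;
    }
    return NULL;
}

/* ---- in tidbitmap.c ---- */
#define SH_PREFIX pagetable
#define SH_KEY_TYPE BlockNumber
#define SH_VALUE_TYPE PagetableEntry
#define SH_HASH_KEY(k) DatumGetUInt32(hash_uint32(k))
#define SH_EQUAL(a, b) ((a) == (b))
#include "simplehash_template.h"
/* now pagetable_entry and pagetable_lookup() exist, specialized for those types */

Each including file gets its own fully specialized entry type and
functions, which the compiler can inline, instead of dynahash's void
pointers and function-pointer callbacks.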

Does anybody have a better idea?

The major issue with the simplehash implementation in the patch is
probably deletion, which should move cells around rather than use
tombstones. But that was too complex for a POC ;). Also, it'd likely
need a proper iterator interface.

FWIW, the dynahash usage in nodeAgg.c is a major bottleneck in a number
of other queries.

Regards,

Andres

Attachment

Re: Improving executor performance

From
Andreas Seltenreich
Date:
Andres Freund writes:

> Having expression evaluation and slot deforming as a series of simple
> sequential steps, instead of complex recursive calls, would also make it
> fairly straightforward to optionally just-in-time compile those.

I don't think that JIT becomes easier by this change.  Constructing the
IR for LLVM, libFirm or any other JIT library from expression trees is
straightforward already.  It's probably more of a nuisance for those
that already have some code/design on JIT-compiling expressions
(vitessedb, ISP RAS, yours truly)

I like your patch for all the other reasons stated though!

regards
Andreas



Re: Improving executor performance

From
Andres Freund
Date:
On 2016-07-14 20:04:03 +0200, Andreas Seltenreich wrote:
> Andres Freund writes:
> 
> > Having expression evaluation and slot deforming as a series of simple
> > sequential steps, instead of complex recursive calls, would also make it
> > fairly straightforward to optionally just-in-time compile those.
> 
> I don't think that JIT becomes easier by this change.  Constructing the
> IR for LLVM, libFirm or any other JIT library from expression trees is
> straightforward already.  It's probably more of a nuisance for those
> that already have some code/design on JIT-compiling expressions
> (vitessedb, ISP RAS, yours truly)

The problem is that the previous form has a lot of ad-hoc analysis
strewn in. The interesting part is getting rid of all that. That's what
the new ExecInitExpr2() does. The target form can be both evaluated more
efficiently in the dispatch manner in the patch, and quite simply
converted to a JIT - without duplicating the analysis code.  I did write
a small ad-hoc x86 jit, and it was really quite straightforward this
way.

What did you do with JIT and expression evaluation? You basically just
replaced the toplevel ExprState node with a different evalfunc, pointing
into your code?

Andres



Re: Improving executor performance

From
Andreas Seltenreich
Date:
Andres Freund writes:

> The problem is that the previous form has a lot of ad-hoc analysis
> strewn in. The interesting part is getting rid of all that. That's what
> the new ExecInitExpr2() does. The target form can be both evaluated more
> efficiently in the dispatch manner in the patch, and quite simply
> converted to a JIT - without duplicating the analysis code.  I did write
> a small ad-hoc x86 jit, and it was really quite straightforward this
> way.

Ja, I see the advantage when doing ad-hoc-JIT compilation.

> What did you do with JIT and expression evaluation? You basically just
> replaced the toplevel ExprState node with a different evalfunc, pointing
> into your code?

That's the plan, yes.  I'm sorry there's no publishable code yet on
the postgres side of things.  Using libFirm[1], the plan is to:

1. Automatically generate Firm-IR for the static C code around
   expression evaluation as well as for the operators in the system
   catalog.

2. Construct IR for expression trees (essentially all the function calls
   the executor would do).

3. Push libFirm's optimize button.  At this stage, most of the
   dispatching goes away by inlining the calls, including functions from
   the catalog implementing operators.

4. Generate code and replace the toplevel expression node with a funcall
   node.

I did implement this recipe with a toy Forth interpreter to see whether
libFirm was up to the job (Nobody's done JIT with libFirm before).  The
results were favorable[2].  Currently, a student at credativ is working
on applying these techniques to postgres.

regards,
Andreas

Footnotes: 
[1]  http://libfirm.org/

[2]  https://korte.credativ.com/~ase/firm-postgres-jit-forth.pdf




Re: Improving executor performance

From
Andres Freund
Date:
On 2016-07-14 23:03:10 +0200, Andreas Seltenreich wrote:
> That's the plan, yes.  I'm sorry there's no publishable code yet on
> the postgres side of things.  Using libFirm[1], the plan is to:

Why libfirm? It seems to only have x86 and sparc backends, and no
windows support?


> 1. Automatically generate Firm-IR for the static C code around
>    expression evaluation as well operators in the system catalog.

> 2. Construct IR for expression trees (essentially all the function calls
>    the executor would do).

But that essentially means redoing most of execQual's current code in IR
- or do you want to do that via 1) above?  As long as the preparation
code (which is currently intermixed with the execution phase) isn't
separated, that means pulling in essentially the whole backend, given
that we do catalog lookups and everything during that phase.


> Currently, a student at credativ is working on applying these
> techniques to postgres.

Are you planning to contribute this to postgres proper?


Regards,

Andres



Re: Improving executor performance

From
Andreas Seltenreich
Date:
Andres Freund writes:

> On 2016-07-14 23:03:10 +0200, Andreas Seltenreich wrote:
>> That's the plan, yes.  I'm sorry there's no publishable code yet on the
>> the postgres side of things.  Using libFirm[1], the plan is to.
>
> Why libfirm?

- It has a more modern IR than LLVM (they're catching up though)
- It's very lightweight, compiles faster than LLVM's ./configure run
- I do have lots of experience with it and a committer bit

> It seems to only have x86 and sparc backends, and no windows support?

Ack, it's mostly used in research projects; that's why the number of
supported ISAs is small.  It's enough to answer the burning question of
what speedup is to be expected from JIT-compiling things in the backend,
though.  Also, if this thing actually takes off, adding more backends is
something that is easier with libFirm than LLVM, IMHO.

>> 1. Automatically generate Firm-IR for the static C code around
>>    expression evaluation as well operators in the system catalog.
>
>> 2. Construct IR for expression trees (essentially all the function calls
>>    the executor would do).
>
> But that essentially means redoing most of execQual's current code in IR
> - or do you want to do that via 1) above?

Manually re-doing backend logic in IR is a can of worms I do not want to
open.  This would guarantee bugs and result in a maintenance nightmare,
so doing 1) for the code is the only option when something turns out to
be a bottleneck IMHO.

> As long as the preparation code (which is currently intermixed with
> the execution phase) isn't separated, that means pulling essentially
> the whole backend given that we do catalog lookups and everything
> during that phase.

Right, the catalog lookups need to be done before JIT-compilation to
allow inlining operators.

>> Currently, a student at credativ is working on applying these
>> techniques to postgres.
>
> Are you planning to contribute this to postgres proper?

The goal is to publish it as an extension that sneaks into planner_hook.
I think BSD-licensing is ok as long as libfirm (LGPL) is kept as an
external dependency.

regards,
Andreas



Re: Improving executor performance - tidbitmap

From
Peter Geoghegan
Date:
On Wed, Jul 13, 2016 at 8:06 PM, Andres Freund <andres@anarazel.de> wrote:
> I've quickly hacked up an alternative linear addressing hashtable
> implementation. And the improvements are quite remarkable.
>
> Example Query:
> EXPLAIN ANALYZE SELECT SUM(l_extendedprice) FROM lineitem WHERE (l_shipdate >= '1995-01-01'::date) AND (l_shipdate <=
'1996-12-31'::date);

> timing without analyze: 4136.425 4101.873 4177.441
>
> after:

> timing without analyze: 2647.364 2674.456 2680.197
>
> as you can see the time for the bitmap index scan goes from 2461.622
> to 952.161.

That's pretty great. I wonder what this would look like with a BRIN
index, since l_shipdate looks like a good candidate for BRIN indexing.


-- 
Peter Geoghegan



Re: Improving executor performance - tidbitmap

From
Andres Freund
Date:
On 2016-07-14 20:41:21 -0700, Peter Geoghegan wrote:
> On Wed, Jul 13, 2016 at 8:06 PM, Andres Freund <andres@anarazel.de> wrote:
> > I've quickly hacked up an alternative linear addressing hashtable
> > implementation. And the improvements are quite remarkable.
> >
> > Example Query:
> > EXPLAIN ANALYZE SELECT SUM(l_extendedprice) FROM lineitem WHERE (l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date);
 
> 
> > timing without analyze: 4136.425 4101.873 4177.441
> >
> > after:
> 
> > timing without analyze: 2647.364 2674.456 2680.197
> >
> > as you can see the time for the bitmap index scan goes from 2461.622
> > to 952.161.
> 
> That's pretty great. I wonder what this would look like with a BRIN
> index, since l_shipdate looks like a good candidate for BRIN indexing.

Brin indexes IIRC always end up using tidbitmap.c, so the benefits
should be there as well ;)



Re: Improving executor performance - tidbitmap

From
Peter Geoghegan
Date:
On Thu, Jul 14, 2016 at 8:45 PM, Andres Freund <andres@anarazel.de> wrote:
> Brin indexes IIRC always end up using tidbitmap.c, so the benefits
> should be there as well ;)

Right. Might the improvement be even more pronounced, though?

I'm not sure how a BRIN index with a suitable physical/logical
correlation performs compared to a bitmap index scan of a B-Tree (due
to a less selective bitmap index scan qual, as in your example), but
offhand I guess that could be faster in general, making the bottleneck
you're addressing relatively greater there.

-- 
Peter Geoghegan



Re: Improving executor performance - tidbitmap

From
Robert Haas
Date:
On Wed, Jul 13, 2016 at 11:06 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-06-24 16:29:53 -0700, Andres Freund wrote:
>> 4) Various missing micro optimizations have to be performed, for more
>>    architectural issues to become visible. E.g. [2] causes such bad
>>    slowdowns in hash-agg workloads, that other bottlenecks are hidden.
>
> One such issue is the usage of dynahash.c in tidbitmap.c. In many
> queries, e.g. tpch q7, the bitmapscan is often the bottleneck. Profiling
> shows that this is largely due to dynahash.c being slow. Primary issues
> are: a) two level structure doubling the amount of indirect lookups b)
> indirect function calls c) using separate chaining based conflict
> resolution d) being too general.
>
> I've quickly hacked up an alternative linear addressing hashtable
> implementation. And the improvements are quite remarkable.

Nice!

> I'm wondering whether we can do 'macro based templates' or
> something. I.e. have something like the simplehash in the patch in
> simplehash.h, but the key/value widths, the function names, are all
> determined by macros (oh, this would be easier with C++ templates...).
>
> Does anybody have a better idea?

No.

> The major issue with the simplehash implementation in the path is
> probably the deletion; which should rather move cells around, rather
> than use toombstones. But that was too complex for a POC ;). Also, it'd
> likely need a proper iterator interface.

Do we ever need to delete from a TIDBitmap?  Probably not, but I'm
guessing you have other uses for this in mind.

> FWIW, the dynahash usage in nodeAgg.c is a major bottleneck in a number
> of other queries.

Can we use this implementation for that as well, or are we going to
need yet another one?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Improving executor performance - tidbitmap

From
Andres Freund
Date:
On 2016-07-17 08:32:17 -0400, Robert Haas wrote:
> On Wed, Jul 13, 2016 at 11:06 PM, Andres Freund <andres@anarazel.de> wrote:
> > The major issue with the simplehash implementation in the path is
> > probably the deletion; which should rather move cells around, rather
> > than use toombstones. But that was too complex for a POC ;). Also, it'd
> > likely need a proper iterator interface.
>
> Do we ever need to delete from a TIDBitmap?  Probably not, but I'm
> guessing you have other uses for this in mind.

We do, via BitmapAnd.


> > FWIW, the dynahash usage in nodeAgg.c is a major bottleneck in a number
> > of other queries.
>
> Can we use this implementation for that as well, or are we going to
> need yet another one?

I've not tested it, but I'd assume that something like the simplehash
should work there. It's a bit more complicated of a scenario though.

Andres



Re: Improving executor performance

From
Peter Geoghegan
Date:
On Wed, Jul 13, 2016 at 6:18 PM, Andres Freund <andres@anarazel.de> wrote:
> While, as in 6) above, removing linked lists from the relevant
> structures helps, it's not that much. Staring at this for a long while
> made me realize that, somewhat surprisingly to me, is that one of the
> major problems is that we are bottlenecked on stack usage. Constantly
> entering and leaving this many functions for trivial expressions
> evaluations hurts considerably. Some of that is the additional numbers
> of instructions, some of that is managing return jump addresses, and
> some of that the actual bus traffic. It's also rather expensive to check
> for stack limits at a very high rate.

You'll recall how I complained that parallel CREATE INDEX, while
generally effective, became incredibly CPU bound on the still-serial
merge in my C-collated text test case (I told you this in person, I
think). I looked into addressing this bottleneck, and made an
interesting discovery, which kind of reminds me of what you say here
about function call overhead.

I hacked up varstrfastcmp_c() to assume 4 byte varlena headers. No
function call to pg_detoast_datum_packed() is made (which otherwise
happens through all that DatumGetVarStringPP() macro indirection). Of
course, that assumption is dangerous in the general case, but it could
be made to work in most cases with a little care, as in practice the
vast majority of text comparisons are of text Datums with a 4 byte
varlena header. SortSupport could reliably detect if it was safe, with
a little help from the text opclass, or something along those lines.
This might not just be useful with sorting, but it does seem to be
particularly useful with parallel sorting.
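
The hacked-up comparator ends up looking roughly like this (a sketch of
the experiment, not committable code; the name is made up, and it simply
assumes both datums are plain, untoasted text with 4-byte headers):

static int
varstrfastcmp_c_4b(Datum x, Datum y, SortSupport ssup)
{
    /* caller guarantees: 4-byte varlena header, not toasted, not compressed */
    text       *arg1 = (text *) DatumGetPointer(x);
    text       *arg2 = (text *) DatumGetPointer(y);
    int         len1 = VARSIZE(arg1) - VARHDRSZ;
    int         len2 = VARSIZE(arg2) - VARHDRSZ;
    int         result;

    Assert(VARATT_IS_4B(arg1) && VARATT_IS_4B(arg2));

    result = memcmp(VARDATA(arg1), VARDATA(arg2), Min(len1, len2));
    if (result == 0 && len1 != len2)
        result = (len1 < len2) ? -1 : 1;

    /* no pg_detoast_datum_packed(), and no "did we detoast?" pfree() checks */
    return result;
}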

Just on my laptop, this makes some parallel CREATE INDEX gensort test
cases [1] take as much as 15% less time to execute overall. That's a
big difference. I looked at the disassembly, and the number of
instructions for varstrfastcmp_c() was reduced from 113 to 29. That's
the kind of difference that could add up to a lot.

[1] https://github.com/petergeoghegan/gensort
-- 
Peter Geoghegan



Re: Improving executor performance

From
Doug Doole
Date:
On 07/13/2016 06:18 PM, Andres Freund wrote:
> Attached (in patch 0003) is a proof-of-concept implementing an
> expression evalution framework that doesn't use recursion. Instead
> ExecInitExpr2 computes a number of 'steps' necessary to compute an
> expression. These steps are stored in a linear array, and executed one
> after another (save boolean expressions, which can jump to later steps).
> E.g. to compute colname = 1 the steps are 1) fetch var, 2) evaluate
> const, 3) call function.

We've been having trouble with the performance of simple expressions in 
PLpgSQL so I started playing with this patch. (No sense re-inventing the 
wheel after all.) It was straightforward to extend to simple expressions 
and showed an immediate improvement (~10% faster on a simple test). 
Running in our full environment highlighted a few areas that I think are 
worth more investigation.

However, before I tackle that, is the posted proof-of-concept still the 
latest and greatest? If not, any chance of getting the latest?

Going forward, I'd like to collaborate on our efforts if you're interested.

- Doug Doole
Salesforce




Re: Improving executor performance

From
Andres Freund
Date:
Hi,

On 2016-11-04 15:25:58 -0700, Doug Doole wrote:
> On 07/13/2016 06:18 PM, Andres Freund wrote:
> > Attached (in patch 0003) is a proof-of-concept implementing an
> > expression evalution framework that doesn't use recursion. Instead
> > ExecInitExpr2 computes a number of 'steps' necessary to compute an
> > expression. These steps are stored in a linear array, and executed one
> > after another (save boolean expressions, which can jump to later steps).
> > E.g. to compute colname = 1 the steps are 1) fetch var, 2) evaluate
> > const, 3) call function.
>
> We've been having trouble with the performance of simple expressions in
> PLpgSQL so I started playing with this patch. (No sense re-inventing the
> wheel after all.) It was straightforward to extend to simple expressions and
> showed an immediate improvement (~10% faster on a simple test).

That's cool.


> Running in our full environment highlighted a few areas that I think
> are worth more investigation.

There indeed should be many of those.


> However, before I tackle that, is the posted proof-of-concept still the
> latest and greatest? If not, any chance of getting the latest?

It's definitely *not* the latest and greatest. My latest version
actually passes all regression tests, and has a few additional
expression types handled natively (NULL tests, CASE IIRC), and more than
a few bugs fixed.

I'm away at the moment, but I plan to post a new version at the end of
this week or the beginning of next.


> Going forward, I'd like to collaborate on our efforts if you're interested.

That sounds good!

Regards,

Andres