Thread: WIP: Faster Expression Processing and Tuple Deforming (including JIT)
Hi Everyone,

TL;DR: Making things faster. Architectural evaluation.

As some of you might be aware, I've been working on making execution of larger queries in postgres faster. While working on "batched execution" I came to the conclusion that, while necessary, it currently isn't showing a large benefit because expression evaluation and tuple deforming are massive bottlenecks. I'm posting a quite massive series of WIP patches here, to get some feedback.

Tuple deforming is slow for two reasons:

1) It's the first thing that accesses tuples, i.e. it'll often incur cache misses. That's partially fundamental, but also partially can be addressed, e.g. through changing the access order in heap as in [1].

2) Tuple deforming has a lot of unpredictable branches, because it has to cope with various types of fields. We e.g. perform alignment in a lot of unneeded cases, do null checks for NOT NULL columns et al.

I tried to address 2) by changing the C implementation. That brings some measurable speedups, but it's not huge. A bigger speedup is making slot_getattr, slot_getsomeattrs, slot_getallattrs very trivial wrappers; but it's still not huge. Finally I turned to just-in-time (JIT) compiling the code for tuple deforming. That doesn't save the cost of 1), but it gets rid of most of 2) (from ~15% to ~3% in TPCH-Q01). The first part is done in 0008, the JITing in 0012.

Expression evaluation and projection is another major bottleneck:

1) Our recursive expression evaluation puts a *lot* of pressure on the stack.

2) There are a lot of indirect function calls when recursing to other expression nodes. These are hard to predict, because the same node type (say ExecEvalAnd()) is used in different parts of an expression tree, and invokes different sub-nodes.

3) The function calls to operators and other functions are hard to predict, leading to a significant number of pipeline stalls.

4) There's a fair amount of pg_list.h style list iteration going on; those are cache and pipeline inefficient.

After some experimenting I came to the conclusion that the recursive processing is a fundamental impediment to making this faster. I've converted (0006) expression processing and projection into an opcode dispatch based interpreter. That yields, especially for complex expressions and larger projections, a significant speedup in itself. But similarly to the deforming, expression evaluation remains a bottleneck after that, primarily because there are still a lot of unpredictable jumps and calls, and because loads/stores have to be complex (e.g. ExprContext->ecxt_innertuple->tts_values[i]/tts_isnull[i] for a single scalar var evaluation).

Using the opcode based representation of expression evaluation (as it's nearly linear, and has done a lot of the lookups ahead of time), it's actually quite easy to JIT compile. *After JITing, expression evaluation itself is more than ten times faster than before*. But unfortunately that doesn't mean that queries are ten times faster - usually we'll hit bottlenecks elsewhere relatively soon. WRT expression evaluation, the biggest cost afterwards is the relatively high overhead of V1 function calls - register based parameter passing is a lot faster.

After experimenting a bit with doing JITing manually (a lot of eye-stabbing kind of fun), I chose to use LLVM.

An overview of the patch-queue so far:

0001 Make get_last_attnums more generic.

Boring prerequisite.

0002 More efficient AggState->pertrans iteration.

Relatively boring minor optimization, but it turns out to be an easily hit bottleneck.
Will commit independently.

0003 Avoid materializing SRFs in the FROM list.
0004 Allow ROWS FROM to return functions as single record column.
0005 Basic implementation of targetlist SRFs via ROWS FROM.
0006 Remove unused code related to targetlist SRFs.

These are basically just prerequisites for the faster expression evaluation, and discussed elsewhere [2]. This implementation is *NOT* going to survive, because we ended up coming to the conclusion that using a separate executor node to expand SRFs is a better plan. But the new expression evaluation code won't be able to handle SRFs...

0007 WIP: Optimize slot_deform_tuple() significantly.

This a) turns tuple deforming into an opcode based dispatch loop (using computed goto on gcc/clang), and b) moves a lot of the logic from slot_deform_tuple() callsites into the function itself - that turns out to be more efficient. I'm not entirely sure it's worth doing the opcode based dispatch part if we're going to also do the JIT bit - it's a fair amount of code, and the speed difference only matters on large amounts of rows.

0008 WIP: Faster expression processing and targetlist projection.

This, functionally nearly complete, patch turns expression evaluation (and tuple deforming as a special case of that) into a "mini language" which is interpreted using either a while(true) switch(opcode) loop or computed goto to jump from opcode to opcode. It does so by moving a lot more of the code for expression evaluation to initialization time, and building a linear series of steps to evaluate expressions, thereby removing all recursion from expression processing. This nearly entirely gets rid of the stack usage cost of expression evaluation (we pretty much never recurse except for subplans). Being able to remove now-redundant calls to check_stack_depth() is a noticeable benefit; it turns out that that check has a noticeable performance impact (as it apparently forces actual use of the stack, instead of just renaming registers inside the CPU).

The new representation and evaluation is functionally nearly complete (there's a single regression test failure, and I know why that is), but the code needs a fair amount of polishing. I do absolutely think that the fundamentals of this are the right way to go, and I'm going to work hard on polishing the patch up. But this isn't something that we can easily do in parts, and it's a huge ass patch. So I'd like to have at least some more buyin before wasting even more time on this.

0009 WIP: Add minimal keytest implementation.

More or less experimental patch that tries to compile simple expressions of the form OpExpr(ScalarVar, Const) into a single expression evaluation step. The benefits probably aren't big enough if we do end up doing JITing of expressions.

0010 WIP: Add configure infrastructure to enable LLVM.
0011 WIP: Beginning of a LLVM JIT infrastructure.

Very boring preliminary patches to add --with-llvm and some minimal infrastructure to handle LLVM. If we go this way, JITed stuff needs to be tied to resource owners, and we need some other centralized infrastructure.

0012 Heavily-WIP: JITing of tuple deforming.

This, in a not-yet-that-nice manner, implements a JITed version of the per-column stuff that slot_deform_tuple() does. It currently always deforms all columns, which obviously would have to change. There are also considerable additional performance improvements possible.
With this patch the per-column overhead (minus bitmap handling, which 0007 moved into a separate loop) drops from 10%+ into the low single digits for a number of queries. Afterwards the biggest cost is VARSIZE_ANY() for varlena columns (which atm isn't inlined). That is, besides the initial cache-miss when accessing tuple->t_hoff, which JITing can do nothing about :(

This can be enabled/disabled using the new jit_tuple_deforming GUC. To make this production ready in some form, we'd have to come up with a way to determine when it's worth doing JITing. The easiest way would be to do so after N slot_deform_tuple() calls or such; another way would be to do it based on cost estimates.

0013 WIP: ExprEval: Make threaded dispatch use a separate field.

Boring preliminary patch. Increases memory usage a bit, needs to be thought through more.

0014 Heavily-WIP: JITed expression evaluation.

This is the most interesting bit performance wise. A few common types of expressions are JITed: scalar value accesses, function calls, boolean expressions, aggregate references. This can be enabled using the new jit_expressions GUC. Even for the supported expression types I've taken some shortcuts (e.g. strict functions aren't actually strict).

The performance benefits are quite noticeable. For TPCH, ExecEvalExpr() (which is where 0008 moved all of expression evaluation/projection) goes from being the top profile entry to barely noticeable, with the JITed function usually not showing up in the top five entries anymore.

After the patch it becomes very clear that our function call infrastructure is a serious bottleneck. Passing all the arguments via memory, and, even worse, forcing isnull/values to be on separate cachelines, has significant performance implications. It also becomes quite noticeable that nodeAgg's transition function invocation doesn't go through ExecEvalExpr() but does that itself - which leads to constant mispredictions if several transition values exist.

While the JIT code is relatively verbose, it turns out to not actually be that hard to write after some startup pains. All the JITing of expressions that exists so far was basically written in ~10 hours.

This also needs some heuristics about when JITing is appropriate. Compiling an expression that's only executed once is never going to be faster than doing the interpretation (it at least needs a writable allocation for the code, and then a remap to make that code read-only and executable). A trace based approach (everything executed at least a thousand times) or a cost based one (all queries costing more than 100000 should be JITed) could make sense. It's worthwhile to note that at the moment this is a per-query-execution JIT, not something that can trivially be cached for prepared statements. That'll need further infrastructure.

0015 Super-Heavily-WIP: LLVM perf integration.

This very very very preliminary patch (including some copy-pasted GPL code!) creates /tmp/perf-<pid>.map files, which allow perf to show useful symbols for profile hits in JITed expressions. I plan to push this towards LLVM, so this isn't something PG will have to do, but it's helpful for evaluation.

I eventually plan to start separate threads about some of the parts in here, but I think the overall picture needs some discussion first.

Q: Why LLVM and not a hand-rolled JIT?
A: Because hand-rolling a JIT is probably hard to scale to multiple maintainers, and multiple platforms.
I started down the path of doing a hand-rolled x86 JIT, and that'd also be doable (faster compilation, slower execution basically); but I doubt we'd end up having that on different architectures and platforms. Not to speak of things like proper debugger and profiler integration. I'm not entirely convinced that that's the right path. It might also be a transitional step, towards doing our completely own JIT. But I think it's a sensible step.

Q: Why LLVM and not $jit-toolkit?
A: Because all the other JIT stuff I looked at was either really unportable (mostly x86 linux only), inconveniently licensed (like e.g. gcc's jit library) or nearly unmaintained (luajit's stuff for example). I might have missed something, but ISTM that atm the choice is between hand-rolling and using LLVM.

Q: Does this actually inline functions from the backend?
A: No. That probably is something desirable in the future, but to me that seems like it should be a separate step. The current one's big enough. It also further increases compilation times, so quite possibly we only want to do so based on another set of heuristics.

Q: ?

Comments? Questions?

Regards,

Andres

[1] https://archives.postgresql.org/message-id/20161030073655.rfa6nvbyk4w2kkpk%40alap3.anarazel.de
[2] https://www.postgresql.org/message-id/20160523005327.v2tr7obytitxcnna@alap3.anarazel.de
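To make the opcode dispatch scheme described above a bit more concrete, here is a minimal sketch of the computed-goto pattern (gcc/clang's labels-as-values extension). The opcode names and step layout are purely illustrative - they are not the actual definitions from 0008:

    /* Illustrative sketch only - not the patch's actual opcodes/structs. */
    typedef enum StepOpcode { OP_CONST, OP_INNER_VAR, OP_DONE } StepOpcode;

    typedef struct ExprStep
    {
        StepOpcode  opcode;     /* selects the jump table entry */
        int         attnum;     /* used by OP_INNER_VAR */
        long        constval;   /* used by OP_CONST */
    } ExprStep;

    /* 'steps' must be terminated by an OP_DONE step */
    static long
    eval_steps(ExprStep *steps, long *inner_values)
    {
        /* one label address per opcode, indexed by StepOpcode */
        static void *dispatch[] = { &&do_const, &&do_inner_var, &&do_done };
        ExprStep   *op = steps;
        long        result = 0;

    #define DISPATCH() goto *dispatch[(++op)->opcode]

        goto *dispatch[op->opcode];

    do_const:
        result = op->constval;
        DISPATCH();
    do_inner_var:
        result = inner_values[op->attnum];      /* null handling elided */
        DISPATCH();
    do_done:
        return result;
    }

Because the step array is linear and each handler jumps directly to the next step's handler, the CPU sees one indirect branch per opcode location (each with its own prediction history) instead of a single hard-to-predict switch branch.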
On 2016-12-05 19:49:55 -0800, Andres Freund wrote:
> I'm posting a quite massive series of WIP patches here, to get some
> feedback.

And here's the patches themselves - let's hope they're not too big (after gzip'ing that is).

Andres
Attachment
- 0001-Make-get_last_attnums-more-generic.patch.gz
- 0002-More-efficient-AggState-pertrans-iteration.patch.gz
- 0003-Avoid-materializing-SRFs-in-the-FROM-list.patch.gz
- 0004-Allow-ROWS-FROM-to-return-functions-as-single-record.patch.gz
- 0005-Basic-implementation-of-targetlist-SRFs-via-ROWS-FRO.patch.gz
- 0006-Remove-unused-code-related-to-targetlist-SRFs.patch.gz
- 0007-WIP-Optimize-slot_deform_tuple-significantly.patch.gz
- 0008-WIP-Faster-expression-processing-and-targetlist-proj.patch.gz
- 0009-WIP-Add-minimal-keytest-implementation.patch.gz
- 0010-WIP-Add-configure-infrastructure-to-enable-LLVM.patch.gz
- 0011-WIP-Beginning-of-a-LLVM-JIT-infrastructure.patch.gz
- 0012-Heavily-WIP-JITing-of-tuple-deforming.patch.gz
- 0013-WIP-ExprEval-Make-threaded-dispatch-use-a-separate-f.patch.gz
- 0014-Heavily-WIP-JITed-expression-evaluation.patch.gz
- 0015-Heavily-WIP-LLVM-perf-integration.patch.gz
Andres Freund <andres@anarazel.de> writes:
> I'm posting a quite massive series of WIP patches here, to get some
> feedback.

I guess the $64 question that has to be addressed here is whether we're prepared to accept LLVM as a run-time dependency. There are some reasons why we might not be:

* The sheer mass of the dependency. What's the installed footprint of LLVM, versus a Postgres server? How hard is it to install from source?

* How will we answer people who say they can't accept having a compiler installed on their production boxes for security reasons?

* Are there any currently-interesting platforms that LLVM doesn't work for? (I'm worried about RISC-V as much as legacy systems.)

I concur with your feeling that hand-rolled JIT is right out. But I'm not sure that whatever performance gain we might get in this direction is worth the costs.

regards, tom lane
On Tue, Dec 6, 2016 at 1:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I guess the $64 question that has to be addressed here is whether we're
> prepared to accept LLVM as a run-time dependency. There are some reasons
> why we might not be:
>
> * The sheer mass of the dependency. What's the installed footprint of
> LLVM, versus a Postgres server? How hard is it to install from source?
>
> * How will we answer people who say they can't accept having a compiler
> installed on their production boxes for security reasons?
>
> * Are there any currently-interesting platforms that LLVM doesn't work
> for? (I'm worried about RISC-V as much as legacy systems.)

I think anything that requires LLVM -- or, for that matter, anything that does JIT by any means -- has got to be optional. But I don't think --with-llvm as a compile option is inherently problematic.

Also, I think this is probably a direction we need to go. I've heard at least one and maybe several PGCon presentations about people JITing tuple deformation and getting big speedups, and I'd like to finally hear one from somebody who intends to integrate that into PostgreSQL.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi,

On 2016-12-06 13:56:28 -0500, Tom Lane wrote:
> I guess the $64 question that has to be addressed here is whether we're
> prepared to accept LLVM as a run-time dependency. There are some reasons
> why we might not be:

Indeed. It'd only be a soft dependency obviously.

> * The sheer mass of the dependency. What's the installed footprint of
> LLVM, versus a Postgres server? How hard is it to install from source?

Worked for me first try, but I'm perhaps not the best person to judge. It does take a while to compile though (~20min on my laptop).

> * How will we answer people who say they can't accept having a compiler
> installed on their production boxes for security reasons?

I think they'll just not enable --with-llvm in that case, and get inferior performance. Note that installing llvm does not imply installing a full blown C compiler (although I think that's largely moot, you can get pretty much the same things done with just compiling LLVM IR).

> * Are there any currently-interesting platforms that LLVM doesn't work
> for? (I'm worried about RISC-V as much as legacy systems.)

LLVM itself I don't think is a problem, it seems to target a wide range of platforms. The platforms that don't support JIT compiling might be a bit larger, since that involves more than just generating code.

> I concur with your feeling that hand-rolled JIT is right out. But
> I'm not sure that whatever performance gain we might get in this
> direction is worth the costs.

Well, I'm not impartial, but I don't think we do our users a service by leaving significant speedups untackled, and after spending a *LOT* of time on this, I don't see much other choice than JITing. Note that nearly everything performance sensitive is moving towards doing JITing in some form or another.

Greetings,

Andres Freund
On 2016-12-06 14:04:09 -0500, Robert Haas wrote: > I've heard at least one and maybe several PGCon presentations about > people JITing tuple deformation and getting big speedups, and I'd like > to finally hear one from somebody who intends to integrate that into > PostgreSQL. I certainly want to.
On 2016-12-06 11:10:59 -0800, Andres Freund wrote:
> > * Are there any currently-interesting platforms that LLVM doesn't work
> > for? (I'm worried about RISC-V as much as legacy systems.)
>
> LLVM itself I don't think is a problem, it seems to target a wide range
> of platforms. The platforms that don't support JIT compiling might be a
> bit larger, since that involves more than just generating code.

The OS specific part is handling the executable format. The JIT we'd be using (MCJIT) has support for ELF, Mach-O, and COFF. The architecture specific bits seem to be there for x86, arm (little and big endian), aarch64 (arm 64 bits, be/le again), mips, ppc64. Somebody is working on RISC-V support for llvm (i.e. it appears to be working, but is not merged) - but given it's not integrated into gcc either, I'm not seeing that being an argument.

Greetings,

Andres Freund
On Tue, Dec 6, 2016 at 2:10 PM, Andres Freund <andres@anarazel.de> wrote: >> * The sheer mass of the dependency. What's the installed footprint of >> LLVM, versus a Postgres server? How hard is it to install from source? > > Worked for me first try, but I'm perhaps not the best person to judge. > It does take a while to compile though (~20min on my laptop). Presumably this is going to need to be something that a user can get via yum install <blah> or apt-get install <blah> on common systems. I wonder how feasible it would be to make this a run-time dependency rather than a compile option. That's probably overcomplicating things, but... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2016-12-06 15:13:21 -0500, Robert Haas wrote:
> Presumably this is going to need to be something that a user can get
> via yum install <blah> or apt-get install <blah> on common systems.

Right. apt-get install llvm-dev (or llvm-3.9-dev or such if you want to install a specific version) does the trick here. It's a bit easier to develop with a hand compiled version, because then LLVM adds boatloads of asserts to its IR builder, which catches a fair amount of mistakes. Nothing you'd run in production though (just like you don't use a cassert build...).

> I wonder how feasible it would be to make this a run-time dependency
> rather than a compile option. That's probably overcomplicating
> things, but...

I don't think that's feasible at all unfortunately - the compiler IR (which then is JITed by LLVM) is generated via another C API. We could rebuild that one, but that'd be a lot of work.

Andres
On Tue, Dec 06, 2016 at 01:56:28PM -0500, Tom Lane wrote:
> * The sheer mass of the dependency. What's the installed footprint of
> LLVM, versus a Postgres server? How hard is it to install from source?

As long as it's optional, does this matter? A bigger concern might be interface stability. IIRC the LLVM C/C++ interfaces are not very stable, but bitcode is.

> * How will we answer people who say they can't accept having a compiler
> installed on their production boxes for security reasons?

You don't need the front-ends (e.g., clang) installed in order to JIT.

> * Are there any currently-interesting platforms that LLVM doesn't work
> for? (I'm worried about RISC-V as much as legacy systems.)

The *BSDs support more platforms than LLVM does, that's for sure. (NetBSD supports four more, IIRC, including ia64.) But the patches make LLVM optional anyways, so this should be a non-issue.

> I concur with your feeling that hand-rolled JIT is right out.

Yeah, that way lies maintenance madness.

> But I'm not sure that whatever performance gain we might get in this
> direction is worth the costs.

Byte-/bit-coding query plans then JITting them is very likely to improve performance significantly. Whether you want the maintenance overhead is another story. Sometimes byte-coding + interpretation yields a significant improvement by reducing pressure on the icache and the size of the program to be interpreted. Having the option to JIT or not JIT might be useful.

Nico
Andres Freund <andres@anarazel.de> writes:
> On 2016-12-06 13:56:28 -0500, Tom Lane wrote:
> > I guess the $64 question that has to be addressed here is whether we're
> > prepared to accept LLVM as a run-time dependency. There are some reasons
> > why we might not be:
>
> Indeed. It'd only be a soft dependency obviously.

Oh, so we'd need to maintain both the LLVM and the traditional expression execution code? That seems like a bit of a pain, but maybe we can live with it.

> > * How will we answer people who say they can't accept having a compiler
> > installed on their production boxes for security reasons?
>
> I think they'll just not enable --with-llvm in that case, and get
> inferior performance. Note that installing llvm does not imply
> installing a full blown C compiler (although I think that's largely
> moot, you can get pretty much the same things done with just compiling
> LLVM IR).

I'm not entirely thrilled with the idea of this being a configure-time decision, because that forces packagers to decide for their entire audience whether it's okay to depend on LLVM. That would be an untenable position to put e.g. Red Hat's packagers in: either they screw the people who want performance or they screw the people who want security.

I think it'd be all right if we can build this so that the direct dependency on LLVM is confined to a separately-packageable extension. That way, a packager can produce a core postgresql-server package that does not require LLVM, plus a postgresql-llvm package that does, and the "no compiler please" crowd simply doesn't install the latter package.

The alternative would be to produce two independent builds of the server, which I suppose might be acceptable but it sure seems like a kluge, or at least something that simply wouldn't get done by most vendors.

regards, tom lane
Hi,

On 2016-12-06 14:19:21 -0600, Nico Williams wrote:
> A bigger concern might be interface stability. IIRC the LLVM C/C++
> interfaces are not very stable, but bitcode is.

The C API is a lot more stable than the C++ bit, that's the primary reason I ended up using it, despite the C++ docs being better.

> > I concur with your feeling that hand-rolled JIT is right out. But
>
> Yeah, that way lies maintenance madness.

I'm not quite that sure about that. I had a lot of fun doing some hand-rolled x86 JITing. Not that that's a ward against me being mad. But more seriously: manually doing a JIT gives you a lot faster compilation times, which makes JIT applicable in a lot more situations.

> > I'm not sure that whatever performance gain we might get in this
> > direction is worth the costs.
>
> Byte-/bit-coding query plans then JITting them is very likely to improve
> performance significantly.

Note that what I'm proposing is a far cry away from that - this converts two (performance wise two, size wise one) significant subsystems, but far from all the executor, to be JITable. I think there's some more low hanging fruit (particularly aggregate transition functions), but converting everything seems to hit the wrong spot in the benefit/effort/maintainability triangle.

- Andres
On Tue, Dec 06, 2016 at 12:27:51PM -0800, Andres Freund wrote: > On 2016-12-06 14:19:21 -0600, Nico Williams wrote: > > A bigger concern might be interface stability. IIRC the LLVM C/C++ > > interfaces are not very stable, but bitcode is. > > The C API is a lot more stable than the C++ bit, that's the primary > reason I ended up using it, despite the C++ docs being better. Ah. > > > I concur with your feeling that hand-rolled JIT is right out. But > > > > Yeah, that way lies maintenance madness. > > I'm not quite that sure about that. I had a lot of fun doing some > hand-rolled x86 JITing. Not that is a ward against me being mad. But > more seriously: Manually doing a JIT gives you a lot faster compilation > times, which makes JIT applicable in a lot more situations. What I meant is that each time there are new ISA extensions, or differences in how relevant/significant different implementations of the same ISA implement certain instructions, and/or every time you want to add a new architecture... someone has to do a lot of very low-level work. > > > I'm not sure that whatever performance gain we might get in this > > > direction is worth the costs. > > > > Byte-/bit-coding query plans then JITting them is very likely to improve > > performance significantly. > > Note that what I'm proposing is a far cry away from that - this converts > two (peformance wise two, size wise one) significant subsystems, but far > from all the executors to be JIT able. I think there's some more low Yes, I know. > hanging fruits (particularly aggregate transition functions), but > converting everything seems to hit the wrong spot in the > benefit/effort/maintainability triangle. Maybe? At least with the infrastructure in place for it someone might try it and see. Nico --
On 2016-12-06 15:25:44 -0500, Tom Lane wrote:
> Oh, so we'd need to maintain both the LLVM and the traditional expression
> execution code? That seems like a bit of a pain, but maybe we can live
> with it.

Yea, that's why I converted the "traditional" expression evaluation into a different format first - that way the duplication is a lot lower. E.g. scalar var eval looks like

    EEO_CASE(EEO_INNER_VAR):
    {
        int         attnum = op->d.var.attnum;

        Assert(op->d.var.attnum >= 0);
        *op->resnull = innerslot->tts_isnull[attnum];
        *op->resvalue = innerslot->tts_values[attnum];
        EEO_DISPATCH(op);
    }

in normal evaluation, and like

    case EEO_INNER_VAR:
    {
        LLVMValueRef value,
                    isnull;
        LLVMValueRef v_attnum;

        v_attnum = LLVMConstInt(LLVMInt32Type(), op->d.var.attnum, false);
        value = LLVMBuildLoad(builder,
                              LLVMBuildGEP(builder, v_innervalues,
                                           &v_attnum, 1, ""), "");
        isnull = LLVMBuildLoad(builder,
                               LLVMBuildGEP(builder, v_innernulls,
                                            &v_attnum, 1, ""), "");
        LLVMBuildStore(builder, value, v_resvaluep);
        LLVMBuildStore(builder, isnull, v_resnullp);
        LLVMBuildBr(builder, opblocks[i + 1]);
        break;
    }

for JITed evaluation.

> I'm not entirely thrilled with the idea of this being a configure-time
> decision, because that forces packagers to decide for their entire
> audience whether it's okay to depend on LLVM. That would be an untenable
> position to put e.g. Red Hat's packagers in: either they screw the people
> who want performance or they screw the people who want security.

Hm. I've a bit of a hard time buying the security argument here. Having LLVM (not clang!) installed doesn't really change the picture that much. In either case you can install binaries, and you're very likely already using some program that does JIT internally. And postgres itself gives you plenty of ways to execute arbitrary code as superuser. The argument for not installing a C compiler seems to be that it makes it less convenient to build an executable. I doubt that having a C(++) library for code generation is convenient enough to change the picture there.

> I think it'd be all right if we can build this so that the direct
> dependency on LLVM is confined to a separately-packageable extension.
> That way, a packager can produce a core postgresql-server package
> that does not require LLVM, plus a postgresql-llvm package that does,
> and the "no compiler please" crowd simply doesn't install the latter
> package.

That should be possible, but I'm not sure it's worth the effort. The JIT infrastructure will need resowner integration and such. We can obviously split things so that part is independent of LLVM, but I'm unconvinced that the benefit is large enough.

> The alternative would be to produce two independent builds of the
> server, which I suppose might be acceptable but it sure seems like
> a kluge, or at least something that simply wouldn't get done by
> most vendors.

Hm. We could make that a make target ourselves ;)

Regards,

Andres
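For readers less familiar with the LLVM C API: the JITed snippet above runs inside a per-expression code generation loop, with the builder positioned in one basic block per step. Roughly this kind of setup can be assumed around it (a sketch; the function name and signature here are invented, not the patch's actual code):

    #include <llvm-c/Core.h>

    static LLVMValueRef
    begin_evalexpr(LLVMModuleRef mod, LLVMBuilderRef builder)
    {
        /* Datum evalexpr(Datum *values, char *nulls) -- invented signature */
        LLVMTypeRef argtypes[2];
        LLVMTypeRef fntype;
        LLVMValueRef fn;
        LLVMBasicBlockRef entry;

        argtypes[0] = LLVMPointerType(LLVMInt64Type(), 0);
        argtypes[1] = LLVMPointerType(LLVMInt8Type(), 0);
        fntype = LLVMFunctionType(LLVMInt64Type(), argtypes, 2, 0);
        fn = LLVMAddFunction(mod, "evalexpr", fntype);

        /* each expression step gets its own block; builder emits into it */
        entry = LLVMAppendBasicBlock(fn, "entry");
        LLVMPositionBuilderAtEnd(builder, entry);
        return fn;
    }

Note how the LLVMBuildBr(builder, opblocks[i + 1]) at the end of the JITed case replaces the interpreter's EEO_DISPATCH(): instead of an indirect jump through a table, each step ends in a direct branch to the next step's block, which LLVM can then optimize across.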
On 2016-12-06 14:35:43 -0600, Nico Williams wrote: > On Tue, Dec 06, 2016 at 12:27:51PM -0800, Andres Freund wrote: > > On 2016-12-06 14:19:21 -0600, Nico Williams wrote: > > > > I concur with your feeling that hand-rolled JIT is right out. But > > > > > > Yeah, that way lies maintenance madness. > > > > I'm not quite that sure about that. I had a lot of fun doing some > > hand-rolled x86 JITing. Not that is a ward against me being mad. But > > more seriously: Manually doing a JIT gives you a lot faster compilation > > times, which makes JIT applicable in a lot more situations. > > What I meant is that each time there are new ISA extensions, or > differences in how relevant/significant different implementations of the > same ISA implement certain instructions, and/or every time you want to > add a new architecture... someone has to do a lot of very low-level > work. Yea, that's why I didn't pursue this path further. I *personally* think it'd be perfectly fine to only support JITing on linux x86_64 and aarch64 for now. And those I'd be willing to work on. But since I know that's not project policy... - Andres
On Tue, Dec 06, 2016 at 12:36:41PM -0800, Andres Freund wrote: > On 2016-12-06 15:25:44 -0500, Tom Lane wrote: > > I'm not entirely thrilled with the idea of this being a configure-time > > decision, because that forces packagers to decide for their entire > > audience whether it's okay to depend on LLVM. That would be an untenable > > position to put e.g. Red Hat's packagers in: either they screw the people > > who want performance or they screw the people who want security. There's no security issue. The dependency is on LLVM libraries, not LLVM front-ends (e.g., clang(1)). I don't think there's a real issue as to distros/packagers/OS vendors. They already have to package LLVM, and they already package LLVM libraries separately from LLVM front-ends. > The argument for not install a c compiler seems to be that it makes it > less convenient to build an executable. I doubt that having a C(++) > library for code generation is convenient enough to change the picture > there. The security argument goes back to the days of the Morris worm, which depended on having developer tools (specifically in that case, ld(1), the link-editor). But JIT via LLVM won't give hackers a way to generate or link arbitrary object code. Nico --
On Mon, Dec 5, 2016 at 7:49 PM, Andres Freund <andres@anarazel.de> wrote:
> I tried to address 2) by changing the C implementation. That brings some
> measurable speedups, but it's not huge. A bigger speedup is making
> slot_getattr, slot_getsomeattrs, slot_getallattrs very trivial wrappers;
> but it's still not huge. Finally I turned to just-in-time (JIT)
> compiling the code for tuple deforming. That doesn't save the cost of
> 1), but it gets rid of most of 2) (from ~15% to ~3% in TPCH-Q01). The
> first part is done in 0008, the JITing in 0012.

A more complete motivating example would be nice. For example, it would be nice to see the overall speedup for some particular TPC-H query.

-- Peter Geoghegan
On 2016-12-06 13:27:14 -0800, Peter Geoghegan wrote:
> A more complete motivating example would be nice. For example, it
> would be nice to see the overall speedup for some particular TPC-H
> query.

Well, it's a bit WIP-y for that - not all TPCH queries run JITed yet, as I've not done that for enough expression types... And you run quickly into other bottlenecks. But here we go for TPCH (scale 10) Q01:

master:
Time: 33885.381 ms

profile:
  16.29%  postgres  postgres  [.] slot_getattr
  12.85%  postgres  postgres  [.] ExecMakeFunctionResultNoSets
  10.85%  postgres  postgres  [.] advance_aggregates
   6.91%  postgres  postgres  [.] slot_deform_tuple
   6.70%  postgres  postgres  [.] advance_transition_function
   4.59%  postgres  postgres  [.] ExecProject
   4.25%  postgres  postgres  [.] float8_accum
   3.69%  postgres  postgres  [.] tuplehash_insert
   2.39%  postgres  postgres  [.] float8pl
   2.20%  postgres  postgres  [.] bpchareq
   2.03%  postgres  postgres  [.] check_stack_depth

(note that all expression evaluated things are distributed among many functions)

dev (no jiting):
Time: 30343.532 ms

profile:
  16.57%  postgres  postgres  [.] slot_deform_tuple
  13.39%  postgres  postgres  [.] ExecEvalExpr
   8.64%  postgres  postgres  [.] advance_aggregates
   8.58%  postgres  postgres  [.] advance_transition_function
   5.83%  postgres  postgres  [.] float8_accum
   5.14%  postgres  postgres  [.] tuplehash_insert
   3.89%  postgres  postgres  [.] float8pl
   3.60%  postgres  postgres  [.] slot_getattr
   2.66%  postgres  postgres  [.] bpchareq
   2.56%  postgres  postgres  [.] heap_getnext

dev (jiting):
SET jit_tuple_deforming = on;
SET jit_expressions = true;

Time: 24439.803 ms

profile:
  11.11%  postgres  postgres        [.] slot_deform_tuple
  10.87%  postgres  postgres        [.] advance_aggregates
   9.74%  postgres  postgres        [.] advance_transition_function
   6.53%  postgres  postgres        [.] float8_accum
   5.25%  postgres  postgres        [.] tuplehash_insert
   4.31%  postgres  perf-10698.map  [.] deform0
   3.68%  postgres  perf-10698.map  [.] evalexpr6
   3.53%  postgres  postgres        [.] slot_getattr
   3.41%  postgres  postgres        [.] float8pl
   2.84%  postgres  postgres        [.] bpchareq

(note how expression eval went from 13.39% to roughly 4%)

The slot_deform_tuple cost here is primarily cache misses. If you do the "memory order" iteration, it drops significantly. The JIT generated code still leaves a lot on the table, i.e. this is definitely not the best we can do. We also deform half the tuple twice, because I've not yet added support for starting to deform in the middle of a tuple.

Independent of new expression evaluation and/or JITing, if you make advance_aggregates and advance_transition_function inline functions (or you do profiling accounting for children), you'll notice that ExecAgg() + advance_aggregates + advance_transition_function themselves take up about 20% cpu-time. That's *not* including the hashtable management, the actual transition functions, and such themselves.
If you have queries where tuple deforming is a bigger proportion of the load, or where expression evaluation (including projection) is a larger part (any NULLs, e.g.), you can get a lot bigger wins, even without actually optimizing the generated code (which I've not yet done).

Just btw: float8_accum really should use an internal aggregation type instead of using a postgres array...

Andres
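For context on that last remark: float8_accum keeps its transition state - row count, running sum, running sum of squares - in a float8[] array datum, which means array header overhead on every transition. A by-reference "internal" state would just be a plain struct, roughly like this (a sketch of the idea, not actual backend code):

    /* hypothetical by-reference transition state for float8_accum */
    typedef struct Float8AccumState
    {
        double  N;      /* number of input rows so far */
        double  sumX;   /* running sum of inputs */
        double  sumX2;  /* running sum of squares of inputs */
    } Float8AccumState;

The transition function would then update three doubles in place instead of constructing or modifying an array datum each call.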
On 7 December 2016 at 04:13, Robert Haas <robertmhaas@gmail.com> wrote: > I wonder how feasible it would be to make this a run-time dependency > rather than a compile option. Or something that's compiled with the server, but produces a separate .so that's the only thing that links to LLVM. So packagers can avoid a dependency on LLVM for postgres. I suspect it wouldn't be worth the complexity, the added indirection necessary, etc. If you're using packages then pulling in LLVM isn't a big deal. If you're not, then don't use --with-llvm . -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 7 December 2016 at 14:39, Craig Ringer <craig@2ndquadrant.com> wrote: > On 7 December 2016 at 04:13, Robert Haas <robertmhaas@gmail.com> wrote: > >> I wonder how feasible it would be to make this a run-time dependency >> rather than a compile option. > > Or something that's compiled with the server, but produces a separate > .so that's the only thing that links to LLVM. So packagers can avoid a > dependency on LLVM for postgres. Ahem, next time I'll finish the thread first. Nevermind. -- Craig Ringer http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Dec 6, 2016 at 9:19 AM, Andres Freund <andres@anarazel.de> wrote:
> 0009 WIP: Add minimal keytest implementation.
>
> More or less experimental patch that tries to compile simple
> expressions of the form OpExpr(ScalarVar, Const) into a single
> expression evaluation step. The benefits probably aren't big enough
> if we do end up doing JITing of expressions.

Seems like we are trying to achieve the same thing with the 'heap scan key push down' patch [1] as well. But I think with this patch you are covering OpExpr(ScalarVar, Const) for all the cases, wherein with [1] we are currently only doing it for seqscan, and we are trying to make that generic for other nodes as well.

So do you see any advantage of continuing [1]? Is there something extra we can achieve with [1] that we can not get with this patch?

[1] https://www.postgresql.org/message-id/CAFiTN-takT6Z4s3tGDwyC9bhYf%2B1gumpvW5bo_fpeNUy%2BrL-kg%40mail.gmail.com

-- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
Andres,
> dev (no jiting):
> Time: 30343.532 ms
> dev (jiting):
> SET jit_tuple_deforming = on;
> SET jit_expressions = true;
>
> Time: 24439.803 ms
FYI, ~20% improvement for TPCH Q1 is consistent with what we find when we only jit expression.
Cheers,
-cktan
Hi,

On 2016-12-12 18:11:13 -0800, CK Tan wrote:
> > dev (no jiting):
> > Time: 30343.532 ms
>
> > dev (jiting):
> > SET jit_tuple_deforming = on;
> > SET jit_expressions = true;
> >
> > Time: 24439.803 ms
>
> FYI, ~20% improvement for TPCH Q1 is consistent with what we find when we
> only jit expression.

For Q1 I think the bigger win is JITing the transition function invocation in advance_aggregates/transition_function - that's IIRC where the biggest bottleneck lies. If you have any details about your JITing experience that you're willing to talk about ...

Regards,

Andres
On Mon, Dec 12, 2016 at 6:14 PM, Andres Freund <andres@anarazel.de> wrote:
>
>
> For Q1 I think the bigger win is JITing the transition function
> invocation in advance_aggregates/transition_function - that's IIRC where
> the biggest bottleneck lies.
>
Yeah, we bundle the agg core into our expr work... no point otherwise since we do
it for OLAP.
As for experience, I think you have found out for yourself. There is a lot that
can be done and heuristics are involved in many places to decide whether
to jit fully, partially, or not at all. But it looks like you have a solid basis now
to proceed and explore the beyond :-)
Send me private email if you have a particular question.
Regards,
-cktan
On Tue, Dec 6, 2016 at 11:10:59AM -0800, Andres Freund wrote:
> > I concur with your feeling that hand-rolled JIT is right out. But
> > I'm not sure that whatever performance gain we might get in this
> > direction is worth the costs.
>
> Well, I'm not impartial, but I don't think we do our users a service by
> leaving significant speedups untackled, and after spending a *LOT* of
> time on this, I don't see much other choice than JITing. Note that
> nearly everything performance sensitive is moving towards doing JITing
> in some form or another.

Agreed, we don't really have a choice. After all the optimizations we have done to so many subsystems, our executor is relatively slow and is a major drag on our system, specifically for long-running queries. The base problem with the executor is the state machines at so many levels, and we really can't optimize that while keeping a reasonable maintenance burden. This is where JIT and LLVM help.

I outlined two external projects that were researching this in this blog entry:

http://momjian.us/main/blogs/pgblog/2016.html#April_1_2016

I am excited to now be seeing WIP code.

-- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
Hi,

Here's an updated version of this patchset (now that SRFs are implemented "properly"). This is a large patchset, with later patches being more experimental than earlier ones.

0001: Add expression dependencies on composite type / whole row components.

This patch later allows removing some execution time checks (in ExecEvalScalarVar / ExecEvalFieldSelect), which'd be inconvenient to do when JITed.

0002: Make get_last_attnums more generic.

What it says.

0003: autoconf test for computed goto support.

The faster expression evaluation implementation uses computed gotos when available. So we need to check whether the current compiler supports those.

0004: WIP: Faster expression processing and targetlist projection.

The most important patch of the series. Changes expression evaluation from being tree-walk based, as done in execQual.c, into being a series of instructions. Then adds an opcode dispatch based interpreter for that mini language. That allows later adding different implementations. Projection is now a special case of expression evaluation.

0005: WIP: Add configure infrastructure to enable LLVM.

0006: WIP: Beginning of a LLVM JIT infrastructure.

Needs work.

0007: WIP: JITed expression evaluation.

This finally performs runtime compilation of expressions.

I think up to 0004 things are getting pretty close to being ready. There are some minor FIXMEs that need to be addressed (primarily around CaseTestExpr/CoerceToDomainValue when invoked outside of plans), and some more comment work.

The biggest remaining piece of work in the whole series is 0006. Right now compiled expressions can be leaked, which is obviously not ok. So I need to write ResourceOwner integration.

0007 covers nearly all types of expressions/queries (function usage isn't supported right now, and there's two missing calls to MakeExpandedObjectReadOnly). There's a lot more that could be done for efficiency (generating smarter LLVM IR basically), but we already see quite significant performance wins.

I have later patches that
- centralize tuple deforming logic into slot_deform_logic from slot_getattr, slot_getsomeattrs, slot_getallattrs and such
- add JITing of tuple deforming
- add JITing of aggregation transitions; this works by converting transition functions into expression opcodes, and then adding JIT support for those new instructions (EEO_AGG_FILTER / STRICT_INPUT_CHECK / INIT_TRANS / STRICT_TRANS_CHECK / PLAIN_TRANS / ORDERED_TRANS_DATUM / ORDERED_TRANS_TUPLE)

but the above 7 patches already are large enough imo.

For now, 0004 is I think the patch that really could use a look by somebody but me - it's quite a radical departure from how things currently look in master, and it's a fairly large patch. I don't really see a good way to break it down into useful parts :( (the projection changes are the only thing that I can see usefully being done separately).

Regards,

Andres
Attachment
- 0001-Add-expression-dependencies-on-composite-type-whole-.patch.gz
- 0002-Make-get_last_attnums-more-generic.patch.gz
- 0003-Add-autoconf-test-for-computed-goto-support.patch.gz
- 0004-WIP-Faster-expression-processing-and-targetlist-proj.patch.gz
- 0005-WIP-Add-configure-infrastructure-to-enable-LLVM.patch.gz
- 0006-WIP-Beginning-of-a-LLVM-JIT-infrastructure.patch.gz
- 0007-WIP-JITed-expression-evaluation.patch.gz
On 12/06/16 21:40, Andres Freund wrote:
> On 2016-12-06 14:35:43 -0600, Nico Williams wrote:
> > What I meant is that each time there are new ISA extensions, or
> > differences in how relevant/significant different implementations of the
> > same ISA implement certain instructions, and/or every time you want to
> > add a new architecture... someone has to do a lot of very low-level
> > work.
>
> Yea, that's why I didn't pursue this path further. I *personally* think
> it'd be perfectly fine to only support JITing on linux x86_64 and
> aarch64 for now. And those I'd be willing to work on. But since I know
> that's not project policy...

Well, if this thread of thought about hand-crafted JIT should be taken up again by someone at some point in time, I guess it could be possible to reuse tools that are already out there, such as "DynASM" ( http://luajit.org/dynasm_features.html ) from the LuaJIT project (currently offers x86-64, x86-32, ARM-32, PPC-32, and MIPS-32). Maybe one could even recycle parts of the LuaJIT project itself ( http://luajit.org/luajit.html , http://article.gmane.org/gmane.comp.lang.lua.general/58908 ).

-- Markus Nullmeier http://www.g-vo.org German Astrophysical Virtual Observatory (GAVO)
Hi,

On 2017-02-10 18:18:13 +0100, Markus Nullmeier wrote:
> Well, if this thread of thought about hand-crafted JIT should be taken
> up again by someone at some point in time, I guess it could be possible
> to reuse tools that are already out there, such as "DynASM"
> ( http://luajit.org/dynasm_features.html ) from the LuaJIT project
> (currently offers x86-64, x86-32, ARM-32, PPC-32, and MIPS-32).
> Maybe one could even recycle parts of the LuaJIT project itself.

FWIW, I'd looked at dynasm/luajit. One big reason to go for LLVM is that it has nearly all the infrastructure to make backend functions/operators inlineable. Especially for some of the arithmetic operations and such, that'd be quite useful performance-wise. With LLVM you can just use clang on C to generate the IR, do some work to boil down the IR modules to the relevant functions (i.e. remove non sql-callable functions), for which LLVM has infrastructure, and then inline the functions that way. That's a lot harder to do with nearly everything else (save gcc's jit library, but the licensing and stability situation makes that unattractive).

Greetings,

Andres Freund
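The inlining flow sketched in that reply - clang emits bitcode for backend functions, which later gets linked into the JIT module - could look roughly like this with the LLVM C API (an assumed sketch; the actual patchset may well use the C++ API or different entry points for this part):

    #include <llvm-c/Core.h>
    #include <llvm-c/IRReader.h>
    #include <llvm-c/Linker.h>

    static int
    link_backend_bitcode(LLVMContextRef ctx, LLVMModuleRef jit_mod,
                         const char *bitcode_path)
    {
        LLVMMemoryBufferRef buf;
        LLVMModuleRef backend_mod;
        char       *err = NULL;

        /* load the clang-emitted .bc file containing backend functions */
        if (LLVMCreateMemoryBufferWithContentsOfFile(bitcode_path, &buf, &err))
            return -1;          /* could not read the file; err leaked here */
        if (LLVMParseIRInContext(ctx, buf, &backend_mod, &err))
            return -1;          /* not valid bitcode/IR */

        /* merge into the JIT module, making callees visible for inlining */
        return LLVMLinkModules2(jit_mod, backend_mod) ? -1 : 0;
    }

Once the definitions live in the same module, LLVM's normal inlining passes can fold e.g. small arithmetic operators directly into the generated expression function.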
Hi,

Attached is an updated version of the patchset, but more importantly some benchmark numbers.

I ran tpch scale 5, on my laptop with
-c shared_buffers=20GB                 # all data fits into memory
-c max_parallel_workers_per_gather=0   # reduce variability
-c work_mem=4GB                        # other suggestions welcome
-c huge_pages=on                       # reduces variability

My benchmarking script first prewarms the whole database, then runs the tpch queries in sequence, repeated three times, and compares the shortest execution time:

master q01 min: 16176.843  dev-llvm-off min: 13153.179  [diff -22.99]
master q02 min:  1280.691  dev-llvm-off min:  1263.032  [diff  -1.40]
master q03 min:  7330.737  dev-llvm-off min:  7144.982  [diff  -2.60]
master q04 min:  1014.347  dev-llvm-off min:   991.008  [diff  -2.36]
master q05 min:  5490.103  dev-llvm-off min:  5439.878  [diff  -0.92]
master q06 min:  1916.45   dev-llvm-off min:  1818.839  [diff  -5.37]
master q07 min:  5282.129  dev-llvm-off min:  5222.879  [diff  -1.13]
master q08 min:  1655.824  dev-llvm-off min:  1532.1    [diff  -8.08]
master q09 min:  7009.372  dev-llvm-off min:  6724.515  [diff  -4.24]
master q10 min:  6017.01   dev-llvm-off min:  5848.337  [diff  -2.88]
master q11 min:   316.724  dev-llvm-off min:   292.51   [diff  -8.28]
master q12 min:  4829.502  dev-llvm-off min:  4698.14   [diff  -2.80]
master q13 min:  8679.991  dev-llvm-off min:  8427.614  [diff  -2.99]
master q14 min:   814.109  dev-llvm-off min:   774.805  [diff  -5.07]
master q15 min:  1957.248  dev-llvm-off min:  1841.377  [diff  -6.29]
master q16 min:  1976.544  dev-llvm-off min:  1936.932  [diff  -2.05]
master q17 min:   559.664  dev-llvm-off min:   446.199  [diff -25.43]
master q18 min: 16565.722  dev-llvm-off min: 15475.372  [diff  -7.05]
master q19 min:   385.222  dev-llvm-off min:   310.398  [diff -24.11]
master q20 min:  1717.015  dev-llvm-off min:  1389.064  [diff -23.61]
master q22 min:   753.017  dev-llvm-off min:   637.152  [diff -18.18]

(note that there's a fair amount of per-run variance, I've seen one or two of the small differences go the other way in previous runs)

It's clearly visible that the performance gain heavily depends on the type of query. Which makes sense - expression evaluation isn't a bottleneck everywhere. The actual gains in expression evaluation are larger than the maximum here, but bottlenecks shift.

Besides cleanups, the important changes in this version of the patches are:

- I've added a fastpath for the interpretation of very simple expressions (non-sysvar, single column Vars and Consts); before that, performance regressed slightly for cases that evaluated a *lot* of Vars/Consts. The only case where I could actually reproduce that is in large hash-joins where the to-be-hashed value is extracted on its own.

- I moved the invocation of expression evaluation back to a callback. That's better for predicting branches both with JITing and when using fastpath functions.

- removed unused EXEC_EVALDEBUG code

- Andres
Attachment
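The fastpath mentioned in that mail amounts to recognizing trivial step programs at initialization time and installing a specialized callback instead of entering the dispatch loop at all. A hypothetical sketch of the idea (the struct and function names here are invented):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uintptr_t Datum;        /* stand-in for PostgreSQL's Datum */

    typedef struct ExprStateSketch ExprStateSketch;
    typedef Datum (*ExprEvalFunc) (ExprStateSketch *state, bool *isnull);

    struct ExprStateSketch
    {
        ExprEvalFunc evalfunc;      /* invoked for every evaluation */
        int          attnum;        /* for the single-Var fastpath */
        Datum       *inner_values;
        bool        *inner_isnull;
    };

    /* fastpath: the whole "program" is a single inner Var fetch */
    static Datum
    exec_just_inner_var(ExprStateSketch *state, bool *isnull)
    {
        *isnull = state->inner_isnull[state->attnum];
        return state->inner_values[state->attnum];
    }

At initialization time, a program consisting of one Var step would set state->evalfunc = exec_just_inner_var, so per-row evaluation is one call plus two array loads - which also matches the mail's remark that moving invocation back to a callback is friendlier to branch prediction.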
On Mon, Mar 6, 2017 at 10:01 PM, Andres Freund <andres@anarazel.de> wrote: > My benchmarking script first prewarms the whole database, then runs the > tpch queries in sequence, repeated three times, and compares the shortes > execution time: Those numbers are stupendous. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
I looked over 0001-Add-expression-dependencies-on-composite-type-whole-.patch. That seems useful independent of the things you have following.

Even though it is presented more as a preparatory change, it appears to have noticeable user-facing impact, which we should analyze. For example, in the regression test, the command

    alter table tt14t drop column f3;

is now rejected, rightly, but the command

    alter table tt14t drop column f3 cascade;

ends up dropping the whole view, which might be a bit surprising. We don't support dropping columns from views, so there might be no alternative, but I wonder what else is changing because of this. I think that might deserve a bit more analysis and perhaps some test cases.

    --- a/src/backend/catalog/dependency.c
    +++ b/src/backend/catalog/dependency.c
    @@ -206,6 +206,9 @@ static bool object_address_present_add_flags(const ObjectAddress *object,
     static bool stack_address_present_add_flags(const ObjectAddress *object,
                                                 int flags,
                                                 ObjectAddressStack *stack);
    +static void add_type_addresses(Oid typeid, bool include_components, ObjectAddresses *addrs);
    +static void add_type_component_addresses(Oid typeid, ObjectAddresses *addrs);
    +static void add_class_component_addresses(Oid relid, ObjectAddresses *addrs);
     static void DeleteInitPrivs(const ObjectAddress *object);

For some reason, the function add_object_address() is singular, and these new functions are plural. Clearly, they take a plural argument, so your version seems more correct, but I wonder if we should keep that consistent.

    + * e.g. not to the right thing for column-less tables. Note that

Small typo there. (to -> do)

    @@ -1639,6 +1641,9 @@ find_expr_references_walker(Node *node,
                add_object_address(OCLASS_PROC, funcexpr->funcid, 0,
                                   context->addrs);
    +           /* dependency on type itself already exists via function */
    +           add_type_component_addresses(funcexpr->funcresulttype,
    +                                        context->addrs);
    +
                /* fall through to examine arguments */
            }
            else if (IsA(node, OpExpr))

Why shouldn't the function itself also depend on the components of its return type? How are argument types handled?

-- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On 2017-03-07 18:46:31 -0500, Peter Eisentraut wrote:
> I looked over
> 0001-Add-expression-dependencies-on-composite-type-whole-.patch. That
> seems useful independent of the things you have following.
>
> Even though it is presented more as a preparatory change, it appears to
> have noticeable user-facing impact, which we should analyze.

Indeed.

> For example, in the regression test, the command
>
>     alter table tt14t drop column f3;
>
> is now rejected, rightly, but the command
>
>     alter table tt14t drop column f3 cascade;
>
> ends up dropping the whole view, which might be a bit surprising. We
> don't support dropping columns from views, so there might be no
> alternative

It seems strictly better than silently breaking the view.

> , but I wonder what else is changing because of this. I
> think that might deserve a bit more analysis and perhaps some test cases.

There are already some tests. More is probably good - if you have specific cases in mind...

> For some reason, the function add_object_address() is singular, and
> these new functions are plural. Clearly, they take a plural argument, so
> your version seems more correct, but I wonder if we should keep that
> consistent.

I named them plural, because add_object_address only adds a single new entry, but add_type_addresses etc. potentially add multiple ones. So I think plural/singular as used here is actually consistent?

> Small typo there. (to -> do)

Oops.

> Why shouldn't the function itself also depend on the components of its
> return type?

Because that'd make it impossible to change the return type components - if the return type is baked into the view that's necessary, but for a "freestanding function" it's not. If you e.g. have a function that just returns a table's rows, it'd certainly be annoying if that'd prevent you from dropping columns.

> How are argument types handled?

We fall through to

    return expression_tree_walker(node, find_expr_references_walker,
                                  (void *) context);

which'll add a dependency if necessary. If it's e.g. a table column, function return type or such we'll add a dependency there.

Greetings,

Andres Freund
On 3/7/17 19:14, Andres Freund wrote: >> Why shouldn't the function itself also depend on the components of its >> return type? > Because that'd make it impossible to change the return type components - > if the return type is baked into the view that's necessary, but for a > "freestanding function" it's not. If you e.g. have a function that just > returns a table's rows, it'd certainly be annoying if that'd prevent you > from dropping columns. I think functions breaking when the return type is fiddled with is actually a not-uncommon problem in practice. It would be nice if that could be addressed. Not necessarily in this patch, but I would like to keep the options open. Comments from others would be welcome on this. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I have also looked at the 0002 and 0003 patches, and they seem OK, but they are obviously not of much use without 0004. What is your ambition for getting 0004 reviewed and committed for PG10? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2017-03-10 08:51:25 -0500, Peter Eisentraut wrote:
> On 3/7/17 19:14, Andres Freund wrote:
>>> Why shouldn't the function itself also depend on the components of its
>>> return type?
>> Because that'd make it impossible to change the return type components -
>> if the return type is baked into the view that's necessary, but for a
>> "freestanding function" it's not. If you e.g. have a function that just
>> returns a table's rows, it'd certainly be annoying if that'd prevent you
>> from dropping columns.
>
> I think functions breaking when the return type is fiddled with is
> actually a not-uncommon problem in practice. It would be nice if that
> could be addressed.

What problem are you thinking of exactly, and what'd be the solution?
I'd guess that people wouldn't like being unable to change columns in a
table if some function returned the type.

One rather extreme way would be to have functions return a "different"
type, that's initially identical to the table/composite type. If the
composite type then changes, we'd transparently map between the actual
return type, and the one callers expect. But that'd be a fair bit of
effort, and presumably also quite confusing for users that might e.g.
expect a new column to show up.

> Not necessarily in this patch, but I would like to keep the options
> open.

Yea, seems worth thinking about. I don't think the patch closes down
options?

Greetings,

Andres Freund
On 2017-03-10 09:00:14 -0500, Peter Eisentraut wrote:
> I have also looked at the 0002 and 0003 patches, and they seem OK, but
> they are obviously not of much use without 0004. What is your ambition
> for getting 0004 reviewed and committed for PG10?

I'd really like to get it in. The performance improvements on its own
are significant, and it provides the basis for significantly larger
improvements again (JIT) - those followup improvements are large
patches again though, so I'd rather not do all of that next cycle.

My next step (over the weekend) is to add tests to execQual.c to get it
a good chunk closer to 100% test coverage, and then do the same for the
new implementation (there's probably very little additional tests
needed after the conversion). Given all tests pass before/after, and
there's a lot of them, I think we can have a reasonable confidence of a
low bug density.

Large parts of the changes are fairly mechanical, the interesting bits
probably are:

- there's a lot fewer "full blown" node types for expression evaluation
  anymore. In most cases there's now only a top-level ExprState node
  that's nodeTag()-able. The actual information to execute an
  instruction now is in ExprState->steps[i]
- because of the previous point, it now isn't legal anymore to call
  ExecInitExpr() on a list of expressions - ExecInitExprList() is a
  convenience wrapper around manually iterating over the list (gets rid
  of ugly casts, too)
- Because they behave differently on an actual expression evaluation
  basis, quals / check constraints now have to be initialized with
  ExecInitQual()/ExecInitCheck(). That way the shortcut behaviour and
  such can be retained, while also being faster.
- PlanState->targetlist is gone. Because expressions aren't initialized
  on their own anymore, but as part of ExecProject, there's no need for
  it.
- The separation between ExecInitExprRec() and
  ExecEvalExpr()/ExecInterpExpr() is different - we try to do as much
  as possible during initialization.
- I retained some ugly hackering around
  innermost_caseval|null/domainval|null - I had started down a path of
  removing the use of those in random parts of the system, but I ended
  up not going for it.

Greetings,

Andres Freund
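For readers skimming the thread, the structures being described have roughly this shape - a sketch assembled from names used across these messages (EEOP_* opcodes, resvalue/resnull, the per-opcode union 'd', intptr_t opcodes); the exact layout is an assumption, not the committed patch:

	/* Sketch only - modeled on names appearing in this thread. */
	#include "postgres.h"
	#include "nodes/nodes.h"

	typedef enum ExprEvalOp
	{
		EEOP_INNER_VAR,				/* fetch an attribute of the inner tuple */
		EEOP_CONST,					/* load a constant */
		EEOP_DONE					/* expression fully evaluated */
	} ExprEvalOp;

	typedef struct ExprEvalStep
	{
		intptr_t	opcode;			/* enum value, or jump-threaded label address */
		Datum	   *resvalue;		/* where to store this step's result */
		bool	   *resnull;
		union						/* per-opcode data, kept small */
		{
			struct { int attnum; } var;
			struct { Datum value; bool isnull; } constval;
		}			d;
	} ExprEvalStep;

	typedef struct ExprState
	{
		NodeTag		tag;			/* the one remaining nodeTag()-able part */
		ExprEvalStep *steps;		/* linear program, ends with EEOP_DONE */
		int			steps_len;
	} ExprState;

The point of the linear steps[] array, as opposed to a tree of nodes, is that evaluation becomes a tight loop over a flat program rather than a recursive walk with indirect calls at every node.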
On 3/10/17 14:24, Andres Freund wrote:
> What problem are you thinking of exactly, and what'd be the solution?
> I'd guess that people wouldn't like being unable to change columns in a
> table if some function returned the type.

Well, that's pretty much the problem. For example:

create type t1 as (a int, b int);

create function f1() returns setof t1 language plpgsql
as $$ begin return query values (1, 2); end $$;

alter type t1 drop attribute b;

select * from f1();  -- fail

The dependency system should arguably guard against that. Yes, it
would make some things more cumbersome, but, well, it already does
that.

>> Not necessarily in this patch, but I would like to keep the options
>> open.
>
> Yea, seems worth thinking about. I don't think the patch closes down
> options?

nope

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 3/10/17 14:40, Andres Freund wrote: > I'd really like to get it in. The performance improvements on its own > are significant, and it provides the basis for significantly larger > improvements again (JIT) - those followup improvements are large patches > again though, so I'd rather not do all of that next cycle. > > My next step (over the weekend) is to add tests to execQual.c to get it > a good chunk closer to 100% test coverage, and then do the same for the > new implementation (there's probably very little additional tests needed > after the conversion). Given all tests pass before/after, and there's a > lot of them, I think we can have a reasonable confidence of a low bug > density. That seems like a plan, but now would probably be a good time for some other hackers to take note of this proposal. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Mar 10, 2017 at 07:15:59PM -0500, Peter Eisentraut wrote: > On 3/10/17 14:40, Andres Freund wrote: > > I'd really like to get it in. The performance improvements on its own > > are significant, and it provides the basis for significantly larger > > improvements again (JIT) - those followup improvements are large patches > > again though, so I'd rather not do all of that next cycle. > > > > My next step (over the weekend) is to add tests to execQual.c to get it > > a good chunk closer to 100% test coverage, and then do the same for the > > new implementation (there's probably very little additional tests needed > > after the conversion). Given all tests pass before/after, and there's a > > lot of them, I think we can have a reasonable confidence of a low bug > > density. > > That seems like a plan, but now would probably be a good time for some > other hackers to take note of this proposal. Well, the executor is long overdue for improvement, so I would love to have this in 10.0. I am not sure what additional polishing will happen if we punt it for 11.0. I think the only downside of having it in 10.0 is that it will not have lived in the source tree for as long a time between commit and PG release, but if it is tested, that seems fine. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + As you are, so once was I. As I am, so you will be. + + Ancient Roman grave inscription +
On Fri, Mar 10, 2017 at 7:42 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Fri, Mar 10, 2017 at 07:15:59PM -0500, Peter Eisentraut wrote:
>> On 3/10/17 14:40, Andres Freund wrote:
>>> I'd really like to get it in. The performance improvements on its own
>>> are significant, and it provides the basis for significantly larger
>>> improvements again (JIT) - those followup improvements are large patches
>>> again though, so I'd rather not do all of that next cycle.
>>>
>>> My next step (over the weekend) is to add tests to execQual.c to get it
>>> a good chunk closer to 100% test coverage, and then do the same for the
>>> new implementation (there's probably very little additional tests needed
>>> after the conversion). Given all tests pass before/after, and there's a
>>> lot of them, I think we can have a reasonable confidence of a low bug
>>> density.
>>
>> That seems like a plan, but now would probably be a good time for some
>> other hackers to take note of this proposal.
>
> Well, the executor is long overdue for improvement, so I would love to
> have this in 10.0. I am not sure what additional polishing will happen
> if we punt it for 11.0. I think the only downside of having it in 10.0
> is that it will not have lived in the source tree for as long a time
> between commit and PG release, but if it is tested, that seems fine.

Yeah. I think we can't rule out the possibility that this patch is
full of bugs, but (1) a lot of the change is pretty mechanical and (2)
none of this touches the on-disk format or the catalog contents, so it
doesn't break anybody's existing install if we have to patch it. On
the other hand, (3) it's a big patch and (4) if expression evaluation
doesn't work, PostgreSQL is pretty much unusable. So my feeling is:

1. If Andres commits this and it turns out to be seriously buggy, he
should be prepared to revert it without a big fight. We can try again
for v11.

2. If Andres commits this and it turns out to be mildly buggy, he
should be prepared to address bugs very proactively and very promptly.

3. If Andres wants to commit this for v10, it should be done RSN.
Feature freeze isn't technically until the end of March, but giant
potentially-destabilizing patches that land on March 31st are likely
to lead to outcomes that nobody wants.

Like, I think, everyone else, I'd really like to have these speedups
and I think they will benefit everyone who uses PostgreSQL. However, I
think that we cannot afford to have a repeat of the simplehash thing
where serious bugs sat around for months without really getting
properly addressed. If that happens with this, I will be extremely
unhappy, and I'm fairly sure I won't be alone.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 03/07/2017 04:01 AM, Andres Freund wrote:
> Hi,
>
> Attached is an updated version of the patchset, but more importantly
> some benchmark numbers.

I wanted to do a bit of testing and benchmarking on this, but 0004
seems to be a bit broken. The patch does not apply anymore - there are
some conflicts in execQual.c, but I think I fixed those. But then I ran
into a bunch of compile-time errors, because some of the executor nodes
still reference bits that were moved elsewhere. E.g. there's no
targetlist in PlanState anymore, yet nodeGatherMerge and
nodeTableFuncscan do this:

	gm_state->ps.targetlist = (List *)
		ExecInitExpr((Expr *) node->plan.targetlist,
					 (PlanState *) gm_state);

Some of the nodes also assign to ps.qual values that are (List *), but
that field was switched to ExprState. That needs fixing, I guess.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On 2017-03-12 05:40:51 +0100, Tomas Vondra wrote:
> I wanted to do a bit of testing and benchmarking on this, but 0004
> seems to be a bit broken.

Well, "broken" in the sense that it's already outdated, because other
stuff that got merged.

> The patch does not apply anymore - there are some conflicts in
> execQual.c, but I think I fixed those. But then I ran into a bunch of
> compile-time errors, because some of the executor nodes still
> reference bits that were moved elsewhere.

Updated patch attached. Note that this patch has two changes I've not
yet evaluated performance-wise.

Questions:
- New code calls UpdateDomainConstraintRef at the beginning of
  expression evaluation, not for each row. That allows to generate
  "proper programs" for constraint checking, which is quite useful for
  performance. But it's a behaviour change.
- I have *not* changed that, but I'm less than convinced of the pattern
  of "runtime" get_cached_rowtype calls.

Changes:
- comments
- fix minor bug in BooleanTest brought to light thanks to the new
  regression tests (No, IS TRUE isn't the same as IS NOT FALSE)
- fix (theoretical?) issue with get_cached_rowtype() being called once
  for deforming and once for forming a tuple for FieldStores.
- fixed/implemented an optimization XXX around bool AND/OR because the
  solution actually makes the code easier.

What's basically missing here is:
- pgindent run and minimizing resulting damage
- A higher level explanation of how the interpreted expression
  evaluation works. This patch implements a "fairly standard" approach,
  but it's imo sufficiently involved that it deserves one.
- There's one remaining FIXME that needs to be fixed.
- I sometimes see a minor performance regression (~1.2%) in TPC-H's Q17
  that might or might not be code layout related.
- continue the pass to ensure all worthwhile comments from execQual.c
  are retained.

Could be delayed:
- add regression tests for track_functions - I've manually verified it
  works, and we previously didn't have tests either.

Greetings,

Andres Freund
On 13.03.2017 11:03, Andres Freund wrote:
> Hi,
>
> On 2017-03-12 05:40:51 +0100, Tomas Vondra wrote:
>> I wanted to do a bit of testing and benchmarking on this, but 0004
>> seems to be a bit broken.
>
> Well, "broken" in the sense that it's already outdated, because other
> stuff that got merged.
>
>> The patch does not apply anymore - there are some conflicts in
>> execQual.c, but I think I fixed those. But then I ran into a bunch of
>> compile-time errors, because some of the executor nodes still
>> reference bits that were moved elsewhere.
>
> Updated patch attached. Note that this patch has two changes I've not
> yet evaluated performance-wise.
I got the following results on my system with an Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz and 16GB RAM,
TPC-H Q1/Q6 scale 10, shared_buffers=8GB, pg_prewarm on the lineitem table projection:
             |    Q1    |    Q6
-------------+----------+---------
Master       | 7503 ms  | 1171 ms
Your patch   | 6420 ms  | 1034 ms
VOPS         |  396 ms  |  249 ms
VOPS + patch |  367 ms  |  233 ms
-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
On 03/13/2017 09:03 AM, Andres Freund wrote: > Hi, > > On 2017-03-12 05:40:51 +0100, Tomas Vondra wrote: >> I wanted to do a bit of testing and benchmarking on this, but 0004 seems to >> be a bit broken. > > Well, "broken" in the sense that it's already outdated, because other > stuff that got merged. > Yes, that's what I meant. Sorry for not being quite clear. > >> The patch does not apply anymore - there are some conflicts >> in execQual.c, but I think I fixed those. But then I ran into a bunch of >> compile-time errors, because some of the executor nodes still reference bits >> that were moved elsewhere. > > Updated patch attached. Note that this patch has two changes I've not > yet evaluated performance-wise. > And which are those? I plan to do a bit of testing / benchmarking on this later this week, depending on other testing already happening on my machine. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On 2017-03-13 01:03:51 -0700, Andres Freund wrote:
> What's basically missing here is:
> - pgindent run and minimizing resulting damage

Running into a bit of an issue here - pgindent mangles something like

	EEO_SWITCH (op->opcode)
	{
		EEO_CASE(EEO_DONE):
			goto out;

		EEO_CASE(EEO_INNER_FETCHSOME):
			/* XXX: worthwhile to check tts_nvalid inline first? */
			slot_getsomeattrs(innerslot, op->d.fetch.last_var);
			EEO_DISPATCH(op);

		EEO_CASE(EEO_OUTER_FETCHSOME):
			slot_getsomeattrs(outerslot, op->d.fetch.last_var);
			EEO_DISPATCH(op);

		EEO_CASE(EEO_SCAN_FETCHSOME):
			slot_getsomeattrs(scanslot, op->d.fetch.last_var);
			EEO_DISPATCH(op);

		EEO_CASE(EEO_INNER_VAR):
			{
				int			attnum = op->d.var.attnum;

				/*
				 * Can't assert tts_nvalid, as wholerow var evaluation or such
				 * could have materialized the slot - but the contents are
				 * still valid :/
				 */
				Assert(op->d.var.attnum >= 0);
				*op->resnull = innerslot->tts_isnull[attnum];
				*op->resvalue = innerslot->tts_values[attnum];
				EEO_DISPATCH(op);
			}

into

	EEO_SWITCH(op->opcode)
	{
EEO_CASE(EEO_DONE):
		goto out;

EEO_CASE(EEO_INNER_FETCHSOME):
		/* XXX: worthwhile to check tts_nvalid inline first? */
		slot_getsomeattrs(innerslot, op->d.fetch.last_var);
		EEO_DISPATCH(op);

EEO_CASE(EEO_OUTER_FETCHSOME):
		slot_getsomeattrs(outerslot, op->d.fetch.last_var);
		EEO_DISPATCH(op);

EEO_CASE(EEO_SCAN_FETCHSOME):
		slot_getsomeattrs(scanslot, op->d.fetch.last_var);
		EEO_DISPATCH(op);

EEO_CASE(EEO_INNER_VAR):
		{
			int			attnum = op->d.var.attnum;

			/*
			 * Can't assert tts_nvalid, as wholerow var evaluation or such
			 * could have materialized the slot - but the contents are still
			 * valid :/
			 */
			Assert(op->d.var.attnum >= 0);
			*op->resnull = innerslot->tts_isnull[attnum];
			*op->resvalue = innerslot->tts_values[attnum];
			EEO_DISPATCH(op);
		}

which is a bit annoying. (the EEO_CASE is either a jump label or a case
statement, depending on computed goto availability).

It seems we could either:
1) live with the damage
2) disable pgindent
3) move the : inside EEO_CASE's definition, and only use {} blocks.

I'm inclined to go for 3). Opinions?

- Andres
Hi,

On 2017-03-13 22:37:51 +0100, Tomas Vondra wrote:
> On 03/13/2017 09:03 AM, Andres Freund wrote:
>> On 2017-03-12 05:40:51 +0100, Tomas Vondra wrote:
>>> I wanted to do a bit of testing and benchmarking on this, but 0004
>>> seems to be a bit broken.
>>
>> Well, "broken" in the sense that it's already outdated, because other
>> stuff that got merged.
>
> Yes, that's what I meant. Sorry for not being quite clear.

And the patch already again has conflicts. Updated version, with a good
chunk of other changes, attached.

>>> The patch does not apply anymore - there are some conflicts in
>>> execQual.c, but I think I fixed those. But then I ran into a bunch of
>>> compile-time errors, because some of the executor nodes still
>>> reference bits that were moved elsewhere.
>>
>> Updated patch attached. Note that this patch has two changes I've not
>> yet evaluated performance-wise.
>
> And which are those?

Primarily around minor performance regressions for the evaluation of
very simple (i.e. single Var/Const) expressions. I did a fair amount of
work hunting these down.

Besides that, this version has:
- pgindented most of the affected pieces (i.e. all significant new code
  has been reindented, not all touched code)
- Transported missing comments from execQual.c - afaics all relevant
  ones have been moved over
- Replaced ExprState->cb with evalfunc - that's what it was named
  previously, and I see no reason to diverge.
- significantly expanded header comments, especially in
  execInterpExpr.c, explaining the basic approach of expression
  evaluation. Feedback here would be greatly appreciated - I've spent
  so much time neck deep in this, that I don't quite know in how much
  detail to go.
- Initial polished commit messages and such.

My current plan is to not do very much on this tomorrow, do another
full pass on Wednesday, and push it, unless there's protest till then.

> I plan to do a bit of testing / benchmarking on this later this week,
> depending on other testing already happening on my machine.

Cool.

I've performed a bunch more benchmarks, and the results have been
pretty favorable. Due to some more work hunting down minor regressions
I can't make any TPC-H query perform worse than on master, with either
best/average of three executions. I also did a bunch of runs with
parallelism, and the results there are also pretty good - even better
than non-parallel sometimes.

These results include parallelism, but were done *before* rebasing on
top of Noah's changes. There was also, during all runs, a browser and
emacs running, so it's certainly possible that there's some skew.

-par is with parallelism enabled to 8, the other is without. Otherwise
same setup as two mails up.
master-par q01 min: 4629.643   master min: 15108.813 [diff +69.36]  dev min: 12629.0 [diff +63.34]   dev-par min: 3458.899 [diff -33.85]
master-par q02 min: 1048.511   master min: 1196.596 [diff +12.38]   dev min: 1182.652 [diff +11.34]  dev-par min: 1044.876 [diff -0.35]
master-par q03 min: 4651.492   master min: 6949.525 [diff +33.07]   dev min: 6441.467 [diff +27.79]  dev-par min: 3928.919 [diff -18.39]
master-par q04 min: 698.037    master min: 981.081 [diff +28.85]    dev min: 947.506 [diff +26.33]   dev-par min: 341.855 [diff -104.19]
master-par q05 min: 871.528    master min: 5119.728 [diff +82.98]   dev min: 4979.498 [diff +82.50]  dev-par min: 702.3 [diff -24.10]
master-par q06 min: 1107.307   master min: 1807.433 [diff +38.74]   dev min: 1694.513 [diff +34.65]  dev-par min: 738.037 [diff -50.03]
master-par q07 min: 4867.412   master min: 4745.141 [diff -2.58]    dev min: 4682.068 [diff -3.96]   dev-par min: 4207.644 [diff -15.68]
master-par q08 min: 432.159    master min: 1465.219 [diff +70.51]   dev min: 1416.691 [diff +69.50]  dev-par min: 418.331 [diff -3.31]
master-par q09 min: 3965.76    master min: 6441.099 [diff +38.43]   dev min: 6220.095 [diff +36.24]  dev-par min: 3875.673 [diff -2.32]
master-par q10 min: 1129.351   master min: 5818.318 [diff +80.59]   dev min: 5406.125 [diff +79.11]  dev-par min: 1107.759 [diff -1.95]
master-par q11 min: 247.598    master min: 291.061 [diff +14.93]    dev min: 277.18 [diff +10.67]    dev-par min: 241.983 [diff -2.32]
master-par q12 min: 1532.855   master min: 4552.034 [diff +66.33]   dev min: 4431.584 [diff +65.41]  dev-par min: 1395.703 [diff -9.83]
master-par q13 min: 8638.503   master min: 8009.64 [diff -7.85]     dev min: 7891.661 [diff -9.46]   dev-par min: 8314.437 [diff -3.90]
master-par q14 min: 723.878    master min: 764.35 [diff +5.29]      dev min: 730.526 [diff +0.91]    dev-par min: 717.66 [diff -0.87]
master-par q15 min: 1991.915   master min: 1809.162 [diff -10.10]   dev min: 1716.983 [diff -16.01]  dev-par min: 1879.927 [diff -5.96]
master-par q16 min: 1596.643   master min: 1843.553 [diff +13.39]   dev min: 1728.467 [diff +7.63]   dev-par min: 1429.103 [diff -11.72]
master-par q17 min: 354.357    master min: 418.761 [diff +15.38]    dev min: 414.958 [diff +14.60]   dev-par min: 322.471 [diff -9.89]
master-par q18 min: 11707.479  master min: 14216.567 [diff +17.65]  dev min: 14206.893 [diff +17.59] dev-par min: 11642.863 [diff -0.55]
master-par q19 min: 116.819    master min: 325.531 [diff +64.11]    dev min: 287.497 [diff +59.37]   dev-par min: 102.909 [diff -13.52]
master-par q20 min: 430.85     master min: 500.92 [diff +13.99]     dev min: 478.536 [diff +9.96]    dev-par min: 410.548 [diff -4.95]
master-par q22 min: 460.712    master min: 626.207 [diff +26.43]    dev min: 595.708 [diff +22.66]   dev-par min: 434.851 [diff -5.95]

These seem like pretty neat results. Sorry for not cleaning up the
representation - too tired.

Greetings,

Andres Freund
Andres Freund wrote:

> 	EEO_SWITCH(op->opcode)
> 	{
> EEO_CASE(EEO_DONE):
> 		goto out;

Oh my.

> which is a bit annoying. (the EEO_CASE is either a jump label or a case
> statement, depending on computed goto availability).
>
> It seems we could either:
> 1) live with the damage
> 2) disable pgindent
> 3) move the : inside EEO_CASE's definition, and only use {} blocks.

I think 3) is nasty because the result doesn't look like normal C. If
EEO_CASE() entries are potentially jump labels, then the indentation
becomes correct. Why not accept it? It looks odd, but that gives a
better hint to the reader about what's going on.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 3d6a3801c0..d205101b89 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -47,7 +47,14 @@
 #include "utils/rel.h"

-static bool get_last_attnums(Node *node, ProjectionInfo *projInfo);
+typedef struct LastLastAttnumInfo

LastLast?

Patch 0003 is huge. It would be good to have someone at least read it
before pushing, but I don't think anyone other than you has done so.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 03/14/2017 08:53 AM, Andres Freund wrote:
> Besides that, this version has:
> - pgindented most of the affected pieces (i.e. all significant new code
>   has been reindented, not all touched code)

I think you'll need to add all the inner structs of ExprEvalStep to
typedefs.list to indent them right.

> My current plan is to not do very much on this tomorrow, do another
> full pass on Wednesday, and push it, unless there's protest till then.

I looked at patch 0004. Some comments:

* EEO_DISPATCH_DIRECT(op) takes 'op' as an argument, but it really
assumes that 'op' has already been set to point to the jump target. I
find that a bit weird. I guess the idea is that you always pass the
Program Counter variable as 'op' argument. For consistency, would be
good if EEO_SWITCH() also took just 'op' as the argument, rather than
op->opcode. But I think it would be more clear if they both just
assumed that there's a variable called 'op' that points to the current
instruction.

* All the callers of EEO_DISPATCH_DIRECT(op) set 'op' just prior to
calling EEO_DISPATCH_DIRECT(op). How about having a macro
EEO_JUMP(<step number>), to encapsulate setting 'op' and jumping to it
in one operation?

* ExecEvalStepOp() seems relatively expensive, with the linear scan
over all the opcodes, if called on an ExprState that already has
EEO_FLAG_JUMP_THREADED set. All the callers use it to check if the
opcode is a particular one, so you could check if the opcode matches
that particular one, instead of scanning the dispatch table to find
what it is.

* But does ExecEvalStepOp() ever actually get called on an expression
with EEO_FLAG_JUMP_THREADED already set? It's only used in
ExecInstantiateInterpretedExpr(), and it's called only once on each
ExprState. How about just adding an Assert that EEO_FLAG_JUMP_THREADED
is not yet set? Or drop the EEO_FLAG_JUMP_THREADED flag altogether,
and assert that evalfunc != ExecInterpExpr.

* How tight are we on space in the ExprEvalStep union? Currently, the
jump-threading preparation replaces the opcodes with the goto labels,
but it would be really nice to keep the original opcodes, for
debugging purposes if nothing else.

* execInterpExpr.c is quite a mouthful. How about exprInterpreter.c?

Attached is a patch, on top of your
0004-Faster-expression-evaluation-and-targetlist-projecti.patch, with
some minor tweaks and comments here and there (search for HEIKKI).
It's mostly the same stuff I listed above, implemented in a quick &
dirty way. I hope it's helpful.

- Heikki
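The suggested EEO_JUMP() could be as simple as the following sketch, assuming the interpreter keeps the current instruction in a local 'op' and the steps array in 'state' (both names are guesses from the surrounding discussion):

	/* Sketch: point 'op' at step 'stepno' and dispatch to it in one go. */
	#define EEO_JUMP(stepno) \
		do { \
			op = &state->steps[stepno]; \
			EEO_DISPATCH_DIRECT(op); \
		} while (0)

That keeps the "set op, then dispatch" pairing in a single place, so individual opcode implementations cannot get it half right.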
Hi,

On 2017-03-14 16:58:54 +0200, Heikki Linnakangas wrote:
> On 03/14/2017 08:53 AM, Andres Freund wrote:
>> Besides that, this version has:
>> - pgindented most of the affected pieces (i.e. all significant new code
>>   has been reindented, not all touched code)
>
> I think you'll need to add all the inner structs of ExprEvalStep to
> typedefs.list to indent them right.

Yea, I saw that. Since they're not named atm, that'd probably not work
well. I'd be inclined to just live with it :/

> I looked at patch 0004.

Thanks!

> * EEO_DISPATCH_DIRECT(op) takes 'op' as an argument, but it really
> assumes that 'op' has already been set to point to the jump target. I
> find that a bit weird. I guess the idea is that you always pass the
> Program Counter variable as 'op' argument. For consistency, would be
> good if EEO_SWITCH() also took just 'op' as the argument, rather than
> op->opcode. But I think it would be more clear if they both just
> assumed that there's a variable called 'op' that points to the current
> instruction.

Hm. I dislike magic variable names. I think I prefer your idea of
passing the op entirely to EEO_SWITCH.

> * All the callers of EEO_DISPATCH_DIRECT(op) set 'op' just prior to
> calling EEO_DISPATCH_DIRECT(op). How about having a macro
> EEO_JUMP(<step number>), to encapsulate setting 'op' and jumping to it
> in one operation?

Ok.

> * ExecEvalStepOp() seems relatively expensive, with the linear scan
> over all the opcodes, if called on an ExprState that already has
> EEO_FLAG_JUMP_THREADED set. All the callers use it to check if the
> opcode is a particular one, so you could check if the opcode matches
> that particular one, instead of scanning the dispatch table to find
> what it is.

It should be fairly cheap at that place, unless the same expression is
instantiated twice for some reason. I have been wondering about adding
a second table ptr->op that can be binary searched for the operation
to make it cheaper.

> * But does ExecEvalStepOp() ever actually get called on an expression
> with EEO_FLAG_JUMP_THREADED already set? It's only used in
> ExecInstantiateInterpretedExpr(), and it's called only once on each
> ExprState. How about just adding an Assert that EEO_FLAG_JUMP_THREADED
> is not yet set? Or drop the EEO_FLAG_JUMP_THREADED flag altogether,
> and assert that evalfunc != ExecInterpExpr.

The primary use of ExecEvalStepOp() is really debugging, so I'd like
to avoid adding that restriction. I guess we can add a second function
for this if needed.

> * How tight are we on space in the ExprEvalStep union? Currently, the
> jump-threading preparation replaces the opcodes with the goto labels,
> but it would be really nice to keep the original opcodes, for
> debugging purposes if nothing else.

There's nothing left to spare :(. For debugging I've just used
ExecEvalStepOp() - it's inconvenient to directly print the struct
anyway, since gdb prints the whole union...

> * execInterpExpr.c is quite a mouthful. How about exprInterpreter.c?
>
> Attached is a patch, on top of your
> 0004-Faster-expression-evaluation-and-targetlist-projecti.patch, with
> some minor tweaks and comments here and there (search for HEIKKI).
> It's mostly the same stuff I listed above, implemented in a quick &
> dirty way. I hope it's helpful.

Thanks!

> typedef enum ExprEvalOp
> {
> +	/*
> +	 * HEIKKI: How about renaming these to EEOP_* ? I think it'd be a
> +	 * good mnemonic to remember that these are opcodes, and just one
> +	 * extra letter.
> +	 */

I don't mind the change.
> +	/* HEIKKI: Is it safe to assume that sizeof(size_t) >= sizeof(void *) ? */

Yes, seems very likely - it's supposed to hold any potential sizeof()
return value. You could end up on platforms where that's not true, if
it used tagged pointers or such. I guess we could instead use a union,
but that doesn't seem that much of a benefit.

Greetings,

Andres Freund
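For context, the ExecEvalStepOp() reverse lookup being discussed is essentially the following shape - a sketch that assumes a dispatch_table[] of label addresses and an EEOP_LAST sentinel, both of which are guesses at naming rather than the patch's actual identifiers:

	/* Sketch: recover the ExprEvalOp enum value from a jump-threaded step. */
	static ExprEvalOp
	ExecEvalStepOp(ExprState *state, ExprEvalStep *op)
	{
		if (state->flags & EEO_FLAG_JUMP_THREADED)
		{
			int			i;

			/* opcode now holds a label address; scan the table to map it back */
			for (i = 0; i < EEOP_LAST; i++)
			{
				if ((void *) op->opcode == dispatch_table[i])
					return (ExprEvalOp) i;
			}
			elog(ERROR, "unknown expression opcode");
		}
		return (ExprEvalOp) op->opcode;
	}

The linear scan is why it is cheap enough for one-time instantiation and debugging, but would be worth replacing with a sorted lookup table if it were ever on a hot path.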
On 2017-03-14 08:28:02 -0300, Alvaro Herrera wrote: > Andres Freund wrote: > > > EEO_SWITCH(op->opcode) > > { > > EEO_CASE(EEO_DONE): > > goto out; > > Oh my. > > > which is a bit annoying. (the EEO_CASE is either a jump label or a case > > statement, depending on computed goto availability). > > > > It seems we could either: > > 1) live with the damage > > 2) disable pgindent > > 3) move the : inside EEO_CASE's definition, and only use {} blocks. > > I think 3) is nasty because the result doesn't look like normal C. We have a bunch of such constructs already however. foreach(), PG_TRY(). That means 3) has the advantage that our editors and pgindent can already deal with it. > EEO_CASE() are potentially jump labels, then indentation becomes > correct. Why not accept it? It looks odd, but that gives a better hint > to the reader about what's going on. Seems likely that every editor would indent them differently :( I'm leaning towards 3) for the reasons above and after trying out the different indentations, but I don't have a strong opinion. Greetings, Andres Freund
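For readers following along, the two expansions under discussion look roughly like this - a sketch of plausible macro bodies, not the patch's exact definitions (the callers are expected to point 'op' at the next instruction before dispatching, which is what the EEO_JUMP() idea upthread encapsulates):

	#if defined(HAVE_COMPUTED_GOTO)
	/* EEO_CASE(x): expands to a plain goto label; dispatch jumps straight
	 * to the label address stored in the step's opcode field */
	#define EEO_SWITCH(d)				/* no switch needed */
	#define EEO_CASE(name)		CASE_##name
	#define EEO_DISPATCH(op)	goto *((void *) (op)->opcode)
	#else
	/* fallback: a real switch, re-entered via goto for each instruction */
	#define EEO_SWITCH(d)		starteval: switch (d)
	#define EEO_CASE(name)		case name
	#define EEO_DISPATCH(op)	goto starteval
	#endif

In the computed-goto build, `EEO_CASE(EEO_DONE):` really is a bare label, which is why pgindent (and editors) treat it the way Álvaro describes.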
On 2017-03-14 08:44:24 -0300, Alvaro Herrera wrote: > Patch 0003 is huge. I suspect you mean 0004? If so - yes :(. I unfortunately don't think there's a useful way to split it in smaller chunks - I originally moved ops over one-by-one, but that required a lot of duplicated structs and such... > It would be good to have someone at least read it before pushing, but > I don't think anyone other than you has done so. I'd love for somebody else to look through it, I tried asking multiple times... Seems like threatening a commit is the best way to get it :P - Andres
Andres Freund <andres@anarazel.de> writes: > On 2017-03-14 08:44:24 -0300, Alvaro Herrera wrote: >> It would be good to have someone at least read it before pushing, but >> I don't think anyone other than you has done so. > I'd love for somebody else > to look through it, I tried asking multiple times... Seems like > threatening a commit is the best way to get it :P I'm willing to look, but the last few messages make it sound like you're not all that close to a finished version. regards, tom lane
On 03/14/2017 07:31 PM, Andres Freund wrote:
> On 2017-03-14 16:58:54 +0200, Heikki Linnakangas wrote:
>> * How tight are we on space in the ExprEvalStep union? Currently, the
>> jump-threading preparation replaces the opcodes with the goto labels,
>> but it would be really nice to keep the original opcodes, for
>> debugging purposes if nothing else.
>
> There's nothing left to spare :(. For debugging I've just used
> ExecEvalStepOp() - it's inconvenient to directly print the struct
> anyway, since gdb prints the whole union...

This comment above the union is not accurate:

> /*
>  * Data for an operation. Inline stored data is faster to access, but also
>  * bloats the size of all instructions. The union should be kept below 48
>  * bytes (so the entire struct is below 64 bytes, a single cacheline on
>  * common systems).
>  */

On my x86-64 system, the whole struct is 64 bytes, but the union is
only 40 bytes. The fields before the union occupy 24 bytes. IOW, the
union should be kept below 40 bytes, not 48.

Hmm. Could we make the instructions variable size? It would allow
packing the small instructions even more tight, and we wouldn't need
to obsess over a particular maximum size for more complicated
instructions.

A compromise might be to have the fixed size, but have "continuation"
opcodes for the more complicated instructions. An XmlExpr instruction,
for example, could have an extra instruction after the primary one,
just to hold more fields. When evaluating it, we would just increment
the Program Counter by two instead of one, to skip over the extra
instruction.

For reference, here are the sizes of all the structs in the union:

40: xmlexpr
40: scalararrayop
40: minmax
40: iocoerce
40: convert_rowtype
32: wholerow
32: rowcompare_step
32: row
32: func
32: fieldstore
32: domaincheck
32: boolexpr
32: arrayexpr
32: arraycoerce
24: casewhen
24: casethen
24: arrayref_checksubscript
16: grouping_func
16: fieldselect
16: constval
16: casetest
16: assign_var
16: arrayref
8: window_func
8: subplan
8: sqlvaluefunction
8: param
8: nulltest_row
8: assign_tmp
8: alternative_subplan
8: aggref
4: var
4: rowcompare_final
4: qualexpr
4: fetch
4: coalesce

- Heikki
Hi,

On 2017-03-14 20:28:51 +0200, Heikki Linnakangas wrote:
> On 03/14/2017 07:31 PM, Andres Freund wrote:
>> On 2017-03-14 16:58:54 +0200, Heikki Linnakangas wrote:
>>> * How tight are we on space in the ExprEvalStep union? Currently, the
>>> jump-threading preparation replaces the opcodes with the goto labels,
>>> but it would be really nice to keep the original opcodes, for
>>> debugging purposes if nothing else.
>>
>> There's nothing left to spare :(. For debugging I've just used
>> ExecEvalStepOp() - it's inconvenient to directly print the struct
>> anyway, since gdb prints the whole union...
>
> This comment above the union is not accurate:
>
>> /*
>>  * Data for an operation. Inline stored data is faster to access, but also
>>  * bloats the size of all instructions. The union should be kept below 48
>>  * bytes (so the entire struct is below 64 bytes, a single cacheline on
>>  * common systems).
>>  */
>
> On my x86-64 system, the whole struct is 64 bytes, but the union is
> only 40 bytes. The fields before the union occupy 24 bytes. IOW, the
> union should be kept below 40 bytes, not 48.

Point. Will update.

> Hmm. Could we make the instructions variable size? It would allow
> packing the small instructions even more tight, and we wouldn't need
> to obsess over a particular maximum size for more complicated
> instructions.

That makes jumps a lot more complicated. I'd experimented with it and
given it up as "not worth it". If we were to try to do so, we'd also
not like storing the pointer and enum variants both, since it'd again
reduce the density.

> A compromise might be to have the fixed size, but have "continuation"
> opcodes for the more complicated instructions. An XmlExpr instruction,
> for example, could have an extra instruction after the primary one,
> just to hold more fields. When evaluating it, we would just increment
> the Program Counter by two instead of one, to skip over the extra
> instruction.

I think for cases where 40 bytes becomes the limit, it's usually better
to have out-of-line state. We could do this, but we'd also waste a
bunch of space in the array (namely the 24 byte Op "header").

Greetings,

Andres Freund
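The size budget above is the kind of invariant that can be pinned down with a compile-time check; here is a sketch using StaticAssertStmt() from c.h - not something the patch is known to contain, just one way to keep the comment honest (it assumes the ExprEvalStep definition is in scope):

	/*
	 * Guard the layout assumption discussed above: 24 bytes of fixed
	 * fields plus a union of at most 40 bytes must fit one 64-byte
	 * cache line.
	 */
	static inline void
	ExprEvalStepSizeCheck(void)
	{
		StaticAssertStmt(sizeof(ExprEvalStep) <= 64,
						 "ExprEvalStep must not exceed a cache line");
	}

With such a check, anyone who grows one of the union members past the budget gets a build failure instead of a silent slowdown.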
Andres, sorry I haven't had a chance to look at this great stuff you've been doing. I've wanted to get to it, but work keeps getting in the way. ;-)
I do have one observation based on my experiments with your first version of the code. In my tests, I found that expression init becomes a lot more expensive in this new model. (That's neither a surprise, nor a concern.) In particular, the function ExprEvalPushStep() is quite hot. In my code I made the following changes:
* Declare ExprEvalPushStep() "inline".
* Remove the "if (es->steps_alloc == 0)" condition from ExprEvalPushStep().
* In ExecInitExpr(), add:
state->steps_alloc = 16;
state->steps = palloc(sizeof(ExprEvalStep) * state->steps_alloc);
I found that this cut the cost of initializing the expression by about 20%. (Of course, that was on version 1 of your code, so the benefit may well be different now.)
On Tue, Mar 14, 2017 at 11:51 AM Andres Freund <andres@anarazel.de> wrote:
>> Hmm. Could we make the instructions variable size? It would allow packing
>> the small instructions even more tight, and we wouldn't need to obsess over
>> a particular maximum size for more complicated instructions.
>
> That makes jumps a lot more complicated. I'd experimented with it and
> given it up as "not worth it".
Back when I was at IBM, I spent a lot of time doing stuff like this. If you want to commit with the fixed size arrays, I'm happy to volunteer to look at packing it tighter as a follow-on piece of work. (It was already on my list of things to try anyhow.)
> If we were to try to do so, we'd also not like storing the pointer and
> enum variants both, since it'd again reduce the density.
From my experience, it's worth the small loss in density to carry around both the pointer and the enum - it makes debugging so much easier.
- Doug
Salesforce
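Put together, the suggestions above amount to a push-step path roughly like this sketch (steps and steps_alloc are field names from the messages above; steps_len and the doubling policy are assumptions added for illustration):

	/* Sketch of ExprEvalPushStep() with the suggested tweaks applied:
	 * declared inline, and with the steps_alloc == 0 branch removed
	 * because ExecInitExpr() now pre-allocates 16 entries. */
	static inline void
	ExprEvalPushStep(ExprState *es, const ExprEvalStep *s)
	{
		if (es->steps_len >= es->steps_alloc)
		{
			/* initial capacity was set up front, so just grow */
			es->steps_alloc *= 2;
			es->steps = repalloc(es->steps,
								 sizeof(ExprEvalStep) * es->steps_alloc);
		}
		memcpy(&es->steps[es->steps_len++], s, sizeof(ExprEvalStep));
	}

The win comes from the common case becoming a bounds check plus a memcpy, with no function-call overhead and no never-taken "first allocation" branch in the loop.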
Hi,

On 2017-03-14 22:03:45 +0000, Douglas Doole wrote:
> I do have one observation based on my experiments with your first
> version of the code. In my tests, I found that expression init becomes
> a lot more expensive in this new model. (That's neither a surprise,
> nor a concern.)

I suspect that's to a good degree because back then it'd often end up
trying the "new" stuff, but then fall back to the old machinery due to
unsupported operations. Which isn't a concern anymore, because now
there's full coverage. Otherwise the costs aren't, according to my
measurements, higher than before, excepting that some stuff that
happened at execution time is now happening during initialization.

> In particular, the function ExprEvalPushStep() is quite hot. In my
> code I made the following changes:
>
> * Declare ExprEvalPushStep() "inline".
> * Remove the "if (es->steps_alloc == 0)" condition from
>   ExprEvalPushStep().
> * In ExecInitExpr(), add:
>     state->steps_alloc = 16;
>     state->steps = palloc(sizeof(ExprEvalStep) * state->steps_alloc);
>
> I found that this cut the cost of initializing the expression by about
> 20%. (Of course, that was on version 1 of your code, so the benefit
> may well be different now.)

Hm. Right now ExprState's are allocated in several places - but we
could easily move that to a central place. Have a bit of a hard time
seeing that that branch during *initialization* time is that expensive,
especially given that previously we allocated a lot of memory
separately too.

> On Tue, Mar 14, 2017 at 11:51 AM Andres Freund <andres@anarazel.de> wrote:
>>> Hmm. Could we make the instructions variable size? It would allow
>>> packing the small instructions even more tight, and we wouldn't need
>>> to obsess over a particular maximum size for more complicated
>>> instructions.
>>
>> That makes jumps a lot more complicated. I'd experimented with it and
>> given it up as "not worth it".
>
> Back when I was at IBM, I spent a lot of time doing stuff like this. If
> you want to commit with the fixed size arrays, I'm happy to volunteer
> to look at packing it tighter as a follow-on piece of work. (It was
> already on my list of things to try anyhow.)

Yea, I think experimenting with that is a good idea. I just think this
is complicated enough, and I don't want to add a whole lot more
complexity for very debatable benefit.

>> If we were to try to do so, we'd also not like storing the pointer and
>> enum variants both, since it'd again reduce the density.
>
> From my experience, it's worth the small loss in density to carry
> around both the pointer and the enum - it makes debugging so much
> easier.

I found it annoying enough to print the instruction itself, due to the
union - so printing the opcode using a function isn't too bad...

Greetings,

Andres Freund
On Tue, Mar 14, 2017 at 3:16 PM Andres Freund <andres@anarazel.de> wrote:
> Hm. Right now ExprState's are allocated in several places - but we
> could easily move that to a central place. Have a bit of a hard time
> seeing that that branch during *initialization* time is that expensive,
> especially given that previously we allocated a lot of memory
> separately too.
I didn't make any comparisons of the cost of the new init against the old init with this change in particular - I just saw that it made the new init faster. I also didn't play around to determine whether the savings came from removing the branch misprediction, from the inlining, or both.
I certainly wouldn't hold up your commit for this, but it's something that might be worth a second look once the dust has settled.
Hi,

On 2017-03-14 16:58:54 +0200, Heikki Linnakangas wrote:
> * execInterpExpr.c is quite a mouthful. How about exprInterpreter.c?

I applied most of your changes in the attached version. Notes:

* I quite like how the EEO_JUMP/EEO_NEXT stuff looks now / after your
  change, that's clearly better.
* I reverted the removal of EEO_SWITCH() - pgindent goes a bit bonkers
  if it can see the switch(), but not the case: statements.
* I re-added ExecEvalStepOp() - for one it's hugely useful during
  debugging, for another I'm actually using it in a followup patch (so
  JIT compilation can run after the interpreter has been initialized).
  Added code guaranteeing that we can't run
  ExecInstantiateInterpretedExpr() twice.
* Renamed EEO_* opcode enum members to EEOP_* as suggested
* I've not yet renamed execInterpExpr.c - I don't like
  execInterpreter.c because it doesn't reference expressions and
  execExprInterpreter.c seems a bit long - but I can live with both
  (preferring the latter). Not that surprisingly I can also live with
  execInterpExpr.c ;)
* Addressed your size_t sizing concern by using, as is proper anyway,
  intptr_t.
* I tried to clarify the comment in execExpr.c's header that you marked
  as hard to understand.

Greetings,

Andres Freund
On 2017-03-14 14:19:18 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2017-03-14 08:44:24 -0300, Alvaro Herrera wrote: > >> It would be good to have someone at least read it before pushing, but > >> I don't think anyone other than you has done so. > > > I'd love for somebody else > > to look through it, I tried asking multiple times... Seems like > > threatening a commit is the best way to get it :P > > I'm willing to look Cool. > but the last few messages make it sound like you're not all that close > to a finished version. Hm, doesn't seem that far off to me - the latest few rounds of changes have been largely polishing (naming, fighting w/ pgindent, language polishing in comments, etc), otherwise I've basically just added a few lines of optimization after doing more profiling. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > [ new patch versions ] About to leave, but I had time to read 0003: It seems bizarre that you chose to spell the new configure symbol as HAVE__COMPUTED_GOTO rather than HAVE_COMPUTED_GOTO, especially so when the comment for PGAC_C_COMPUTED_GOTO alleges it's the latter. Also, being a neatnik, I would insert both the definition and the call of that macro between PGAC_C_BUILTIN_UNREACHABLE and PGAC_C_VA_ARGS, keeping it more or less in alphabetical order. regards, tom lane
Hi, On 2017-03-14 19:34:12 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > [ new patch versions ] > > About to leave, but I had time to read 0003: > > It seems bizarre that you chose to spell the new configure symbol as > HAVE__COMPUTED_GOTO rather than HAVE_COMPUTED_GOTO I went back-and-forth about this a number of times. We have a bunch of symbols defined with HAVE__ as a prefix (and some with HAVE_GCC__) - and more of the nearby code seems to use __ rather than _. I don't really know why we started doing that, but it's far from new.. Any idea why we introduce __ stuff? > Also, being a neatnik, I would insert both the definition and the > call of that macro between PGAC_C_BUILTIN_UNREACHABLE and > PGAC_C_VA_ARGS, keeping it more or less in alphabetical order. That works for me. Thanks, Andres
On 3/14/17 19:40, Andres Freund wrote: > Any idea why we introduce __ stuff? Because the symbols start with an underscore: /* Define to 1 if your compiler understands _Static_assert. */ #undef HAVE__STATIC_ASSERT There is apparently some inconsistency when symbols start with more than one underscore. But we don't need to introduce underscores that are not there. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2017-03-14 23:10:25 -0400, Peter Eisentraut wrote: > On 3/14/17 19:40, Andres Freund wrote: > > Any idea why we introduce __ stuff? > > Because the symbols start with an underscore: > > /* Define to 1 if your compiler understands _Static_assert. */ > #undef HAVE__STATIC_ASSERT Oh, I guess that makes some sense. But we do so very inconsistently, given we only keep the leading underscores not the trailing ones (see HAVE__VA_ARGS, HAVE_FUNCNAME__FUNC). > But we don't need to introduce underscores that are not there. Indeed. Don't want to repost for just this, but consider it adapted (and moved as requested by Tom). - Andres
Andres Freund <andres@anarazel.de> writes: > On 2017-03-14 19:34:12 -0400, Tom Lane wrote: >> It seems bizarre that you chose to spell the new configure symbol as >> HAVE__COMPUTED_GOTO rather than HAVE_COMPUTED_GOTO > I went back-and-forth about this a number of times. We have a bunch of > symbols defined with HAVE__ as a prefix (and some with HAVE_GCC__) - and > more of the nearby code seems to use __ rather than _. I don't really > know why we started doing that, but it's far from new.. > Any idea why we introduce __ stuff? The nearby stuff is describing features that have a specific name that includes a leading underscore or two, like __builtin_unreachable(). That doesn't seem to apply here, so I wouldn't add extra underscores. regards, tom lane
Andres Freund <andres@anarazel.de> writes: > [ latest patches ] I looked through 0002-Make-get_last_attnums-more-generic.patch. Although it seems relatively unobjectionable on its own, I'm not convinced that it's really useful to try to split it out like this. I see that 0004 removes the only call of ExecGetLastAttnums (the one in ExecBuildProjectionInfo) and then adds a single call in ExecInitExprSlots which is in execExpr.c. To me, the only reason ExecGetLastAttnums nee get_last_attnums is in execUtils.c in the first place is that it is a private subroutine of ExecBuildProjectionInfo. After these changes, it might as well be a private subroutine of ExecInitExprSlots. I'm suspicious of turning it into a globally accessible function as you've done here, because I doubt that it is of global use --- in particular, the fact that it doesn't deal specially with INDEX_VAR Vars seems rather specific to this one use-case. So for my money, you should drop 0002 altogether and just have 0004 remove get_last_attnums() from execUtils.c and stick it into execExpr.c. I suppose you still need the LastAttnumInfo API change so as to decouple it from ProjectionInfo, but that's minor. regards, tom lane
On March 15, 2017 12:33:28 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
>> [ latest patches ]
>
> I looked through 0002-Make-get_last_attnums-more-generic.patch.
>
> So for my money, you should drop 0002 altogether and just have 0004
> remove get_last_attnums() from execUtils.c and stick it into
> execExpr.c. I suppose you still need the LastAttnumInfo API change
> so as to decouple it from ProjectionInfo, but that's minor.

I think it's quite useful in other places too, we do repeated
slot-getattrs in a bunch of places in the executor, and it's noticeable
performance wise.

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Andres Freund <andres@anarazel.de> writes: > On March 15, 2017 12:33:28 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> So for my money, you should drop 0002 altogether and just have 0004 >> remove get_last_attnums() from execUtils.c and stick it into >> execExpr.c. I suppose you still need the LastAttnumInfo API change >> so as to decouple it from ProjectionInfo, but that's minor. > I think it's quite useful in other places too, we do repeated slot-getattrs in a bunch of places in the executor, and it'snoticeable performance wise. Color me dubious. Which specific other places have you got in mind, and do they have expression trees at hand that would tell them which columns they really need to pull out? regards, tom lane
On 2017-03-15 15:41:22 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On March 15, 2017 12:33:28 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> So for my money, you should drop 0002 altogether and just have 0004
>>> remove get_last_attnums() from execUtils.c and stick it into
>>> execExpr.c. I suppose you still need the LastAttnumInfo API change
>>> so as to decouple it from ProjectionInfo, but that's minor.
>
>> I think it's quite useful in other places too, we do repeated
>> slot-getattrs in a bunch of places in the executor, and it's
>> noticeable performance wise.
>
> Color me dubious. Which specific other places have you got in mind, and
> do they have expression trees at hand that would tell them which columns
> they really need to pull out?

I was thinking of execGrouping.c's execTuplesMatch(),
TupleHashTableHash() (and unequal, but doubt that matters
performancewise). There's also nodeHash.c's ExecHashGetHashValue(),
but I think that'd possibly better fixed differently.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2017-03-15 15:41:22 -0400, Tom Lane wrote:
>> Color me dubious. Which specific other places have you got in mind, and
>> do they have expression trees at hand that would tell them which columns
>> they really need to pull out?
>
> I was thinking of execGrouping.c's execTuplesMatch(),
> TupleHashTableHash() (and unequal, but doubt that matters
> performancewise). There's also nodeHash.c's ExecHashGetHashValue(),
> but I think that'd possibly better fixed differently.

The execGrouping.c functions don't have access to an expression tree
instructing them which columns to pull out of the tuple, so I fail to
see how get_last_attnums() would be of any use to them.

As for ExecHashGetHashValue, it's most likely going to be working from
virtual tuples passed up to the join, which won't benefit from
predetermination of the last column to be accessed. The
tuple-deconstruction would have happened while projecting in the scan
node below.

			regards, tom lane
On 2017-03-15 16:07:14 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On 2017-03-15 15:41:22 -0400, Tom Lane wrote:
>>> Color me dubious. Which specific other places have you got in mind, and
>>> do they have expression trees at hand that would tell them which columns
>>> they really need to pull out?
>
>> I was thinking of execGrouping.c's execTuplesMatch(),
>> TupleHashTableHash() (and unequal, but doubt that matters
>> performancewise). There's also nodeHash.c's ExecHashGetHashValue(),
>> but I think that'd possibly better fixed differently.
>
> The execGrouping.c functions don't have access to an expression tree
> instructing them which columns to pull out of the tuple, so I fail to
> see how get_last_attnums() would be of any use to them.

I presume most of the callers do. We'd have to change the API somewhat,
unless we just have a small loop in execTuplesMatch() determining the
biggest column index (which might be worthwhile / acceptable).
TupleHashTableHash() should be able to have that pre-computed in
BuildTupleHashTable(). Might be more viable to go that way.

> As for ExecHashGetHashValue, it's most likely going to be working from
> virtual tuples passed up to the join, which won't benefit from
> predetermination of the last column to be accessed. The
> tuple-deconstruction would have happened while projecting in the scan
> node below.

I think the physical tuple stuff commonly thwarts that argument? On
master for tpch's Q5 you can e.g. see the following profile (master):

+   29.38%  postgres  postgres  [.] ExecScanHashBucket
+   16.72%  postgres  postgres  [.] slot_getattr
+    5.51%  postgres  postgres  [.] heap_getnext
-    5.50%  postgres  postgres  [.] slot_deform_tuple
   - 98.07% slot_deform_tuple
      - 85.98% slot_getattr
         - 96.59% ExecHashGetHashValue
            - ExecHashJoin
               - ExecProcNode
                  + 85.12% ExecHashJoin
                  + 14.88% MultiExecHash
         + 3.41% ExecMakeFunctionResultNoSets
      + 14.02% slot_getsomeattrs
   + 1.58% ExecEvalScalarVarFast

I.e. nearly all calls to slot_deform_tuple are from slot_getattr in
ExecHashGetHashValue(). And nearly all the time in slot_getattr is
spent on code only executed for actual tuples:

       │     if (tuple == NULL)            /* internal error */
  0.18 │       test   %rax,%rax
       │     ↓ je     223
       │      *
       │      * (We have to check this separately because of various inheritance and
       │      * table-alteration scenarios: the tuple could be either longer or shorter
       │      * than the tupdesc.)
       │      */
       │     tup = tuple->t_data;
  0.47 │       mov    0x10(%rax),%rsi
       │     if (attnum > HeapTupleHeaderGetNatts(tup))
 75.42 │       movzwl 0x12(%rsi),%eax
  0.70 │       and    $0x7ff,%eax
  0.47 │       cmp    %eax,%ebx
       │     ↓ jg     e8

- Andres
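To illustrate the alternative being floated for execGrouping.c - precompute the highest needed column once (e.g. in BuildTupleHashTable()) and deform in one go - here is a sketch; fetch_grouping_cols and last_attnum are made-up names for illustration, while slot_getsomeattrs() and keyColIdx follow the existing executor conventions:

	/*
	 * Sketch: deform the tuple once, up to the highest attribute any of
	 * the grouping columns needs, instead of letting each slot_getattr()
	 * call potentially re-walk the physical tuple.
	 */
	static void
	fetch_grouping_cols(TupleTableSlot *slot, int numCols,
						AttrNumber *keyColIdx, int last_attnum,
						Datum *values, bool *isnull)
	{
		int			i;

		/* one pass over the tuple, filling tts_values/tts_isnull */
		slot_getsomeattrs(slot, last_attnum);

		for (i = 0; i < numCols; i++)
		{
			int			att = keyColIdx[i] - 1;	/* attnums are 1-based */

			values[i] = slot->tts_values[att];
			isnull[i] = slot->tts_isnull[att];
		}
	}

This is exactly the trade the profile above motivates: the expensive per-call path in slot_getattr (the HeapTupleHeaderGetNatts check and friends) runs once per tuple rather than once per column.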
Andres Freund <andres@anarazel.de> writes: > On 2017-03-15 16:07:14 -0400, Tom Lane wrote: >> As for ExecHashGetHashValue, it's most likely going to be working from >> virtual tuples passed up to the join, which won't benefit from >> predetermination of the last column to be accessed. The >> tuple-deconstruction would have happened while projecting in the scan >> node below. > I think the physical tuple stuff commonly thwarts that argument? On > master for tpch's Q5 you can e.g. see the following profile (master): Hmmm ... I think you're mistaken in fingering the physical-tuple optimization per se, but maybe skipping ExecProject at the scan level would cause this result? I've thought for some time that it was dumb to have the executor reverse-engineering this info at plan startup anyway. We could make the planner mark each table scan node with the highest column number that the plan will access, and use that to drive a slot_getsomeattrs call in advance of any access to tuple contents. regards, tom lane
On 2017-03-15 17:33:46 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2017-03-15 16:07:14 -0400, Tom Lane wrote: > >> As for ExecHashGetHashValue, it's most likely going to be working from > >> virtual tuples passed up to the join, which won't benefit from > >> predetermination of the last column to be accessed. The > >> tuple-deconstruction would have happened while projecting in the scan > >> node below. > > > I think the physical tuple stuff commonly thwarts that argument? On > > master for tpch's Q5 you can e.g. see the following profile (master): > > Hmmm ... I think you're mistaken in fingering the physical-tuple > optimization per se, but maybe skipping ExecProject at the scan level > would cause this result? I think those are often related (i.e. we replace a smaller targetlist with a "physical" one, which then allows skipping ExecProject()). > I've thought for some time that it was dumb to have the executor > reverse-engineering this info at plan startup anyway. Yea, it'd be good if this (and some similar tasks like building interim tuple descriptors) could be moved to the planner. But: > We could make the planner mark each table scan node with the highest > column number that the plan will access, and use that to drive a > slot_getsomeattrs call in advance of any access to tuple contents. probably isn't sufficient - we build non-virtual tuples in a good number of places (sorts, tuplestore using stuff like nodeMaterial, nodeHash.c output, ...). I suspect it'd have measurable negative consequences if we removed the deforming logic for all expressions/projections above such nodes. I guess we could just do such a logic for every Plan node? - Andres
Andres Freund <andres@anarazel.de> writes: > On 2017-03-15 17:33:46 -0400, Tom Lane wrote: >> We could make the planner mark each table scan node with the highest >> column number that the plan will access, and use that to drive a >> slot_getsomeattrs call in advance of any access to tuple contents. > probably isn't sufficient - we build non-virtual tuples in a good number > of places (sorts, tuplestore using stuff like nodeMaterial, nodeHash.c > output, ...). I suspect it'd have measurable negative consequences if > we removed the deforming logic for all expressions/projections above > such nodes. I guess we could just do such a logic for every Plan node? [ scratches head... ] What deforming logic do you think I'm proposing removing? regards, tom lane
On 2017-03-15 18:16:57 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2017-03-15 17:33:46 -0400, Tom Lane wrote: > >> We could make the planner mark each table scan node with the highest > >> column number that the plan will access, and use that to drive a > >> slot_getsomeattrs call in advance of any access to tuple contents. > > > probably isn't sufficient - we build non-virtual tuples in a good number > > of places (sorts, tuplestore using stuff like nodeMaterial, nodeHash.c > > output, ...). I suspect it'd have measurable negative consequences if > > we removed the deforming logic for all expressions/projections above > > such nodes. I guess we could just do such a logic for every Plan node? > > [ scratches head... ] What deforming logic do you think I'm proposing > removing? I thought you were suggesting that we don't do the get_last_attnums (and inlined version in the isSimpleVar case) at execution time anymore, instead relying on logic in the planner to know how much to deform ahead of time. Then we'd do slot_getsomeattrs in the appropriate places. But I understood you suggesting to do so only in scan nodes - which doesn't seem sufficient, due to the use of materialized / minimal tuples in other types of nodes. Did I misunderstand? - Andres
Andres Freund <andres@anarazel.de> writes: > On 2017-03-15 18:16:57 -0400, Tom Lane wrote: >> [ scratches head... ] What deforming logic do you think I'm proposing >> removing? > I thought you were suggesting that we don't do the get_last_attnums (and > inlined version in the isSimpleVar case) at execution time anymore, > instead relying on logic in the planner to know how much to deform ahead > of time. Then we'd do slot_getsomeattrs in the appropriate places. But > I understood you suggesting to do so only in scan nodes - which doesn't > seem sufficient, due to the use of materialized / minimal tuples in > other types of nodes. Did I misunderstand? We would need to do it anywhere that we might be projecting from a materialized tuple, I suppose. From the planner's standpoint it would be about as easy to do this for all plan nodes as only selected ones. Anyway the core point here seems to be that skipping ExecProject misses a bet because the underlying tuple doesn't get disassembled. A quick-hack way of seeing if this helps might be to do slot_getallattrs in the places where we skip ExecProject. I'm not sure we'd want that as a permanent solution, because it's possible that it extracts columns not actually needed, but it should at least move the hotspot in your example case. regards, tom lane
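As a concrete illustration, the quick hack might amount to something like this at each site that skips projection (a hypothetical sketch, not the patch's code; the ps_ProjInfo test stands in for however the skip is detected, and slot_getallattrs() is the existing helper):

    /*
     * Hypothetical sketch of the quick hack: when the physical-tuple
     * optimization leaves us without an ExecProject call, deform the
     * whole tuple eagerly, so later slot_getattr() calls never have to
     * touch the on-disk tuple.
     */
    if (node->ps.ps_ProjInfo == NULL)
        slot_getallattrs(slot);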
Andres Freund <andres@anarazel.de> writes: > [ new patches ] I've started to look at 0004, and the first conclusion I've come to is that it's *direly* short of documentation. To the point that I'd vote against committing it if something isn't done about that. As an example, it's quite unclear how ExprEvalSteps acquire correct resnull/resvalue pointers, and the algorithm for that seems nontrivial. It doesn't help any that the arguments of ExecInitExprRec are entirely undocumented. I think it would be worth creating a README file giving an overview of how all of this patch is supposed to work. You also need to do a whole lot more work on the function-level comments. A specific thing I noticed in the particular area of what-gets-returned-where is this bit in EEOP_SCALARARRAYOP setup:

+        /*
+         * Evaluate array argument into our return value, overwrite
+         * with comparison results afterwards.
+         */
+        ExecInitExprRec((Expr *) lsecond(opexpr->args), parent, state,
+                        resv, resnull);

That scares me quite a bit, because it smells exactly like the sort of premature optimization that bit us on the rear in CVE-2016-5423 (cf commit f0c7b789a). What's it really buying us to overwrite the return value early rather than storing into the fcinfo's second argument slot? (The memory of that CVE is part of what's prompting me to demand a clear explanation of the algorithm for deciding where steps return their results. Even if this particular code is safe, somebody is going to do something unsafe in future if there's not a documented rule to follow.) Another thing that ties into the do-I-understand-this-at-all question is this bit:

+        EEO_CASE(EEOP_BOOL_AND_STEP_FIRST)
+        {
+            *op->d.boolexpr.anynull = false;
+
+            /*
+             * Fallthrough (can't be last - ANDs have two arguments at least).
+             */
+        }
+
+        EEO_CASE(EEOP_BOOL_AND_STEP)

It seems like this is missing an "op++;" before falling through. If it isn't, because really BOOL_AND_STEP_FIRST is defined as clearing anynull and then also doing a regular BOOL_AND_STEP, then the comment seems rather off point. It should be more like "Fall through to do regular AND step processing as well". The place where the comment would be on point is probably over here:

+            case AND_EXPR:
+                /*
+                 * ANDs have at least two arguments, so that
+                 * no step needs to be both FIRST and LAST.
+                 */
+                Assert(list_length(boolexpr->args) >= 2);
+
+                if (off == 0)
+                    scratch.opcode = EEOP_BOOL_AND_STEP_FIRST;
+                else if (off + 1 == nargs)
+                    scratch.opcode = EEOP_BOOL_AND_STEP_LAST;
+                else
+                    scratch.opcode = EEOP_BOOL_AND_STEP;
+                break;

although I think the Assert ought to be examining nargs not list_length(boolexpr->args) so that it has some visible connection to the code after it. (This all applies to OR processing as well, of course.) BTW, it sure seems like ExecInitExprRec and related code ought to set off all sorts of valgrind complaints? It's not being careful at all to ensure that all fields of the "scratch" record get initialized before we memcpy it to someplace. regards, tom lane
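(For readers without the patch at hand: the EEO_CASE machinery under review is a threaded, opcode-dispatching interpreter. A self-contained toy version of the technique - all names invented for the demo, using GCC/Clang computed gotos, not the patch's code - looks like this:)

    #include <stdio.h>

    typedef enum { OP_CONST, OP_ADD, OP_DONE } OpCode;

    typedef struct { OpCode op; int arg; } Step;

    static int
    execute(const Step *steps)
    {
        /* one dispatch-table entry per opcode, in enum order */
        static const void *dispatch[] = {
            &&case_const, &&case_add, &&case_done
        };
        const Step *ip = steps;
        int         acc = 0;

    #define NEXT() goto *dispatch[(++ip)->op]

        goto *dispatch[ip->op];

    case_const:
        acc = ip->arg;
        NEXT();
    case_add:
        acc += ip->arg;
        NEXT();
    case_done:
        return acc;
    }

    int
    main(void)
    {
        Step program[] = { {OP_CONST, 40}, {OP_ADD, 2}, {OP_DONE, 0} };

        printf("%d\n", execute(program));   /* prints 42 */
        return 0;
    }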
On 2017-03-15 18:48:28 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2017-03-15 18:16:57 -0400, Tom Lane wrote: > >> [ scratches head... ] What deforming logic do you think I'm proposing > >> removing? > > > I thought you were suggesting that we don't do the get_last_attnums (and > > inlined version in the isSimpleVar case) at execution time anymore, > > instead relying on logic in the planner to know how much to deform ahead > > of time. Then we'd do slot_getsomeattrs in the appropriate places. But > > I understood you suggesting to do so only in scan nodes - which doesn't > > seem sufficient, due to the use of materialized / minimal tuples in > > other types of nodes. Did I misunderstand? > > We would need to do it anywhere that we might be projecting from a > materialized tuple, I suppose. From the planner's standpoint it would be > about as easy to do this for all plan nodes as only selected ones. Yea. > Anyway the core point here seems to be that skipping ExecProject misses a > bet because the underlying tuple doesn't get disassembled. A quick-hack > way of seeing if this helps might be to do slot_getallattrs in the places > where we skip ExecProject. > I'm not sure we'd want that as a permanent > solution, because it's possible that it extracts columns not actually > needed, but it should at least move the hotspot in your example case. Hm, that could hurt pretty badly if we actually only access the first few columns. I wonder if we shouldn't essentially get rid of all slot_getattr() calls, and replace them with one slot_getsomeattrs() and then direct tts_values/nulls access. Where we compute the column number is then essentially a separate discussion. Looking around, most of them seem to access multiple columns, and several of them probably can't conveniently be done via the planner:

src/backend/catalog/partition.c
1635:  datum = slot_getattr(slot, keycol, &isNull);
Accesses all the partitioned columns and has a convenient location to compute the column to deform to (RelationGetPartitionDispatchInfo).

src/backend/catalog/index.c
1797:  iDatum = slot_getattr(slot, keycol, &isNull);
Accesses all the indexed columns and has a convenient location to compute the column to deform to (BuildIndexInfo).

src/backend/utils/adt/orderedsetaggs.c
1196:  Datum d = slot_getattr(slot, nargs + 1, &isnull);
1359:  Datum d = slot_getattr(slot, nargs + 1, &isnull);
No benefit.

src/backend/executor/nodeMergeAppend.c
246:   datum1 = slot_getattr(s1, attno, &isNull1);
247:   datum2 = slot_getattr(s2, attno, &isNull2);
Accesses all columns in the sort key, convenient place to prepare (ExecInitMergeAppend).

src/backend/executor/execQual.c
594:   * caught inside slot_getattr). What we have to check for here is the
600:   * Note: we allow a reference to a dropped attribute. slot_getattr will
637:   return slot_getattr(slot, attnum, isNull);
676:   return slot_getattr(slot, attnum, isNull);
1681:  return slot_getattr(fcache->funcResultSlot, 1, isNull);
Already converted in the proposed expression evaluation patch.

src/backend/executor/nodeSubplan.c
367:   dvalue = slot_getattr(slot, 1, &disnull);
395:   prmdata->value = slot_getattr(slot, col, &(prmdata->isnull));
565:   prmdata->value = slot_getattr(slot, col,
1015:  dvalue = slot_getattr(slot, 1, &disnull);
Accesses all params, convenient location to compute column number (ExecInitSubPlan).
src/backend/executor/execCurrent.c
186:   /* Use slot_getattr to catch any possible mistakes */
188:   DatumGetObjectId(slot_getattr(scanstate->ss_ScanTupleSlot,
193:   DatumGetPointer(slot_getattr(scanstate->ss_ScanTupleSlot,
Accesses only system columns.

src/backend/executor/nodeSetOp.c
107:   flag = DatumGetInt32(slot_getattr(inputslot,
Not an actual column.

src/backend/executor/nodeGatherMerge.c
678:   datum1 = slot_getattr(s1, attno, &isNull1);
679:   datum2 = slot_getattr(s2, attno, &isNull2);
Same as mergeAppend (huh, why did we copy this code), i.e. beneficial.

src/backend/executor/functions.c
970:   value = slot_getattr(slot, 1, &(fcinfo->isnull));
No benefit.

src/backend/executor/execJunk.c
253:   return slot_getattr(slot, attno, isNull);
No benefit.

src/backend/executor/nodeNestloop.c
136:   prm->value = slot_getattr(outerTupleSlot,
Accesses all future param values, convenient place to compute column number.

src/backend/executor/execGrouping.c
100:   attr1 = slot_getattr(slot1, att, &isNull1);
102:   attr2 = slot_getattr(slot2, att, &isNull2);
170:   attr1 = slot_getattr(slot1, att, &isNull1);
175:   attr2 = slot_getattr(slot2, att, &isNull2);
501:   attr = slot_getattr(slot, att, &isNull);
Accesses all grouped-on values, but only some callers have a convenient place to compute the column number.

src/backend/access/common/printtup.c
545:   attr = slot_getattr(slot, i + 1, &isnull);
Debug routine.

contrib/postgres_fdw/postgres_fdw.c
3213:  value = slot_getattr(slot, attnum, &isnull);
Computes all parameter values, convenient-ish place to compute column number.

- Andres
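The proposed replacement pattern would look roughly like this at each such call site (a hypothetical sketch; maxatt and keycols stand in for state that would be computed once at node initialization):

    int     i;

    /* deform once, up to the largest attribute any key needs */
    slot_getsomeattrs(slot, maxatt);

    for (i = 0; i < nkeys; i++)
    {
        int     att = keycols[i];       /* 1-based attribute number */
        Datum   value = slot->tts_values[att - 1];
        bool    isnull = slot->tts_isnull[att - 1];

        /* ... hash or compare value, as the caller did before ... */
    }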
Hi, On 2017-03-15 20:09:03 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > [ new patches ] > > I've started to look at 0004, and the first conclusion I've come to > is that it's *direly* short of documentation. To the point that I'd > vote against committing it if something isn't done about that. Yea, I asked for input about what's hard to understand and what's not - I stared at this for a *lot* of time, and it all kinda looks easy-ish now. I'm more than willing to expand on the below, and other pieces. > As an example, it's quite unclear how ExprEvalSteps acquire correct > resnull/resvalue pointers, and the algorithm for that seems nontrivial. > It doesn't help any that the arguments of ExecInitExprRec are entirely > undocumented. Generally whatever wants the result of a (sub-)expression passes in the desired resvalue/resnull. E.g. when doing a function call the individual arguments are each prepared for evaluation using ExecInitExprRec() and resvalue/resnull are pointing into fcinfo's arg/nulls[i]. > I think it would be worth creating a README file giving an overview > of how all of this patch is supposed to work. You also need to do a > whole lot more work on the function-level comments. Ok. > A specific thing I noticed in the particular area of > what-gets-returned-where is this bit in EEOP_SCALARARRAYOP setup:
>
> +        /*
> +         * Evaluate array argument into our return value, overwrite
> +         * with comparison results afterwards.
> +         */
> +        ExecInitExprRec((Expr *) lsecond(opexpr->args), parent, state,
> +                        resv, resnull);
>
> That scares me quite a bit, because it smells exactly like the sort of > premature optimization that bit us on the rear in CVE-2016-5423 (cf commit > f0c7b789a). I don't think there's a danger similar to f0c7b789a here, because the "caller" (i.e. the node that needs the expression's result) expects resvalue/null to be overwritten. It'll e.g. be the value "slot" of one arm (is there a better name for one part of a boolean expression?) of a boolean expression. > What's it really buying us to overwrite the return value > early rather than storing into the fcinfo's second argument slot? That'd work just as well. > (The memory of that CVE is part of what's prompting me to demand a clear > explanation of the algorithm for deciding where steps return their > results. Even if this particular code is safe, somebody is going to do > something unsafe in future if there's not a documented rule to follow.) I don't think there's a danger here, but I think you more generally have a point. > Another thing that ties into the do-I-understand-this-at-all question > is this bit:
>
> +        EEO_CASE(EEOP_BOOL_AND_STEP_FIRST)
> +        {
> +            *op->d.boolexpr.anynull = false;
> +
> +            /*
> +             * Fallthrough (can't be last - ANDs have two arguments at least).
> +             */
> +        }
> +
> +        EEO_CASE(EEOP_BOOL_AND_STEP)
>
> It seems like this is missing an "op++;" before falling through. If it > isn't, because really BOOL_AND_STEP_FIRST is defined as clearing anynull > and then also doing a regular BOOL_AND_STEP, then the comment seems rather > off point. It's intended to fall through this way, i.e. the difference between _FIRST and not is just that only the former clears anynull. What the comment is about, admittedly too cryptically, is that the _LAST step that then evaluates anynull cannot be the same step as EEOP_BOOL_AND_STEP_FIRST, because bool AND/OR always has at least two "arms". Will expand / move.
> BTW, it sure seems like ExecInitExprRec and related code ought to set > off all sorts of valgrind complaints? It's not being careful at all > to ensure that all fields of the "scratch" record get initialized before > we memcpy it to someplace. It worked not long ago - valgrind's replacement memcpy() doesn't trigger undefined memory warnings, it just copies the "definedness" of each byte (or bit?). But your point gives me an idea: It seems like a good idea to VALGRIND_MAKE_MEM_UNDEFINED() the "scratch" step at some convenient places, so the definedness of individual operations is more useful. Thanks for the look! - Andres
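Concretely, that idea might look like this in the expression-compile path (a hypothetical sketch; VALGRIND_MAKE_MEM_UNDEFINED is the existing wrapper from utils/memdebug.h, the surrounding names are illustrative):

    ExprEvalStep scratch;

    /*
     * Poison the scratch step before filling it in for each opcode, so
     * valgrind flags any field that later gets copied into the steps
     * array without having been set.
     */
    VALGRIND_MAKE_MEM_UNDEFINED(&scratch, sizeof(scratch));

    scratch.opcode = EEOP_CONST;
    scratch.d.constval.value = con->constvalue;
    scratch.d.constval.isnull = con->constisnull;
    /* ... the memcpy of scratch into the steps array happens afterwards ... */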
Hi, I got a test failure with this version of the patch in postgres_fdw. It looks to me like it was caused by a typo in the source code, which is fixed in the attached patch. After applying this patch check-world passes. Andreas
Andres Freund <andres@anarazel.de> writes: > On 2017-03-15 20:09:03 -0400, Tom Lane wrote: >> That scares me quite a bit, because it smells exactly like the sort of >> premature optimization that bit us on the rear in CVE-2016-5423 (cf commit >> f0c7b789a). > I don't think there's a danger similar to f0c7b789a here, because the > "caller" (i.e. the node that needs the expression's result) expects > resvalue/null to be overwritten. Yeah, that's what I thought when I wrote the broken code in ExecEvalCase, too. It was wrong. Basically you've got to be sure that no aliasing can occur, and I think the only way to be safe about that is to have a very clear rule about where results are allowed to get returned to, preferably one that doesn't ever re-use the same target. (I think the repeated use of the same subexpression result address for the arms of an AND or OR is okay, but it would be a good idea to have a clear statement of why.) The thing that actually made the ExecEvalCase code into a bug was that we were using ExprContext-level fields to store the current caseValue, allowing aliasing to occur across nested CASEs. I think that in this implementation, it ought to be possible to get rid of ExprContext.caseValue_datum et al altogether, in favor of some storage location that's private to each CASE expression. I'm a bit disappointed that that doesn't seem to have happened. Eventually, I would also like to find a way to remove the restriction imposed by the other part of f0c7b789a, ie that we can't inline a SQL function when that would result in intermixing two levels of CASE expression. An implementation along the lines of what I've sketched above could handle that easily enough, as long as we could identify which nested level of CASE a particular CaseTestExpr belongs to. I don't know how to do that offhand, but it seems like it ought to be soluble if we put a bit of thought into it. regards, tom lane
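A minimal sketch of that private-storage idea (all names invented here, not the patch's code): the compile step for a CASE allocates a value/isnull pair owned by that CASE alone, and the contained CaseTestExprs are compiled to read from exactly those addresses, so nested CASEs cannot alias each other.

    Datum  *caseval = palloc(sizeof(Datum));
    bool   *casenull = palloc(sizeof(bool));

    /* evaluate the CASE's test expression into the private pair */
    ExecInitExprRec(caseExpr->arg, parent, state, caseval, casenull);

    /*
     * ... and point the CaseTestExpr steps inside the WHEN clauses at
     * caseval/casenull, instead of at ExprContext-level fields ...
     */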
I wrote: > Andres Freund <andres@anarazel.de> writes: >> I don't think there's a danger similar to f0c7b789a here, because the >> "caller" (i.e. the node that needs the expression's result) expects >> resvalue/null to be overwritten. > Yeah, that's what I thought when I wrote the broken code in ExecEvalCase, > too. It was wrong. Along the same line, I notice that you've got some expr step types overwriting their own input, the various flavors of EEOP_BOOLTEST for example. Maybe that's all right but it doesn't really give me a warm feeling, especially when other single-argument operations like EEOP_BOOL_NOT_STEP are done differently. Again, I think a clear explanation of the design is essential to allow people to reason about whether this sort of trickery is safe. regards, tom lane
Andres Freund <andres@anarazel.de> writes: > [ latest patches ] I looked through 0001 (the composite-type-dependencies one). Although I agree that it'd be good to tighten things up in that area, I do not think we want this specific patch: it tightens things too much. Consider this variant of the existing test case in create_view.sql:

    create table tt (f1 int, f2 int, f3 int);
    create function ttf() returns setof tt as 'select * from tt' language sql;
    create view vv as select f3 from ttf();
    alter table tt drop column f3;

Current code allows the above (and causes the view to return nulls afterwards). The 0001 patch would forbid the DROP, which is fine, but it would also forbid dropping either of the other two table columns, which I think is insupportable. Also, given the above table and function, consider

    create view vvv as select ttf();

This view returns a 3-column composite type. Now do

    alter table tt add column f4 int;

Now the view returns a 4-column composite type. But at this point the patch will let you drop the f4 column, but not any of the earlier three. That's just weird. So I'm unhappy with the specific decisions made in 0001. I think what we really want there, probably, is for find_expr_references_walker to do more than nothing with Vars referencing non-RELATION RTEs. Having said all that, I think that 0001 is contributing very little to the goals of this patch set. Andres stated that he wanted it so as to drop some of the one-time checks that execQual.c currently does for Vars, but I'm not really convinced that we could do that safely even with these additional dependencies in place. Moreover, I really doubt that there's a horrible performance cost from including something like

    if (unlikely(op->first_execution))
        out_of_line_checking_subroutine(...);

in the execution code for Vars. And that certainly isn't adding any complexity for JIT compilation that we don't face anyway for other execution step types. So my recommendation is to drop 0001 and include the same one-time checks that execQual.c currently has as out-of-line one-time checks in the new code. We can revisit that later, but time grows short for v10. I would much rather have a solid version of 0004 and not 0001, than not have anything for v10 because we spent too much time dealing with adding new dependencies. regards, tom lane
Hi, On 2017-03-17 11:36:30 -0400, Tom Lane wrote: > Having said all that, I think that 0001 is contributing very little to the > goals of this patch set. Andres stated that he wanted it so as to drop > some of the one-time checks that execQual.c currently does for Vars, but > I'm not really convinced that we could do that safely even with these > additional dependencies in place. Moreover, I really doubt that there's > a horrible performance cost from including something like
>
>     if (unlikely(op->first_execution))
>         out_of_line_checking_subroutine(...);
>
> in the execution code for Vars. But it actually does (well, for some relatively small value of horrible). The issue is that op->first_execution is actually badly predictable, because it will be set back/forth between executions of different expressions (in the same plantree). Obviously you'll not notice if you have a Var and then some expensive stuff, but it's noticeable for cheap-ish expressions (say e.g. a single Var). So the branch prediction often doesn't handle this gracefully - it also just expands the set of to-be-tracked jumps. If we had a decent way to actually check this during ExecInitExpr() (et al), the whole discussion would be different - I'd be all for expanding the set of such checks even. But I couldn't find a decent way to get there - when expressions are initialized we don't even get an ExprContext (not to speak of valid slots), nor is parent->plan very helpful. That said, it seems this is something that has to wait for a later release; I'm putting back in similar logic as there was before (not a branch, but changing the opcode to a non-checking variant). > And that certainly isn't adding any > complexity for JIT compilation that we don't face anyway for other > execution step types. Obviously an if (op->first_execution) isn't an issue; it's doing something only the first time through that's not easily possible. > So my recommendation is to drop 0001 and include the same one-time > checks that execQual.c currently has as out-of-line one-time checks > in the new code. We can revisit that later, but time grows short for > v10. I would much rather have a solid version of 0004 and not 0001, > than not have anything for v10 because we spent too much time dealing > with adding new dependencies. Doing that (+README). Greetings, Andres Freund
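The opcode-changing variant could look roughly like this in the interpreter (a sketch in the patch's style; the exact names are guesses): the checking step runs once, validates the slot, then rewrites its own opcode so every later execution takes the branch-free path.

    EEO_CASE(EEOP_INNER_VAR_FIRST)
    {
        /* first execution only: check that the slot still matches the Var */
        CheckVarSlotCompatibility(innerslot,
                                  op->d.var.attnum + 1,
                                  op->d.var.vartype);

        /* rewrite own opcode, so later executions skip the check */
        op->opcode = EEO_OPCODE(EEOP_INNER_VAR);

        /* FALL THROUGH to the non-checking variant */
    }

    EEO_CASE(EEOP_INNER_VAR)
    {
        *op->resvalue = innerslot->tts_values[op->d.var.attnum];
        *op->resnull = innerslot->tts_isnull[op->d.var.attnum];
        EEO_NEXT();
    }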
Andres Freund <andres@anarazel.de> writes: > That said, it seems this is something that has to wait for a later > release, I'm putting back in similar logic as there was before (not a > branch, but change the opcode to a non-checking variant). Yeah, I was wondering if changing the opcode would be preferable to a first-time flag. >> So my recommendation is to drop 0001 and include the same one-time >> checks that execQual.c currently has as out-of-line one-time checks >> in the new code. We can revisit that later, but time grows short for >> v10. I would much rather have a solid version of 0004 and not 0001, >> than not have anything for v10 because we spent too much time dealing >> with adding new dependencies. > Doing that (+README). OK. I believe that we can get this committed after the documentation problems are sorted. I noticed a lot of small things that bugged me, mostly sloppy comments, but I think that the most efficient way to handle those is for me to make an editorial pass on your next version. regards, tom lane
Hi, On 2017-03-16 11:15:16 -0400, Tom Lane wrote: > The thing that actually made the ExecEvalCase code into a bug was that > we were using ExprContext-level fields to store the current caseValue, > allowing aliasing to occur across nested CASEs. I think that in > this implementation, it ought to be possible to get rid of > ExprContext.caseValue_datum et al altogether, in favor of some storage > location that's private to each CASE expression. I'm a bit disappointed > that that doesn't seem to have happened. The patch actually does so - during ExecInitExprRec the "relevant" case/domain testval is stored in ExprState->innermost_*; those pointers are then stored directly in the relevant steps for CaseTest/CoerceToDomainValue evaluation. Unfortunately CaseTest/CoerceToDomainValue are reused outside of domain / case expressions in a bunch of places (plpgsql uses CaseTest for cast evaluation, validateDomainConstraint/domain_check_input evaluate domain constraints without a CoerceToDomain node). I.e. ExprContext.caseValue_datum etc. aren't used for normal expressions anymore, just for the ad-hoc hackery in a bunch of places. I'd like to get rid of those usages, but that'd recurse into rewriting plpgsql casts and other random pieces of code into a different approach - something I'd like to avoid doing at the same time as this already large patch. I've been pondering if we can't entirely get rid of CaseTest etc; the amount of hackery required doesn't seem like a good thing. One way I'd prototyped was to replace them with PARAM_EXEC nodes - then the whole issue of them potentially having different values at different parts of an expression vanishes, because the aliasing is removed. > Eventually, I would also like to find a way to remove the restriction > imposed by the other part of f0c7b789a, ie that we can't inline a SQL > function when that would result in intermixing two levels of CASE > expression. An implementation along the lines of what I've sketched > above could handle that easily enough, as long as we could identify > which nested level of CASE a particular CaseTestExpr belongs to. > I don't know how to do that offhand, but it seems like it ought to be > soluble if we put a bit of thought into it. I haven't thought overly much about this, but I agree, it looks like it should be doable. That seems to suggest my PARAM_EXEC idea isn't necessarily perfect - inlining would cause aliasing again, but it'd also not be hard to fix that up. - Andres
Andres Freund <andres@anarazel.de> writes: > On 2017-03-16 11:15:16 -0400, Tom Lane wrote: >> The thing that actually made the ExecEvalCase code into a bug was that >> we were using ExprContext-level fields to store the current caseValue, >> allowing aliasing to occur across nested CASEs. I think that in >> this implementation, it ought to be possible to get rid of >> ExprContext.caseValue_datum et al altogether, in favor of some storage >> location that's private to each CASE expression. I'm a bit disappointed >> that that doesn't seem to have happened. > ... Unfortunately > CaseTest/CoerceToDomainValue are reused outside of domain / case > expressions in a bunch of places (plpgsql uses CaseTest for casts > evaluation, validateDomainConstraint/domain_check_input evaluate domain > constraints without a CoerceToDomain node). Yeah, there's various stuff that did that for expediency. > I'd like to get rid of those usages, but that'd recurse into rewriting > plpgsql casts and other random pieces of code into a different approach > - something I'd like to avoid doing at the same as this already large > patch. Agreed, maybe we should just plan to clean that up later. > I've been pondering if we can't entirely get rid of CaseTest etc, the > amount of hackery required seems not like a good thing. One way I'd > prototyped was to replace them with PARAM_EXEC nodes - then the whole > issue of them potentially having different values at different parts of > an expression vanishes because the aliasing is removed. Yes, replacing all of that with Param slots had occurred to me too. We might want to keep the special parse node types for convenience in reverse-listing, but having them act just like PARAM_EXEC for execution purposes seems promising. >> Eventually, I would also like to find a way to remove the restriction >> imposed by the other part of f0c7b789a, ie that we can't inline a SQL >> function when that would result in intermixing two levels of CASE >> expression. > That seems to suggest my PARAM_EXEC idea isn't necessarily perfect - > inlining would cause aliasing again, but it'd also not hard to fix that > up. Right, inlining would probably require some parameter-renumbering to avoid aliasing. But at least we'd have a clear framework for how to handle it, whereas the way things are now, it's just impossible. regards, tom lane
Hi, On 2017-03-19 23:55:50 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > I've been pondering if we can't entirely get rid of CaseTest etc; the > > amount of hackery required doesn't seem like a good thing. One way I'd > > prototyped was to replace them with PARAM_EXEC nodes - then the whole > > issue of them potentially having different values at different parts of > > an expression vanishes, because the aliasing is removed. > > Yes, replacing all of that with Param slots had occurred to me too. > We might want to keep the special parse node types for convenience in > reverse-listing, but having them act just like PARAM_EXEC for execution > purposes seems promising. As long as that special parse-time node is part of the same value numbering, that makes sense (we could just make it a subtype of Param, a la PARAM_CASE). I don't think we actually do anything useful in ruleutils etc with either CaseTest or CoerceToDomainValue. - Andres
Hi, On 2017-03-15 20:09:03 -0400, Tom Lane wrote: > I think it would be worth creating a README file giving an overview > of how all of this patch is supposed to work. You also need to do a > whole lot more work on the function-level comments. I tried to improve upon both fronts. I've added the higher level explanation to executor/README, but I don't feel very strongly about that. I'm not quite sure it's exactly what you wanted, however; the above ask could also be understood to have more of a motivational angle, describing why and what exactly is changed? I'm also still not sure how understandable it is for anybody that hasn't had their head in this for a while... I've also, as discussed nearby, re-added code akin to the checks in ExecEvalScalarVar, with the discussed adjustment of the opcodes for later executions. Now that I've done that, I'm not sure I like that approach that much - another alternative would be to change an ExprState's evalfunc to ExecCheckAndExecute() after initialization. That'd have the advantage of working nicely for JIT. Either way, that can trivially be changed later. Additionally I added a regression test for the nearly entirely untested nodeTidscan.c, after I'd broken it previously without noticing (thanks Andreas). I started a run through valgrind, without complaints up to create_function_2. Can't run all of it (and contrib) right now without prematurely running out of power on a plane. Did I understand correctly that you'd rather just merge ExecGetLastAttnums into execExpr.c, instead of making it globally available? Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2017-03-15 20:09:03 -0400, Tom Lane wrote: >> I think it would be worth creating a README file giving an overview >> of how all of this patch is supposed to work. You also need to do a >> whole lot more work on the function-level comments. > I tried to improve upon both fronts. I've added the higher level > explanation to executor/README, but I don't feel very strong about that. > I'm not quite sure it's exactly what you wanted however, the above ask > could also be understood to have more of an motivational angle, > describing why and what exactly is changed? I'm also still not sure how > understandable it's for anybody that hasn't had their head in this for a > while... Well, I wasn't totally sure what was needed either. I'm coming to this relatively fresh, having paid little attention to the thread up to now, so maybe I'll try to add material to what you wrote as I figure things out. > Did I understand correctly that you'd rather just merge > ExecGetLastAttnums into execExpr.c, instead of making it globally > available? Yeah, just moving it over seems like the thing to do for now. We can expose it later if there proves to be an actual reason to do that, but as I mentioned, I'm doubtful that there will be one. I'll start reading these... regards, tom lane
Andres Freund <andres@anarazel.de> writes: > Additionally I added a regression test for the nearly entirely untested > nodeTidscan.c, after I'd broken it previously without noticing (thanks > Andreas). I went ahead and pushed this part, since it seemed pretty uncontroversial. I added a bit more stuff to get the LOC measurement up on the planner side too. regards, tom lane
... is there a reason why resultnum for EEOP_ASSIGN_* steps is declared size_t and not just int? Since it's an array index, and one that certainly can't be bigger than AttrNumber, that seems rather confusing. regards, tom lane
On 2017-03-20 16:06:27 -0400, Tom Lane wrote: > ... is there a reason why resultnum for EEOP_ASSIGN_* steps is declared > size_t and not just int? Since it's an array index, and one that > certainly can't be bigger than AttrNumber, that seems rather confusing. Not that I can see, no. I guess I might have "overcompensated" when changing it from AttrNumber - AttrNumber isn't a good idea because it needs an extra move/zero-extend, since 16bit indexing isn't that well supported on x86. But that doesn't mean it should be a 64bit number - to the contrary, actually. - Andres
Andres Freund <andres@anarazel.de> writes: > On 2017-03-20 16:06:27 -0400, Tom Lane wrote: >> ... is there a reason why resultnum for EEOP_ASSIGN_* steps is declared >> size_t and not just int? Since it's an array index, and one that >> certainly can't be bigger than AttrNumber, that seems rather confusing. > Not that I can see, no. I guess I might have "overcompensated" when > changing it from AttrNumber - AttrNumber isn't a good idea because that > needs an extra move-zero-extend, because 16bit indexing isn't that well > supported on x86. But that doesn't mean it should be a 64bit number - > to the contrary actually. OK, will fix in the edits I'm working on. regards, tom lane
I've been busily hacking away on this, trying to make things cleaner and fix a couple of bugs I stumbled across. (Confusion between ExecQual and ExecCheck, for instance - we apparently lack regression tests exercising constraints-returning-null in corner cases such as table rewrite.) It will probably be a day or two more before I'm done. A couple of naming questions:

* I concur with Heikki's dislike for the file name execInterpExpr.c. Would it make it any better to switch to execExprInterp.c? I think that having all this stuff under a common pattern execExpr*.c would be a good thing (and I notice you've already got one comment referring to them that way ...)

* execQual.c doesn't seem to have a unifying reason to exist anymore. It certainly has little to do with evaluating typical qual expressions; what's left in there seems to be mostly concerned with SRFs. I feel like it might be a good idea to rename it, but to what? execExprUtils.c perhaps? Or maybe we should destroy it altogether, shoving the SRF stuff into nodeFunctionscan.c and moving what little remains into execUtils.c.

* I do not like the function name ExecInstantiateExpr(). Webster's defines "instantiate" as "to represent (an abstraction) by a concrete instance", which does not seem to me to have a lot to do with what this function actually does. There's nothing very abstract about its input. More, the implication of "instantiate" is that you can instantiate any number of representatives of the same abstraction, but this scribbles on the input in a one-way fashion. I think perhaps something like ExecPrepareExpr or ExecFinishExpr or something along that line would be better, but nothing is really standing out as le mot juste.

Thoughts? regards, tom lane
On 2017-03-22 10:41:06 -0400, Tom Lane wrote: > I've been busily hacking away on this, trying to make things cleaner and > fix a couple of bugs I stumbled across. (Confusion between ExecQual and > ExecCheck, for instance - we apparently lack regression tests exercising > constraints-returning-null in corner cases such as table rewrite.) It > will probably be a day or two more before I'm done. Thanks! > A couple of naming questions: > > * I concur with Heikki's dislike for the file name execInterpExpr.c. > Would it make it any better to switch to execExprInterp.c? I think > that having all this stuff under a common pattern execExpr*.c would > be a good thing (and I notice you've already got one comment referring > to them that way ...) That works for me. > * execQual.c doesn't seem to have a unifying reason to exist anymore. > It certainly has little to do with evaluating typical qual expressions; > what's left in there seems to be mostly concerned with SRFs. I feel > like it might be a good idea to rename it, but to what? execExprUtils.c > perhaps? Or maybe we should destroy it altogether, shoving the SRF > stuff into nodeFunctionscan.c and moving what little remains into > execUtils.c. Yea, I was wondering about that too. What would we do with GetAttributeByName/Num? > * I do not like the function name ExecInstantiateExpr(). Webster's > defines "instantiate" as "to represent (an abstraction) by a concrete > instance", which does not seem to me to have a lot to do with what this > function actually does. It perhaps makes a *bit* more sense if you view it from the POV that, with the future JIT support (WIP versions of which I posted previously), it'd actually create a compiled function which'd largely be independent of ->steps (except for the non-hotpath functions, which'd still end up using it). So one of the above "concrete instances" would be the interpreted version, another the compiled one. No, not an entirely convincing argument. > There's nothing very abstract about its input. > More, the implication of "instantiate" is that you can instantiate any > number of representatives of the same abstraction, but this scribbles > on the input in a one-way fashion. I think perhaps something like > ExecPrepareExpr or ExecFinishExpr or something along that line would > be better, but nothing is really standing out as le mot juste. Either of those work, but they don't strike me as perfect either; I can't come up with something better (ExecReadyExprForExec()?). Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2017-03-22 10:41:06 -0400, Tom Lane wrote: >> * execQual.c doesn't seem to have a unifying reason to exist anymore. >> It certainly has little to do with evaluating typical qual expressions; >> what's left in there seems to be mostly concerned with SRFs. I feel >> like it might be a good idea to rename it, but to what? execExprUtils.c >> perhaps? Or maybe we should destroy it altogether, shoving the SRF >> stuff into nodeFunctionscan.c and moving what little remains into >> execUtils.c. > Yea, I was wondering about that too. What would we do with > GetAttributeByName/Num? I was thinking execUtils.c for those. >> * I do not like the function name ExecInstantiateExpr(). Webster's >> defines "instantiate" as "to represent (an abstraction) by a concrete >> instance", which does not seem to me to have a lot to do with what this >> function actually does. >> ... > Either of those work, but they don't strike me as perfect either, but I > can't come up with something better (ExecReadyExprForExec()?). Actually, ExecReadyExpr seems kind of nice for this (using "ready" in its verb sense, "to prepare (someone or something) for an activity or purpose"). regards, tom lane
BTW, I'm fairly concerned by what you did in nodeTidscan.c, ie delaying compile of the TID expressions until TidListCreate. I think that probably fails for cases involving, eg, subplans in the expressions; we need subplans to get linked to the parent node, and this way won't do it till (maybe) too late. Barring objection I'm going to rearrange that so that we still do the compile part at executor start. regards, tom lane
On 2017-03-22 13:15:41 -0400, Tom Lane wrote: > BTW, I'm fairly concerned by what you did in nodeTidscan.c, ie delaying > compile of the TID expressions until TidListCreate. I think that probably > fails for cases involving, eg, subplans in the expressions; we need > subplans to get linked to the parent node, and this way won't do it till > (maybe) too late. Barring objection I'm going to rearrange that so that > we still do the compile part at executor start. No objection here.
Looking at some of the coding choices you made here, I see that you're making a hard assumption that called functions never scribble on their fcinfo->arg/argnull arrays. I do not believe that we've ever had such an assumption before. Are we comfortable with that? If so, we'd better document it. regards, tom lane
Hi, On 2017-03-23 13:14:59 -0400, Tom Lane wrote: > Looking at some of the coding choices you made here, I see that you're > making a hard assumption that called functions never scribble on their > fcinfo->arg/argnull arrays. I do not believe that we've ever had such > an assumption before. I think we did that before, e.g. ExecEvalScalarArrayOp(). Think there's others too. > Are we comfortable with that? If so, we'd better document it. I think it's ok, but we indeed should document it. I recall a note somewhere... Can't find it anywhere however, might have misremembered a note about pass-by-ref arguments. fmgr/README? A note in FunctionCallInfoData's definition? Greetings, Andres Freund
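Wording along these lines (invented here, not committed text) might do for fmgr/README or the comments on FunctionCallInfoData:

    Callers are entitled to assume that a called function does not
    scribble on fcinfo->arg or fcinfo->argnull.  Expression evaluation
    relies on this: argument arrays are reused across calls without
    re-initializing entries whose values have not changed.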
Stylistic thought ... I am wondering if it wouldn't be a good idea to replace EEOP_CASE_WHEN_STEP, EEOP_CASE_THEN_STEP, EEOP_COALESCE, and EEOP_ARRAYREF_CHECKINPUT with instructions defined in a less usage-dependent way as

    EEOP_JUMP                unconditional jump
    EEOP_JUMP_IF_NULL        jump if step result is null
    EEOP_JUMP_IF_NOT_NULL    jump if step result isn't null
    EEOP_JUMP_IF_NOT_TRUE    jump if step result isn't TRUE

One could imagine later filling out this set with the other BoolTest condition types, but that seems to be all we need right now. These are basically just renamings of the step types that exist now, although EEOP_ARRAYREF_CHECKINPUT would have to drop its not-very-necessary Assert(!op->d.arrayref.state->isassignment). Well, I guess I should say that they're renamings of the semantics that I have for these steps in my working copy; for instance, I got rid of casewhen.value/casewhen.isnull in favor of letting CASE WHEN expressions evaluate into the CASE's final output variable. At least to me, the compiling code would be more readable this way. I find WHEN_STEP and THEN_STEP a bit odd because they are emitted after, not before, the expressions you'd think they control. ARRAYREF_CHECKINPUT is pretty vaguely named, too. regards, tom lane
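To illustrate how the generic jump steps would be used, here is a hypothetical sketch of compiling COALESCE with them (the helper and field names are guesses at the patch's conventions, not its actual code):

    List       *adjust_jumps = NIL;
    ListCell   *lc;

    foreach(lc, coalesce->args)
    {
        /* evaluate the argument directly into COALESCE's result slot */
        ExecInitExprRec((Expr *) lfirst(lc), parent, state, resv, resnull);

        /* if it produced a non-NULL value, we're done: jump to the end */
        scratch.opcode = EEOP_JUMP_IF_NOT_NULL;
        scratch.d.jump.jumpdone = -1;   /* target fixed up below */
        adjust_jumps = lappend_int(adjust_jumps, state->steps_len);
        ExprEvalPushStep(state, &scratch);
    }

    /* all jumps land here, just past the last argument's steps */
    foreach(lc, adjust_jumps)
    {
        ExprEvalStep *jump = &state->steps[lfirst_int(lc)];

        jump->d.jump.jumpdone = state->steps_len;
    }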
I found a rather nasty bug :-( ... the comment in EEOP_INNER_VAR_FIRST about

+            /*
+             * Can't assert tts_nvalid, as wholerow var evaluation or such
+             * could have materialized the slot - but the contents are still
+             * valid :/
+             */
+            Assert(op->d.var.attnum >= 0);

is actually averting its eyes from a potentially serious problem. If we go through ExecEvalWholeRowVar, which calls ExecFetchSlotTupleDatum, and ExecFetchSlotTuple decides it has to materialize the tuple, then ExecMaterializeSlot does this:

    /*
     * Drop the pin on the referenced buffer, if there is one.
     */
    if (BufferIsValid(slot->tts_buffer))
        ReleaseBuffer(slot->tts_buffer);

    slot->tts_buffer = InvalidBuffer;

    /*
     * Mark extracted state invalid.  This is important because the slot is
     * not supposed to depend any more on the previous external data; we
     * mustn't leave any dangling pass-by-reference datums in tts_values.
     * However, we have not actually invalidated any such datums, if there
     * happen to be any previously fetched from the slot.  (Note in particular
     * that we have not pfree'd tts_mintuple, if there is one.)
     */
    slot->tts_nvalid = 0;

The problem here is that once we drop the buffer pin, any pointers we may have into on-disk data are dangling pointers --- we're at risk of some other backend taking away that shared buffer. (So I'm afraid that the commentary there is overly optimistic.) So even though those pointers may still be there beyond tts_nvalid, subsequent references to them are very dangerous. I think that it's pretty hard to hit this in practice, maybe impossible, because the normal case for an "on-disk" tuple is that TTS_HAS_PHYSICAL_TUPLE is true, so that ExecFetchSlotTuple won't change the state of the slot. If we have a virtual tuple that has to be materialized, then by that very token it won't have a buffer pin to drop. But I find this fragile as heck, and the aforesaid patch comment surely isn't adequately documenting the safety considerations. Also, if there ever were a live bug here, reproducing it would be damn hard because of the low probability that a just-unpinned buffer would get replaced any time soon. (Hm, I wonder whether the buffer cache needs something analogous to the syscaches' CLOBBER_CACHE_ALWAYS behavior...) Besides which, I really really don't like the lack of an "attnum < tts_nvalid" assertion there; that's just obviously failing to check for very simple bugs, such as getting the FETCHSOME steps wrong. So I think that we have got to fix ExecEvalWholeRowVar so that it doesn't clobber the state of the slot. Right at the moment, the only way to do that seems to be to do this instead of ExecFetchSlotTupleDatum:

    tuple = ExecCopySlotTuple(slot);
    dtuple = (HeapTupleHeader)
        DatumGetPointer(heap_copy_tuple_as_datum(tuple,
                                                 slot->tts_tupleDescriptor));
    heap_freetuple(tuple);

That's kind of annoying because of the double copying involved, but tuptoaster.c doesn't expose any functionality for this except heap_copy_tuple_as_datum(). I figure we can improve it later --- it looks like we can refactor heap_copy_tuple_as_datum to expose a function that reads from Datum/isnull arrays, and then call it on the slot's tts_values/tts_isnull arrays. Seems like it might be a good idea to think a bit harder about the TupleTableSlot APIs, too, and reduce the number of cases where execTuples exposes destructive changes to the state of a slot. Also, while trying to test the above scenario, I realized that the patch as submitted was being way too cavalier about where it was applying CheckVarSlotCompatibility and so on.
The ASSIGN_FOO_VAR steps, for instance, had no protection at all. Think I have that all fixed up though. I hope to have a fully reviewed patch to pass back to you tomorrow. Or Saturday at the latest. regards, tom lane
Hi, On 2017-03-23 17:40:55 -0400, Tom Lane wrote: > Stylistic thought ... I am wondering if it wouldn't be a good idea > to replace EEOP_CASE_WHEN_STEP, EEOP_CASE_THEN_STEP, EEOP_COALESCE, > and EEOP_ARRAYREF_CHECKINPUT with instructions defined in a less > usage-dependent way as
>
>     EEOP_JUMP                unconditional jump
>     EEOP_JUMP_IF_NULL        jump if step result is null
>     EEOP_JUMP_IF_NOT_NULL    jump if step result isn't null
>     EEOP_JUMP_IF_NOT_TRUE    jump if step result isn't TRUE
>
> One could imagine later filling out this set with the other BoolTest > condition types, but that seems to be all we need right now. Hm, no arguments against, but I'm also not particularly excited about the change. > These are basically just renamings of the step types that exist now, > although EEOP_ARRAYREF_CHECKINPUT would have to drop its not-very- > necessary Assert(!op->d.arrayref.state->isassignment). I won't shed a tear about that assert's removal. > Well, I guess I should say that they're renamings of the semantics > that I have for these steps in my working copy; for instance, I got > rid of casewhen.value/casewhen.isnull in favor of letting CASE WHEN > expressions evaluate into the CASE's final output variable. That sounds like a sensible change (in the abstract; I obviously haven't seen your working copy). Greetings, Andres Freund
On 2017-03-23 20:36:32 -0400, Tom Lane wrote: > I found a rather nasty bug :-( ... the comment in EEOP_INNER_VAR_FIRST about
>
> +            /*
> +             * Can't assert tts_nvalid, as wholerow var evaluation or such
> +             * could have materialized the slot - but the contents are still
> +             * valid :/
> +             */
> +            Assert(op->d.var.attnum >= 0);
>
> is actually averting its eyes from a potentially serious problem. If we > go through ExecEvalWholeRowVar, which calls ExecFetchSlotTupleDatum, and > ExecFetchSlotTuple decides it has to materialize the tuple, then > ExecMaterializeSlot does this:
>
>     /*
>      * Drop the pin on the referenced buffer, if there is one.
>      */
>     if (BufferIsValid(slot->tts_buffer))
>         ReleaseBuffer(slot->tts_buffer);
>
>     slot->tts_buffer = InvalidBuffer;
>
>     /*
>      * Mark extracted state invalid.  This is important because the slot is
>      * not supposed to depend any more on the previous external data; we
>      * mustn't leave any dangling pass-by-reference datums in tts_values.
>      * However, we have not actually invalidated any such datums, if there
>      * happen to be any previously fetched from the slot.  (Note in particular
>      * that we have not pfree'd tts_mintuple, if there is one.)
>      */
>     slot->tts_nvalid = 0;
>
> The problem here is that once we drop the buffer pin, any pointers we may > have into on-disk data are dangling pointers --- we're at risk of some > other backend taking away that shared buffer. (So I'm afraid that the > commentary there is overly optimistic.) So even though those pointers > may still be there beyond tts_nvalid, subsequent references to them are > very dangerous. This applies to the code in master as well, no? An ExecEvalScalarVar() followed by an ExecEvalWholeRowVar() would have precisely the same effect? Do we need to do anything about this in the back-branches, given how unlikely this is going to be in practice? > So I think that we have got to fix ExecEvalWholeRowVar so that it doesn't > clobber the state of the slot. That seems like a good plan. > Also, while trying to test the above scenario, I realized that the patch > as submitted was being way too cavalier about where it was applying > CheckVarSlotCompatibility and so on. The ASSIGN_FOO_VAR steps, for > instance, had no protection at all. I don't think that's true - the assign checks had copied the code from the old ExecBuildProjectionInfo, setting isSimpleVar iff (!attr->attisdropped && variable->vartype == attr->atttypid) - we can check that for projections, in contrast to normal expressions, because we already know the slot. The relevant comment for that, from before the patch, is:

 * inputDesc.  (Note: if there is a type mismatch then ExecEvalScalarVar
 * will probably throw an error at runtime, but we leave that to it.)
 */

> I hope to have a fully reviewed patch to pass back to you tomorrow. > Or Saturday at the latest. Cool. Greetings, Andres Freund
On 2017-03-23 20:36:32 -0400, Tom Lane wrote: > So I think that we have got to fix ExecEvalWholeRowVar so that it doesn't > clobber the state of the slot. Right at the moment, the only way to do > that seems to be to do this instead of ExecFetchSlotTupleDatum: > > tuple = ExecCopySlotTuple(slot); > dtuple = (HeapTupleHeader) > DatumGetPointer(heap_copy_tuple_as_datum(tuple, > slot->tts_tupleDescriptor)); > heap_freetuple(tuple); Hm. One disadvantage would be that repeated whole-row references to the same table would be a bit slower, because we'd repeatedly form a tuple from a virtual one - but I have a hard time coming up with a scenario where that'd matter. I'd suspect that in the end it'd probably even have a *positive* performance impact, because right now the next scalar access will have to deform the whole tuple again, and that seems like a lot more likely scenario. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2017-03-23 20:36:32 -0400, Tom Lane wrote: >> The problem here is that once we drop the buffer pin, any pointers we may >> have into on-disk data are dangling pointers --- we're at risk of some >> other backend taking away that shared buffer. (So I'm afraid that the >> commentary there is overly optimistic.) So even though those pointers >> may still be there beyond tts_nvalid, subsequent references to them are >> very dangerous. > This applies to the code in master as well, no? An ExecEvalScalarVar() > followed by an ExecEvalWholeRowVar() would have precisely the same > effect? Yeah. The other order would be safe, because ExecEvalScalarVar would do slot_getattr which would re-extract the value from the newly materialized tuple. But there definitely seems to be a hazard for the order you mentioned. > Do we need to do anything about this in the back-branches, > given how unlikely this is going to be in practice? Probably not. As I mentioned, I think this may be only theoretical rather than real, if you believe that buffer pins would only be associated with slots holding references to regular tuples. And even if it's not theoretical, the odds of seeing a failure in the field seem pretty tiny given that a just-released buffer shouldn't be subject to recycling for a fair while. But I don't want to leave it like this going forward. >> So I think that we have got to fix ExecEvalWholeRowVar so that it doesn't >> clobber the state of the slot. > That seems like a good plan. Yeah. I have the stopgap code in my working copy, and will look at refactoring the tuptoaster code for better performance later. >> Also, while trying to test the above scenario, I realized that the patch >> as submitted was being way too cavalier about where it was applying >> CheckVarSlotCompatibility and so on. The ASSIGN_FOO_VAR steps, for >> instance, had no protection at all. > I don't think that's true - the assign checks had copied the code from > the old ExecBuildProjectionInfo, setting isSimpleVar iff > (!attr->attisdropped && variable->vartype == attr->atttypid) - we can > check that for projections in contrast to normal expressions because we > already know the slot. Hmm, I see ... but that only works in the cases where the caller of ExecBuildProjectionInfo supplied a source slot, and a lot of 'em don't. As the code stands, we are unable to use ASSIGN_FOO_VAR in quite a lot of places, including everywhere above the relation scan level. I'd already put in the infrastructure to add ASSIGN_FOO_VAR_FIRST step types. I could take it back out, but I wonder if it wouldn't be smarter to keep it and remove the restriction in ExecBuildProjectionInfo. Or maybe we could have ExecBuildProjectionInfo emit either ASSIGN_FOO_VAR_FIRST or ASSIGN_FOO_VAR depending on whether it can prove the reference safe. regards, tom lane
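That last idea could look roughly like this inside ExecBuildProjectionInfo (a hypothetical sketch; the opcode names follow the pattern under discussion, and the attribute checks mirror the usual isSimpleVar tests):

    /*
     * Hypothetical sketch: emit the non-checking assign step only when
     * inputDesc proves the Var reference safe at compile time; otherwise
     * emit the _FIRST variant, which validates once at runtime.
     */
    if (inputDesc != NULL &&
        attnum > 0 && attnum <= inputDesc->natts &&
        !inputDesc->attrs[attnum - 1]->attisdropped &&
        variable->vartype == inputDesc->attrs[attnum - 1]->atttypid)
        scratch.opcode = EEOP_ASSIGN_INNER_VAR;         /* proven safe */
    else
        scratch.opcode = EEOP_ASSIGN_INNER_VAR_FIRST;   /* check at runtime */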
On 2017-03-23 21:26:19 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-03-23 20:36:32 -0400, Tom Lane wrote:
> >> The problem here is that once we drop the buffer pin, any pointers we may
> >> have into on-disk data are dangling pointers --- we're at risk of some
> >> other backend taking away that shared buffer. (So I'm afraid that the
> >> commentary there is overly optimistic.) So even though those pointers
> >> may still be there beyond tts_nvalid, subsequent references to them are
> >> very dangerous.
>
> > This applies to the code in master as well, no? An ExecEvalScalarVar()
> > followed by an ExecEvalWholeRowVar() would have precisely the same
> > effect?
>
> Yeah. The other order would be safe, because ExecEvalScalarVar would do
> slot_getattr which would re-extract the value from the newly materialized
> tuple. But there definitely seems to be a hazard for the order you
> mentioned.
>
> > Do we need to do anything about this in the back-branches,
> > given how unlikely this is going to be in practice?
>
> Probably not. As I mentioned, I think this may be only theoretical rather
> than real, if you believe that buffer pins would only be associated with
> slots holding references to regular tuples. And even if it's not
> theoretical, the odds of seeing a failure in the field seem pretty tiny
> given that a just-released buffer shouldn't be subject to recycling for
> a fair while. But I don't want to leave it like this going forward.

Ok.

> >> Also, while trying to test the above scenario, I realized that the patch
> >> as submitted was being way too cavalier about where it was applying
> >> CheckVarSlotCompatibility and so on. The ASSIGN_FOO_VAR steps, for
> >> instance, had no protection at all.
>
> > I don't think that's true - the assign checks had copied the code from
> > the old ExecBuildProjectionInfo, setting isSimpleVar iff
> > (!attr->attisdropped && variable->vartype == attr->atttypid) - we can
> > check that for projections in contrast to normal expressions because we
> > already know the slot.
>
> Hmm, I see ... but that only works in the cases where the caller of
> ExecBuildProjectionInfo supplied a source slot, and a lot of 'em
> don't.

Right, the old and new code comment on that:

 * inputDesc can be NULL, but if it is not, we check to see whether simple
 * Vars in the tlist match the descriptor. It is important to provide
 * inputDesc for relation-scan plan nodes, as a cross check that the relation
 * hasn't been changed since the plan was made. At higher levels of a plan,
 * there is no need to recheck.

and that seems reasonable to me? That said, I think we can remove that
assumption, by checking once.

> As the code stands, we are unable to use ASSIGN_FOO_VAR in quite a lot
> of places, including everywhere above the relation scan level.

Hm? If inputDesc isn't given we just, before and after, do:

    if (!inputDesc)
        isSimpleVar = true;     /* can't check type, assume OK */

> I'd already put in the infrastructure to add ASSIGN_FOO_VAR_FIRST
> step types. I could take it back out, but I wonder if it wouldn't be
> smarter to keep it and remove the restriction in ExecBuildProjectionInfo.
> Or maybe we could have ExecBuildProjectionInfo emit either
> ASSIGN_FOO_VAR_FIRST or ASSIGN_FOO_VAR depending on whether it can prove
> the reference safe.

I think it's probably ok to just leave the check in, and remove those
comments, and simplify the isSimpleVar stuff to only check if
IsA(tle->expr, Var) && ((Var *) tle->expr)->varattno > 0)

- Andres
Andres Freund <andres@anarazel.de> writes:
> On 2017-03-23 21:26:19 -0400, Tom Lane wrote:
>> Hmm, I see ... but that only works in the cases where the caller of
>> ExecBuildProjectionInfo supplied a source slot, and a lot of 'em
>> don't.

> Right, the old and new code comment on that:

> * inputDesc can be NULL, but if it is not, we check to see whether simple
> * Vars in the tlist match the descriptor. It is important to provide
> * inputDesc for relation-scan plan nodes, as a cross check that the relation
> * hasn't been changed since the plan was made. At higher levels of a plan,
> * there is no need to recheck.

Ah, I'd forgotten the assumption that we only need to check this at scan
level.

>> I'd already put in the infrastructure to add ASSIGN_FOO_VAR_FIRST
>> step types. I could take it back out, but I wonder if it wouldn't be
>> smarter to keep it and remove the restriction in ExecBuildProjectionInfo.
>> Or maybe we could have ExecBuildProjectionInfo emit either
>> ASSIGN_FOO_VAR_FIRST or ASSIGN_FOO_VAR depending on whether it can prove
>> the reference safe.

> I think it's probably ok to just leave the check in, and remove those
> comments, and simplify the isSimpleVar stuff to only check if
> IsA(tle->expr, Var) && ((Var *) tle->expr)->varattno > 0)

Not sure. It's a pretty fair amount of duplicative code, once you finish
dealing with all the ExecJustFoo functions in addition to the main code
paths. At this point I'm inclined to take it back out and improve the
comments around ExecBuildProjectionInfo.

regards, tom lane
On 2017-03-23 21:58:03 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-03-23 21:26:19 -0400, Tom Lane wrote:
> >> Hmm, I see ... but that only works in the cases where the caller of
> >> ExecBuildProjectionInfo supplied a source slot, and a lot of 'em
> >> don't.
>
> > Right, the old and new code comment on that:
>
> > * inputDesc can be NULL, but if it is not, we check to see whether simple
> > * Vars in the tlist match the descriptor. It is important to provide
> > * inputDesc for relation-scan plan nodes, as a cross check that the relation
> > * hasn't been changed since the plan was made. At higher levels of a plan,
> > * there is no need to recheck.
>
> Ah, I'd forgotten the assumption that we only need to check this at scan
> level.
>
> >> I'd already put in the infrastructure to add ASSIGN_FOO_VAR_FIRST
> >> step types. I could take it back out, but I wonder if it wouldn't be
> >> smarter to keep it and remove the restriction in ExecBuildProjectionInfo.
> >> Or maybe we could have ExecBuildProjectionInfo emit either
> >> ASSIGN_FOO_VAR_FIRST or ASSIGN_FOO_VAR depending on whether it can prove
> >> the reference safe.
>
> > I think it's probably ok to just leave the check in, and remove those
> > comments, and simplify the isSimpleVar stuff to only check if
> > IsA(tle->expr, Var) && ((Var *) tle->expr)->varattno > 0)
>
> Not sure. It's a pretty fair amount of duplicative code, once you finish
> dealing with all the ExecJustFoo functions in addition to the main code
> paths. At this point I'm inclined to take it back out and improve the
> comments around ExecBuildProjectionInfo.

I'm ok with both.

- Andres
Another modest proposal:

I'm not really sold on the approach of using EEOP_FETCHSOME opcodes to
trigger initial tupleslot de-forming. Certainly we want to have a single
slot_getsomeattrs call per source slot, but as-is, we need a separate
traversal over the expression tree just to precompute the max attribute
number needed. That can't be good for expression compile speed, and it
introduces nontrivial bug risks because the code that does that is
completely decoupled from the code that emits the EEOP_VAR opcodes
(which are what's really relying on the de-forming to have happened).

What do you think about a design like this:

* Drop the FETCHSOME opcodes.

* Add fields to struct ExprState that will hold the maximum inner,
outer, and scan attribute numbers needed.

* ExecInitExpr initializes those fields to zero, and then during
ExecInitExprRec, whenever we generate an EEOP_VAR opcode, we do e.g.

    state->last_inner_attno = Max(state->last_inner_attno,
                                  variable->varattno);

* ExecInitExprSlots, get_last_attnums_walker, etc all go away.

* In the startup segment of ExecInterpExpr, add

    if (state->last_inner_attno > 0)
        slot_getsomeattrs(innerslot, state->last_inner_attno);
    if (state->last_outer_attno > 0)
        slot_getsomeattrs(outerslot, state->last_outer_attno);
    if (state->last_scan_attno > 0)
        slot_getsomeattrs(scanslot, state->last_scan_attno);

This would be a little different from the existing code as far as runtime
branch-prediction behavior goes, but it's not apparent to me that it'd be
any worse. Also, for JIT purposes it'd still be entirely possible to
compile the slot_getsomeattrs calls in-line; you'd just be looking to a
different part of the ExprState struct to find out what to do.

regards, tom lane
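To visualize the proposal, here is a minimal standalone sketch. The field
names follow the mail above, but ExprStateSketch and note_inner_var are
stand-ins invented for illustration; this is not PostgreSQL code:

    #include <stdint.h>

    typedef int16_t AttrNumber;

    /* Stand-in for the proposed ExprState additions (illustration only) */
    typedef struct ExprStateSketch
    {
        AttrNumber  last_inner_attno;   /* 0 means inner slot never referenced */
        AttrNumber  last_outer_attno;
        AttrNumber  last_scan_attno;
    } ExprStateSketch;

    /*
     * What ExecInitExprRec would do whenever it emits an EEOP_VAR-style
     * step against the inner slot: remember the highest attribute number,
     * so one slot_getsomeattrs() call at interpreter startup suffices.
     */
    static void
    note_inner_var(ExprStateSketch *state, AttrNumber varattno)
    {
        if (varattno > state->last_inner_attno)
            state->last_inner_attno = varattno;
    }

The design choice being debated is exactly this: the max-attno bookkeeping
moves from a separate post-hoc tree walk into the same code path that emits
the Var opcodes, so the two can't drift apart.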
Hi,

On 2017-03-24 11:26:27 -0400, Tom Lane wrote:
> Another modest proposal:
>
> I'm not really sold on the approach of using EEOP_FETCHSOME opcodes to
> trigger initial tupleslot de-forming. Certainly we want to have a single
> slot_getsomeattrs call per source slot, but as-is, we need a separate
> traversal over the expression tree just to precompute the max attribute
> number needed. That can't be good for expression compile speed, and
> it introduces nontrivial bug risks because the code that does that
> is completely decoupled from the code that emits the EEOP_VAR opcodes
> (which are what's really relying on the de-forming to have happened).

Hm. We had the separate traversal for projections for a long while, and
I don't think there've been a lot of changes to the extraction of the
last attribute number. I'm very doubtful that the cost of traversing the
expression twice is meaningful in comparison to the other costs.

> What do you think about a design like this:
>
> * Drop the FETCHSOME opcodes.
>
> * Add fields to struct ExprState that will hold the maximum inner,
> outer, and scan attribute numbers needed.
>
> * ExecInitExpr initializes those fields to zero, and then during
> ExecInitExprRec, whenever we generate an EEOP_VAR opcode, we do e.g.
>
>     state->last_inner_attno = Max(state->last_inner_attno,
>                                   variable->varattno);
>
> * ExecInitExprSlots, get_last_attnums_walker, etc all go away.
>
> * In the startup segment of ExecInterpExpr, add
>
>     if (state->last_inner_attno > 0)
>         slot_getsomeattrs(innerslot, state->last_inner_attno);
>     if (state->last_outer_attno > 0)
>         slot_getsomeattrs(outerslot, state->last_outer_attno);
>     if (state->last_scan_attno > 0)
>         slot_getsomeattrs(scanslot, state->last_scan_attno);
>
> This would be a little different from the existing code as far as runtime
> branch-prediction behavior goes, but it's not apparent to me that it'd be
> any worse.

I'd be surprised if it weren't. I'm not super strongly against this
setup, but I fail to see the benefit of whacking this around. I've
benchmarked the previous/current setup fairly extensively, and I'd
rather not redo that. In contrast to the other changes you've talked
about, this definitely is in the hot-path...

> Also, for JIT purposes it'd still be entirely possible to compile the
> slot_getsomeattrs calls in-line; you'd just be looking to a different
> part of the ExprState struct to find out what to do.

Yea, we could do that.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> Hm. We had the separate traversal for projections for a long while, and
> I don't think there've been a lot of changes to the extraction of the
> last attribute number.

That's easily disproven just by looking at the code:

    /*
     * Don't examine the arguments or filters of Aggrefs or WindowFuncs,
     * because those do not represent expressions to be evaluated within the
     * calling expression's econtext. GroupingFunc arguments are never
     * evaluated at all.
     */
    if (IsA(node, Aggref))
        return false;
    if (IsA(node, WindowFunc))
        return false;
    if (IsA(node, GroupingFunc))
        return false;
    return expression_tree_walker(node, get_last_attnums_walker,
                                  (void *) info);

The WindowFunc exception hasn't been there so long, and the GroupingFunc
one is very new. And who's to say whether e.g. the recent XMLTABLE patch
got this right at all? We could easily be extracting columns we don't
need to.

I'm willing to leave this as-is for the moment, but I really think we
should look into changing it (after the main patch is in).

regards, tom lane
Attached is an updated patch. I believe this is committable, but since
I whacked it around quite a bit, I'm sure you'll want to look it over
first.

Please be sure the commit message notes that function EXECUTE permissions
are now checked at executor startup not first call. We need to document
that in the v10 release notes as an incompatibility, and I'm sure we'll
forget if the commit log doesn't point it out.

Some loose ends that remain to be looked at, though I do not think any
of these are reasons to postpone commit:

* I'm concerned that we now do not have enough check_stack_depth() calls.
In our off-list discussion I wanted to add one to ExecProcNode, and you
were unhappy about that ... but it'd still be a lot less stack checking
than we do now.

* I still think we should look into removing the EEOP_FETCHSOME op types,
or at least finding some other way to perform the calculation of the last
attnums in the mainline expression compilation path.

* As we discussed, ExecEvalWholeRowVar is now using a pretty inefficient
method for flattening tuples into datums. I will take a to-do item to
fix this.

* ExecInitCheck is really just ExecInitExpr with a make_ands_explicit
call in front of it. It turned out that most of the places that were
(or should have been) calling it were doing a make_ands_implicit call
for no other reason than to satisfy its API. I changed most of those
to just call ExecInitExpr directly. There are just a couple of call
sites left, and I think probably those are likewise not really that
happy with this API --- but I didn't take the time to chase down where
the expressions were coming from in those cases. It seems possible
that we could remove ExecInitCheck/ExecPrepareCheck entirely. I'll
look into this later, too.

* As we discussed, CaseTestValue/DomainValue are pretty messy and need
to be rethought. That might not get done for v10 though.

* I'm not very happy that execSRF.c has two somewhat different
implementations of largely similar functionality, and
SetExprState.elidedFuncState seems like a wart. That's mostly a
pre-existing problem of course. I'm satisfied with leaving it as it is
for now, but eventually some refactoring there would be good.

The attached patch is against HEAD as of last night (commit 457a44487).

regards, tom lane
btw ... I just got around to looking at a code coverage report for this
patched version, and that reminded me of something I'd already suspected:
EEOP_INNER_SYSVAR and EEOP_OUTER_SYSVAR seem to be dead code. That's
unsurprising, because we never try to access a tuple's system columns
above the scan level. If a query asks for system columns, those get
passed up to upper query levels as ordinary user-side columns.

We could keep the execution support for those opcodes, or we could rip
it out and throw an error in execExpr.c if one would need to be
generated. I'm a little inclined to the latter, because it seems like
the plan is to grow more support for every opcode in the future. We
don't need to be adding support for unreachable opcodes.

regards, tom lane
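If the rip-it-out option were chosen, expression compilation could simply
refuse to emit those steps. A hypothetical sketch of that check - the
function name and error path are invented for illustration, though the
INNER_VAR/OUTER_VAR marker values match the PostgreSQL headers:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef int16_t AttrNumber;

    /* Special varno markers, as defined in the PostgreSQL headers */
    #define INNER_VAR 65000
    #define OUTER_VAR 65001

    /*
     * System columns have negative attribute numbers and are only ever
     * fetched at scan level, so a system-column Var against the inner or
     * outer slot should never reach expression compilation.
     */
    static void
    reject_unreachable_sysvar(int varno, AttrNumber varattno)
    {
        if (varattno < 0 && (varno == INNER_VAR || varno == OUTER_VAR))
        {
            fprintf(stderr, "unexpected system column reference above scan level\n");
            abort();
        }
    }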
More random musing ... have you considered making the jump-target fields
in expressions be relative rather than absolute indexes? That is,
EEO_JUMP would look like

    op += (stepno); \
    EEO_DISPATCH(); \

instead of

    op = &state->steps[stepno]; \
    EEO_DISPATCH(); \

I have not carried out a full patch to make this work, but just making
that one change and examining the generated assembly code looks promising.
Instead of this

    movslq 40(%r14), %r8
    salq   $6, %r8
    addq   24(%rbx), %r8
    movq   %r8, %r14
    jmp    *(%r8)

we get this

    movslq 40(%r14), %rax
    salq   $6, %rax
    addq   %rax, %r14
    jmp    *(%r14)

which certainly looks like it ought to be faster. Also, the real reason
I got interested in this at all is that with relative jumps, groups of
steps would be position-independent within the steps array, which would
enable some compile-time tricks that seem impractical with the current
definition.

BTW, now that I've spent a bit of time looking at the generated assembly
code, I'm kind of disinclined to believe any arguments about how we have
better control over branch prediction with the jump-threading
implementation. At least with current gcc (6.3.1 on Fedora 25) at -O2,
what I see is multiple places jumping to the same indirect jump
instruction :-(. It's not a total disaster: as best I can tell, all the
uses of EEO_JUMP remain distinct. But gcc has chosen to implement about
40 of the 71 uses of EEO_NEXT by jumping to the same couple of
instructions that increment the "op" register and then do an indirect
jump :-(. So it seems that we're at the mercy of gcc's whims as to which
instruction dispatches will be distinguishable to the hardware; which
casts a very dark shadow over any benchmarking-based arguments that X is
better than Y for branch prediction purposes. Compiler version
differences are likely to matter a lot more than anything we do.

regards, tom lane
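To make the two dispatch styles concrete, here is a small self-contained
toy interpreter. It is GCC/Clang-specific (it relies on computed goto),
it is not PostgreSQL code, and the EEO_NEXT macro merely mimics the real
one; it can run with either absolute or relative jump targets:

    #include <stdio.h>

    typedef struct Step
    {
        int  opcode;    /* index into the dispatch table below */
        int  target;    /* jump target: absolute index or relative offset */
        long operand;
    } Step;

    /*
     * Tiny opcode-dispatch interpreter: sums operands, with one jump step.
     * With relative jumps the dispatch needs no reference back to the base
     * of the steps array, which is the saving visible in the assembly above.
     */
    static long
    run(Step *steps, int use_relative)
    {
        static void *dispatch[] = { &&op_add, &&op_jump, &&op_done };
        long  acc = 0;
        Step *op = steps;

    #define EEO_NEXT() goto *dispatch[(++op)->opcode]
        goto *dispatch[op->opcode];

    op_add:
        acc += op->operand;
        EEO_NEXT();
    op_jump:
        if (use_relative)
            op += op->target;           /* op += (stepno); */
        else
            op = &steps[op->target];    /* op = &state->steps[stepno]; */
        goto *dispatch[op->opcode];
    op_done:
        return acc;
    #undef EEO_NEXT
    }

    int
    main(void)
    {
        /* add 1; jump over "add 100"; add 2; done => 3 */
        Step abs_prog[] = { {0, 0, 1}, {1, 3, 0}, {0, 0, 100}, {0, 0, 2}, {2, 0, 0} };
        Step rel_prog[] = { {0, 0, 1}, {1, 2, 0}, {0, 0, 100}, {0, 0, 2}, {2, 0, 0} };

        printf("%ld %ld\n", run(abs_prog, 0), run(rel_prog, 1));
        return 0;
    }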
Hi,

On 2017-03-24 17:16:15 -0400, Tom Lane wrote:
> Attached is an updated patch. I believe this is committable, but
> since I whacked it around quite a bit, I'm sure you'll want to look
> it over first.

Indeed. Points:

- It's going to take a while for me to get used to execExprInterp.c -
  I constantly try to switch to a nonexistent buffer ;)

- Like your approach with /* Check that current tupdesc doesn't have
  more fields than we allocated */ in ExecEvalFieldStoreDeForm().

- I like EEO_FLAG_IS_QUAL.

- I think at some point (not sure whether in your or in my revision)
  ScalarArrayOpExpr lost its permission check. Added that.

- Added note that we knowingly don't perform permission checks for
  input/output funcs.

- Both pre/post patch, CoerceViaIO doesn't invoke
  InvokeFunctionExecuteHook - I'm still unclear what that hook is
  useful for, so ...

- The !caseExpr->defresult result branch is currently unreachable (and
  its equivalent was before the patch) because transformCaseExpr()
  generates a default expression. I'm inclined to replace the dead code
  with an assertion. Any reason not to do that?

- I see you kept determination of the set of constraints to be checked
  for domains at initialization time. Made a note that that a) might
  change and b) could change behaviour.

> Please be sure the commit message notes that function EXECUTE permissions
> are now checked at executor startup not first call. We need to document
> that in the v10 release notes as an incompatibility, and I'm sure we'll
> forget if the commit log doesn't point it out.

Done.

> * As we discussed, ExecEvalWholeRowVar is now using a pretty inefficient
> method for flattening tuples into datums. I will take a to-do item to
> fix this.

Thanks.

> * ExecInitCheck is really just ExecInitExpr with a make_ands_explicit
> call in front of it. It turned out that most of the places that were
> (or should have been) calling it were doing a make_ands_implicit call
> for no other reason than to satisfy its API. I changed most of those
> to just call ExecInitExpr directly. There are just a couple of call
> sites left, and I think probably those are likewise not really that
> happy with this API --- but I didn't take the time to chase down where
> the expressions were coming from in those cases. It seems possible
> that we could remove ExecInitCheck/ExecPrepareCheck entirely. I'll
> look into this later, too.

Thanks^2.

> * As we discussed, CaseTestValue/DomainValue are pretty messy and need
> to be rethought. That might not get done for v10 though.

Yea, I'm disinclined to tackle this just now, there's enough other stuff
pending - but I'm willing to tackle it for v11 unless you beat me to it.

> * I'm not very happy that execSRF.c has two somewhat different
> implementations of largely similar functionality, and
> SetExprState.elidedFuncState seems like a wart. That's mostly a
> pre-existing problem of course. I'm satisfied with leaving it as it is
> for now, but eventually some refactoring there would be good.

Yea, that's not pretty. Given the historical difference in behaviour
between SRF and ROWS FROM (the latter always materializes, the former
only when the set function does so itself), I'm not sure they can easily
be simplified. I'm not really sure how to get rid of the issue underlying
elidedFuncState? I mean we could move that case into nodeFunctionscan,
above ExecMakeTableFunctionResult's level, but that doesn't really seem
like an improvement.
I think there's some argument to be made that we should move all of
execSRF.c's logic to the respective callsites. There's not really much
shared code: ExecEvalFuncArgs(), tupledesc_match(), init_sexpr() - and
at least the latter would even become a bit simpler if we didn't support
both callers.

Thanks a lot for your work on the patch!

And pushed.

- Andres
Andres Freund <andres@anarazel.de> writes:
> - The !caseExpr->defresult result branch is currently unreachable (and
> its equivalent was before the patch) because transformCaseExpr()
> generates a default expression. I'm inclined to replace the dead code
> with an assertion. Any reason not to do that?

Ah-hah. I'd noted that that code wasn't being reached when I did some
code coverage checks this morning, but I hadn't found the cause yet.
Yeah, replacing the if-test with an assert seems fine.

> I think there's some argument to be made that we should move all of
> execSRF.c's logic to the respective callsites. There's not really
> much shared code: ExecEvalFuncArgs(), tupledesc_match(), init_sexpr() -
> and at least the latter would even become a bit simpler if we didn't
> support both callers.

Possibly. All that code could stand to be rethought, probably, now that
its mission is just to handle the SRF case. But at least all the cruft
is in one place for the moment. And I don't think it's anything we have
to fix before v10, it's just cosmetic.

> Thanks a lot for your work on the patch!

You're welcome!

regards, tom lane
On 2017-03-25 12:22:15 -0400, Tom Lane wrote:
> More random musing ... have you considered making the jump-target fields
> in expressions be relative rather than absolute indexes? That is,
> EEO_JUMP would look like
>
>     op += (stepno); \
>     EEO_DISPATCH(); \
>
> instead of
>
>     op = &state->steps[stepno]; \
>     EEO_DISPATCH(); \
>
> I have not carried out a full patch to make this work, but just making
> that one change and examining the generated assembly code looks promising.
> Instead of this
>
>     movslq 40(%r14), %r8
>     salq   $6, %r8
>     addq   24(%rbx), %r8
>     movq   %r8, %r14
>     jmp    *(%r8)
>
> we get this
>
>     movslq 40(%r14), %rax
>     salq   $6, %rax
>     addq   %rax, %r14
>     jmp    *(%r14)

That seems like a good idea. I've not done this in the committed version
(and I don't think we necessarily need to do this before the release),
but for the future it seems like a good plan. It makes sense that it's
faster - there's no need to reference state->steps.

> which certainly looks like it ought to be faster. Also, the real reason
> I got interested in this at all is that with relative jumps, groups of
> steps would be position-independent within the steps array, which would
> enable some compile-time tricks that seem impractical with the current
> definition.

Indeed.

> BTW, now that I've spent a bit of time looking at the generated assembly
> code, I'm kind of disinclined to believe any arguments about how we have
> better control over branch prediction with the jump-threading
> implementation.

I measured the performance difference between using it and not using it,
and it came out a pretty clear plus. On gcc 6.3, gcc master snapshot, and
clang-3.9. It's not just that more jumps are duplicated, it's also that
the switch() always adds a boundary check.

> At least with current gcc (6.3.1 on Fedora 25) at -O2,
> what I see is multiple places jumping to the same indirect jump
> instruction :-(. It's not a total disaster: as best I can tell, all the
> uses of EEO_JUMP remain distinct. But gcc has chosen to implement about
> 40 of the 71 uses of EEO_NEXT by jumping to the same couple of
> instructions that increment the "op" register and then do an indirect
> jump :-(.

Yea, I see some of that too - "usually" when there's more than just the
jump in common. I think there's some gcc variables that influence this
(min-crossjump-insns (5), max-goto-duplication-insns (8)). Might be
worthwhile experimenting with setting them locally via a pragma or such.
I think Ants wanted to experiment with that, too.

Then there's also
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71785
which causes some forms of computed goto (not ours I think) to be
deoptimized in gcc.

Greetings,

Andres Freund
On Sun, Mar 26, 2017 at 12:22 AM, Andres Freund <andres@anarazel.de> wrote:
>> At least with current gcc (6.3.1 on Fedora 25) at -O2,
>> what I see is multiple places jumping to the same indirect jump
>> instruction :-(. It's not a total disaster: as best I can tell, all the
>> uses of EEO_JUMP remain distinct. But gcc has chosen to implement about
>> 40 of the 71 uses of EEO_NEXT by jumping to the same couple of
>> instructions that increment the "op" register and then do an indirect
>> jump :-(.
>
> Yea, I see some of that too - "usually" when there's more than just the
> jump in common. I think there's some gcc variables that influence this
> (min-crossjump-insns (5), max-goto-duplication-insns (8)). Might be
> worthwhile experimenting with setting them locally via a pragma or such.
> I think Ants wanted to experiment with that, too.

I haven't had the time to research this properly, but initial tests
show that with GCC 6.2 adding

    #pragma GCC optimize ("no-crossjumping")

fixes merging of the op tail jumps.

Some quick and dirty benchmarking suggests that the benefit for the
interpreter is about 15% (5% speedup on a workload that spends 1/3 in
ExecInterpExpr). My idea of prefetching op->resnull/resvalue to local
vars before the indirect jump is somewhere between a tiny benefit and
no effect, certainly not worth introducing extra complexity. Clang 3.8
does the correct thing out of the box and is a couple of percent faster
than GCC with the pragma.

Regards,

Ants Aasma
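For anyone wanting to reproduce this experiment, a sketch of the
file-scoped form (GCC-specific; the toy interpreter shown earlier in the
thread works as a test subject):

    /*
     * Keep GCC's optimizer from merging ("cross-jumping") the duplicated
     * indirect-jump tails of a computed-goto dispatch loop. At file scope
     * this affects every function that follows in the translation unit.
     */
    #pragma GCC optimize ("no-crossjumping")

Comparing the disassembly (objdump -d) of a -O2 build with and without
the pragma shows whether the per-opcode dispatch tails stay distinct.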
On March 25, 2017 4:56:11 PM PDT, Ants Aasma <ants.aasma@eesti.ee> wrote:
> On Sun, Mar 26, 2017 at 12:22 AM, Andres Freund <andres@anarazel.de> wrote:
>>> At least with current gcc (6.3.1 on Fedora 25) at -O2,
>>> what I see is multiple places jumping to the same indirect jump
>>> instruction :-(. It's not a total disaster: as best I can tell, all the
>>> uses of EEO_JUMP remain distinct. But gcc has chosen to implement about
>>> 40 of the 71 uses of EEO_NEXT by jumping to the same couple of
>>> instructions that increment the "op" register and then do an indirect
>>> jump :-(.
>>
>> Yea, I see some of that too - "usually" when there's more than just the
>> jump in common. I think there's some gcc variables that influence this
>> (min-crossjump-insns (5), max-goto-duplication-insns (8)). Might be
>> worthwhile experimenting with setting them locally via a pragma or such.
>> I think Ants wanted to experiment with that, too.
>
> I haven't had the time to research this properly, but initial tests
> show that with GCC 6.2 adding
>
>     #pragma GCC optimize ("no-crossjumping")
>
> fixes merging of the op tail jumps.
>
> Some quick and dirty benchmarking suggests that the benefit for the
> interpreter is about 15% (5% speedup on a workload that spends 1/3 in
> ExecInterpExpr). My idea of prefetching op->resnull/resvalue to local
> vars before the indirect jump is somewhere between a tiny benefit and
> no effect, certainly not worth introducing extra complexity. Clang 3.8
> does the correct thing out of the box and is a couple of percent
> faster than GCC with the pragma.

That's large enough to be worth doing (although I recall you seeing all
jumps commonalized). We should probably do this on a per-function basis
however (either using pragma push option, or function attributes).

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Andres Freund <andres@anarazel.de> writes:
> On March 25, 2017 4:56:11 PM PDT, Ants Aasma <ants.aasma@eesti.ee> wrote:
>> I haven't had the time to research this properly, but initial tests
>> show that with GCC 6.2 adding
>>
>>     #pragma GCC optimize ("no-crossjumping")
>>
>> fixes merging of the op tail jumps.
>>
>> Some quick and dirty benchmarking suggests that the benefit for the
>> interpreter is about 15% (5% speedup on a workload that spends 1/3 in
>> ExecInterpExpr). My idea of prefetching op->resnull/resvalue to local
>> vars before the indirect jump is somewhere between a tiny benefit and
>> no effect, certainly not worth introducing extra complexity. Clang 3.8
>> does the correct thing out of the box and is a couple of percent
>> faster than GCC with the pragma.

> That's large enough to be worth doing (although I recall you seeing all
> jumps commonalized). We should probably do this on a per-function basis
> however (either using pragma push option, or function attributes).

Seems like it would be fine to do it on a per-file basis. If you're
worried about pessimizing the out-of-line subroutines, we could move
those to a different file --- it's pretty questionable that they're
in execExprInterp.c in the first place, considering they're meant to be
used by more than just that execution method.

regards, tom lane
On 2017-03-25 23:51:45 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On March 25, 2017 4:56:11 PM PDT, Ants Aasma <ants.aasma@eesti.ee> wrote:
> >> I haven't had the time to research this properly, but initial tests
> >> show that with GCC 6.2 adding
> >>
> >>     #pragma GCC optimize ("no-crossjumping")
> >>
> >> fixes merging of the op tail jumps.
> >>
> >> Some quick and dirty benchmarking suggests that the benefit for the
> >> interpreter is about 15% (5% speedup on a workload that spends 1/3 in
> >> ExecInterpExpr). My idea of prefetching op->resnull/resvalue to local
> >> vars before the indirect jump is somewhere between a tiny benefit and
> >> no effect, certainly not worth introducing extra complexity. Clang 3.8
> >> does the correct thing out of the box and is a couple of percent
> >> faster than GCC with the pragma.
>
> > That's large enough to be worth doing (although I recall you seeing all
> > jumps commonalized). We should probably do this on a per-function basis
> > however (either using pragma push option, or function attributes).
>
> Seems like it would be fine to do it on a per-file basis.

I personally find per-function annotation ala
__attribute__((optimize("no-crossjumping")))
cleaner anyway. I tested that, and it seems to work.

Obviously we'd have to hide that behind a configure test. Could also do
tests based on __GNUC__ / __GNUC_MINOR__, but that seems uglier.

> If you're
> worried about pessimizing the out-of-line subroutines, we could move
> those to a different file --- it's pretty questionable that they're
> in execExprInterp.c in the first place, considering they're meant to be
> used by more than just that execution method.

I indeed am, but having the code in the same file has a minor advantage:
It allows the compiler to partially inline them, if it feels like it
(e.g. moving null checks inline).

Greetings,

Andres Freund
On 2017-03-25 20:59:27 -0700, Andres Freund wrote:
> On 2017-03-25 23:51:45 -0400, Tom Lane wrote:
> > Seems like it would be fine to do it on a per-file basis.
>
> I personally find per-function annotation ala
> __attribute__((optimize("no-crossjumping")))
> cleaner anyway. I tested that, and it seems to work.
>
> Obviously we'd have to hide that behind a configure test. Could also do
> tests based on __GNUC__ / __GNUC_MINOR__, but that seems uglier.

Checking for this isn't entirely pretty - see my attached attempt at
doing so. I considered hiding
__attribute__((optimize("no-crossjumping"))) in execExprInterp.c behind
a macro (like PG_DISABLE_CROSSJUMPING), but I don't really think that
makes things better.

Comments?

Greetings,

Andres Freund
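For illustration, the macro idea mentioned here might look as follows -
a hypothetical sketch assuming configure defines
HAVE_FUNCATTR_NO_CROSSJUMPING after a successful compile test; sum_steps
is a placeholder body, and this is not what was ultimately committed:

    #ifdef HAVE_FUNCATTR_NO_CROSSJUMPING
    #define PG_DISABLE_CROSSJUMPING __attribute__((optimize("no-crossjumping")))
    #else
    #define PG_DISABLE_CROSSJUMPING
    #endif

    /*
     * GCC requires function attributes to appear ahead of the declarator
     * in a definition, so the macro has to precede the function name.
     */
    static long PG_DISABLE_CROSSJUMPING
    sum_steps(const long *operands, int nsteps)
    {
        long sum = 0;

        for (int i = 0; i < nsteps; i++)
            sum += operands[i];
        return sum;
    }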
Andres Freund <andres@anarazel.de> writes:
>> I personally find per-function annotation ala
>> __attribute__((optimize("no-crossjumping")))
>> cleaner anyway. I tested that, and it seems to work.
>>
>> Obviously we'd have to hide that behind a configure test. Could also do
>> tests based on __GNUC__ / __GNUC_MINOR__, but that seems uglier.

Agreed.

> Checking for this isn't entirely pretty - see my attached attempt at
> doing so. I considered hiding
> __attribute__((optimize("no-crossjumping"))) in execExprInterp.c behind
> a macro (like PG_DISABLE_CROSSJUMPING), but I don't really think that
> makes things better.

I think it would, primarily because if we find out that some other
compiler spells this differently, we could handle it totally within
configure.

Isn't our practice to put __attribute__ at the end of a function
declaration or definition, not in the middle someplace?

regards, tom lane
On 2017-03-27 09:33:43 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Checking for this isn't entirely pretty - see my attached attempt at
> > doing so. I considered hiding
> > __attribute__((optimize("no-crossjumping"))) in execExprInterp.c behind
> > a macro (like PG_DISABLE_CROSSJUMPING), but I don't really think that
> > makes things better.
>
> I think it would, primarily because if we find out that some other compiler
> spells this differently, we could handle it totally within configure.

I'm unconvinced that we could sensibly map different compiler's options
on the same option name - I'd be surprised if they would have a similar
enough effect.

> Isn't our practice to put __attribute__ at the end of a function
> declaration or definition, not in the middle someplace?

We could move it to the declaration - which doesn't seem like an
improvement here, given that it's about optimization not API changes -
but in definitions you can't move it after the function name:

    static Datum
    ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
    #ifdef HAVE_FUNCATTR_NO_CROSSJUMPING
    __attribute__((optimize("no-crossjumping")))
    #endif

:

/home/andres/src/postgresql/src/backend/executor/execExprInterp.c:279:1: error: attributes should be specified before the declarator in a function definition

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2017-03-27 09:33:43 -0400, Tom Lane wrote:
>> I think it would, primarily because if we find out that some other compiler
>> spells this differently, we could handle it totally within configure.

> I'm unconvinced that we could sensibly map different compiler's options
> on the same option name - I'd be surprised if they would have a similar
> enough effect.

[ shrug... ] OK, next question is whether pgindent treats this layout
sanely.

regards, tom lane
On 2017-03-27 11:22:40 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-03-27 09:33:43 -0400, Tom Lane wrote:
> >> I think it would, primarily because if we find out that some other compiler
> >> spells this differently, we could handle it totally within configure.
>
> > I'm unconvinced that we could sensibly map different compiler's options
> > on the same option name - I'd be surprised if they would have a similar
> > enough effect.
>
> [ shrug... ] OK, next question is whether pgindent treats this
> layout sanely.

The patch was run through pgindent, i.e. it seems to be ok with it.

- Andres
As to the point of whether it actually helps or not ...

on gcc 6.3.1 (Fedora 25), yes it does. Seems to be one "jmp *something"
per EEO_NEXT or EEO_JUMP.

on gcc 4.4.7 (RHEL 6), it makes things *WORSE*. We go from about half of
the dispatches getting routed through a common location, to *all* of them
(except one; for some odd reason the first EEO_NEXT in EEOP_NULLIF
survives as a separate jump). This seems like a bug, but there it is.

So this means we'd need some serious research to decide whether to apply
it. And I'm suspecting we'd end up with a compiler version test.

regards, tom lane
I wrote:
> As to the point of whether it actually helps or not ...
> on gcc 4.4.7 (RHEL 6), it makes things *WORSE*. We go from about half of
> the dispatches getting routed through a common location, to *all* of them
> (except one; for some odd reason the first EEO_NEXT in EEOP_NULLIF
> survives as a separate jump). This seems like a bug, but there it is.

So after a bit of googling, this is a very longstanding complaint:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39284
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71785

(hm, think I know the second submitter)

I'm not sure we should be relying on a gcc bug fix that's barely four
months old. Even assuming it fixes the issue without new regressions,
most people are not going to have it anytime soon.

My feeling at this point is that we might be better off disabling the
computed-goto case by default. At the very least, we're going to need a
version check that restricts it to latest gcc.

regards, tom lane
On 2017-03-27 11:52:05 -0400, Tom Lane wrote:
> As to the point of whether it actually helps or not ...
>
> on gcc 6.3.1 (Fedora 25), yes it does. Seems to be one "jmp *something"
> per EEO_NEXT or EEO_JUMP.
>
> on gcc 4.4.7 (RHEL 6), it makes things *WORSE*. We go from about half of
> the dispatches getting routed through a common location, to *all* of them
> (except one; for some odd reason the first EEO_NEXT in EEOP_NULLIF
> survives as a separate jump). This seems like a bug, but there it is.

Gah :(. I have gcc 5, 6, 7 here, but nothing earlier. I'm now inclined
to go the version-check route, for the individual versions we can (and
want to) confirm this on... I'm not too concerned about not doing so on
gcc 4.4 or older...

> So this means we'd need some serious research to decide whether to apply
> it. And I'm suspecting we'd end up with a compiler version test.

Indeed.

- Andres
On 2017-03-27 12:18:37 -0400, Tom Lane wrote:
> I wrote:
> > As to the point of whether it actually helps or not ...
> > on gcc 4.4.7 (RHEL 6), it makes things *WORSE*. We go from about half of
> > the dispatches getting routed through a common location, to *all* of them
> > (except one; for some odd reason the first EEO_NEXT in EEOP_NULLIF
> > survives as a separate jump). This seems like a bug, but there it is.
>
> So after a bit of googling, this is a very longstanding complaint:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39284
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71785
>
> (hm, think I know the second submitter)

I don't think that's precisely the same issue - as long as some of the
goto branches survive, we're not hitting the full brunt of the compgoto
thing. I think we're essentially avoiding some of that because we're
"precomputing" the dispatch_table lookup.

To count the number of jumps I used:

    gdb -batch -ex 'disassemble/s ExecInterpExpr' execExprInterp.o | grep jmpq | grep -v ExecInterpExpr | wc -l

which'll only include indirect jumps (since otherwise jumps will look
like "jmpq 0x35f2 <ExecInterpExpr+2066>").

                      standard flags    -fno-crossjumping
    gcc-5 (5.4.1-8):        34                  82
    gcc-6 (6.3.0-8):        34                  82
    gcc-7 (7.0.1):          71                 108
    gcc-snapshot:           72                 108

So that doesn't look too bad.

> My feeling at this point is that we might be better off disabling
> the computed-goto case by default. At the very least, we're going
> to need a version check that restricts it to latest gcc.

In my measurements it's still faster in at least gcc-5/6, even without
the option (largely because it avoids array bounds checks on the jump
table built for the switch).

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2017-03-27 12:18:37 -0400, Tom Lane wrote:
>> My feeling at this point is that we might be better off disabling
>> the computed-goto case by default. At the very least, we're going
>> to need a version check that restricts it to latest gcc.

> In my measurements it's still faster in at least gcc-5/6, even without
> the option (largely because it avoids array bounds checks on the jump
> table built for the switch).

Hm. What test cases are you using?

regards, tom lane
On 2017-03-27 13:30:11 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-03-27 12:18:37 -0400, Tom Lane wrote:
> >> My feeling at this point is that we might be better off disabling
> >> the computed-goto case by default. At the very least, we're going
> >> to need a version check that restricts it to latest gcc.
>
> > In my measurements it's still faster in at least gcc-5/6, even without
> > the option (largely because it avoids array bounds checks on the jump
> > table built for the switch).
>
> Hm. What test cases are you using?

I used tpc-h - seems like a realistic enough testcase.

Andres