Thread: WIP: Faster Expression Processing and Tuple Deforming (including JIT)

WIP: Faster Expression Processing and Tuple Deforming (including JIT)

From
Andres Freund
Date:
Hi Everyone,

TL;DR: Making things faster. Architectural evaluation.

as some of you might be aware I've been working on making execution of
larger queries in postgres faster. While working on "batched execution"
I came to the conclusion that batching, while necessary, doesn't
currently show a large benefit, because expression evaluation and tuple
deforming are massive bottlenecks.

I'm posting a quite massive series of WIP patches here, to get some
feedback.

Tuple deforming is slow for two reasons:

1) It's the first thing that accesses tuples, i.e. it'll often incur
   cache misses. That's partially fundamental, but also partially can be
   addressed, e.g. through changing the access order in heap as in [1].

2) Tuple deforming has a lot of unpredictable branches, because it has
   to cope with various types of fields. We e.g. perform alignment in a
   lot of unneeded cases, do null checks for NOT NULL columns et al.
   (see the sketch below).
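To make 2) concrete, here's a heavily simplified sketch of such a
deforming loop (this is *not* the actual slot_deform_tuple() code;
bitmap_member(), fetch_fixed() and column_size() are made-up helpers).
Every iteration takes several branches whose outcome depends on the
column definition and the tuple contents:

    /* hypothetical sketch, not the real slot_deform_tuple() */
    static void
    deform_sketch(const char *data, int natts, bool hasnulls,
                  const uint8 *nullbits, const int16 *attlen,
                  const uint8 *attalign, Datum *values, bool *isnull)
    {
        uintptr_t   off = 0;

        for (int attnum = 0; attnum < natts; attnum++)
        {
            /* branch 1: null check, taken even for NOT NULL columns */
            if (hasnulls && bitmap_member(nullbits, attnum))
            {
                values[attnum] = (Datum) 0;
                isnull[attnum] = true;
                continue;
            }
            isnull[attnum] = false;

            /* branch 2: alignment, frequently unneeded for this column */
            if (attalign[attnum] > 1)
                off = (off + attalign[attnum] - 1) & ~(uintptr_t) (attalign[attnum] - 1);

            /* branch 3: fixed-width vs. variable-width fetch */
            if (attlen[attnum] > 0)
                values[attnum] = fetch_fixed(data + off, attlen[attnum]);
            else
                values[attnum] = PointerGetDatum(data + off);

            off += column_size(data + off, attlen[attnum]);
        }
    }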

I tried to address 2) by changing the C implementation. That brings some
measurable speedups, but it's not huge. A bigger speedup is making
slot_getattr, slot_getsomeattrs, slot_getallattrs very trivial wrappers;
but it's still not huge.  Finally I turned to just-in-time (JIT)
compiling the code for tuple deforming. That doesn't save the cost of
1), but it gets rid of most of 2) (from ~15% to ~3% in TPCH-Q01).  The
first part is done in 0008, the JITing in 0012.
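For reference, the wrapper idea is roughly the following (a sketch of
the shape only, not the patch's actual code): deform all leading
attributes once, then the wrappers merely index the already-deformed
arrays:

    /* sketch: slot_getattr() reduced to a trivial wrapper */
    static inline Datum
    slot_getattr_sketch(TupleTableSlot *slot, int attnum, bool *isnull)
    {
        if (attnum > slot->tts_nvalid)
            slot_deform_tuple(slot, attnum);   /* fills tts_values/tts_isnull */

        *isnull = slot->tts_isnull[attnum - 1];
        return slot->tts_values[attnum - 1];
    }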


Expression evaluation and projection is another major bottleneck:

1) Our recursive expression evaluation puts a *lot* of pressure on the
   stack.

2) There's a lot of indirect function calls when recursing to other
   expression nodes. These are hard to predict, because the same node
   type (say ExecEvalAnd()) is used in different parts of an expression
   tree, and invokes different sub-nodes.

3) The function calls to operators and other functions are hard to
   predict, leading to a significant number of pipeline stalls.

4) There's a fair amount of pg_list.h style list iteration going on,
   which is cache and pipeline inefficient.

After some experimenting I came to the conclusion that the recursive
processing is a fundamental impediment to making this faster.  I've
converted (0006) expression processing and projection into an opcode
dispatch based interpreter. That yields, especially for complex
expressions and larger projections, a significant speedup in itself.  But
similarly to the deforming, expression evaluation remains a bottleneck
after that, primarily because there's still a lot of unpredictable jumps
and calls, and because loads/stores have to be complex
(e.g. ExprContext->ecxt_innertuple->tts_values[i]/tts_isnull[i] for a
single scalar var evaluation).   Using the opcode based representation
of expression evaluation (as it's nearly linear, and has done a lot of
the lookups ahead of time), it's actually quite easy to JIT compile it.


*After JITing, expression evaluation itself is more than ten times faster
than before*.

But unfortunately that doesn't mean that queries are ten times faster -
usually we'll hit bottlenecks elsewhere relatively soon.  WRT expression
evaluation, the biggest cost afterwards is the relatively high overhead
of V1 function calls - register based parameter passing is a lot
faster.


After experimenting a bit with doing JITing manually (a lot of
eye-stabbing kind of fun), I chose to use LLVM.

An overview of the patch-queue so far:
0001  Make get_last_attnums more generic.

Boring prerequisite.

0002  More efficient AggState->pertrans iteration.

Relatively boring minor optimization, but it turns out to be an easily
hit bottleneck. Will commit independently.


0003  Avoid materializing SRFs in the FROM list.
0004  Allow ROWS FROM to return functions as single record column.
0005  Basic implementation of targetlist SRFs via ROWS FROM.
0006  Remove unused code related to targetlist SRFs.

These are basically just pre-requisites for the faster expression
evaluation, and discussed elsewhere [2].  This implementation is *NOT*
going to survive, because we ended up coming to the conclusion that using
a separate executor node to expand SRFs is a better plan. But the new
expression evaluation code won't be able to handle SRFs...


0007  WIP: Optimize slot_deform_tuple() significantly.

This a) turns tuple deforming into an opcode based dispatch loop (using
computed goto on gcc/clang). b) moves a lot of the logic from
slot_deform_tuple() callsites into itself - that turns out to be more
efficient.  I'm not entirely sure it's worth doing the opcode based
dispatch part, if we're going to also do the JIT bit - it's a fair
amount of code, and the speed difference only matters on large amounts
of rows.


0008  WIP: Faster expression processing and targetlist projection.

This, functionally nearly complete, patch turns expression evaluation
(and tuple deforming as a special case of that) into a "mini language"
which is interpreted using either a while(true) switch(opcode) or
computed goto to jump from opcode to opcode.  It does so by moving a lot
more of the code for expression evaluation to initialization time and
building a linear series of steps to evaluate expressions, thereby
removing all recursion from expression processing.
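To illustrate the shape of that mini language, here's a minimal
standalone sketch (the opcode names and struct layout are invented for
illustration, not the patch's actual ones; Datum as in postgres.h):

    typedef enum { OP_CONST, OP_INNER_VAR, OP_DONE } StepOpcode;

    typedef struct ExprStep
    {
        StepOpcode  opcode;
        int         attnum;     /* for OP_INNER_VAR */
        Datum       constval;   /* for OP_CONST */
    } ExprStep;

    /*
     * while(true) switch(opcode) variant; with computed goto each step
     * instead stores a label address and dispatch is "goto *step->label".
     */
    static Datum
    exec_steps(const ExprStep *step, const Datum *inner_values)
    {
        Datum       result = (Datum) 0;

        for (;;)
        {
            switch (step->opcode)
            {
                case OP_CONST:
                    result = step->constval;
                    break;
                case OP_INNER_VAR:
                    result = inner_values[step->attnum];
                    break;
                case OP_DONE:
                    return result;
            }
            step++;
        }
    }

The expression tree is walked once at initialization to emit such a step
array; at execution time there's no recursion left, just the loop above.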

This nearly entirely gets rid of the stack usage cost of expression
evaluation (we pretty much never recurse except for subplans). Being
able to remove the now-redundant calls to check_stack_depth() is a
noticeable benefit; it turns out that that check has a noticeable
performance impact (as it apparently forces the stack to actually be
used, instead of just renaming registers inside the CPU).

The new representation and evaluation is functionally nearly complete
(there's a single regression test failure, and I know why that is), but
the code needs a fair amount of polishing.

I do absolutely think that the fundamentals of this are the right way to
go, and I'm going to work hard on polishing the patch up.  But this
isn't something that we can easily do in parts, and it's a huge ass
patch. So I'd like to have at least some more buyin before wasting even
more time on this.


0009  WIP: Add minimal keytest implementation.

More or less experimental patch that tries to implement simple
expressions of the form OpExpr(ScalarVar, Const) as a single expression
evaluation step.  The benefits probably aren't big enough if we do end
up doing JITing of expressions.


0010  WIP: Add configure infrastructure to enable LLVM.
0011  WIP: Beginning of a LLVM JIT infrastructure.

Very boring preliminary patches to add --with-llvm and some minimal
infrastructure to handle LLVM. If we go this way, JITed stuff needs to
be tied to resource owners, and we need some other centralized
infrastructure.

0012  Heavily-WIP: JITing of tuple deforming.

This, in a not-yet-that-nice manner, implements a JITed version of the
per-column stuff that slot_deform_tuple() does.  It currently always
deforms all columns, which obviously would have to change. There's also
considerable additional performance improvements possible.

With this patch the per-column overhead (minus bitmap handling, which
0007 moved into a separate loop), drops from 10%+ into low single digits
for a number of queries.  Afterwards the biggest cost is VARSIZE_ANY()
for varlena columns (which atm isn't inlined).  That is, besides the
initial cache-miss when accessing tuple->t_hoff, which JITing can do
nothing about :(

This can be enabled/disabled using the new jit_tuple_deforming GUC.  To
make this production ready in some form, we'd have to come up with a way
to determine when it's worth doing JITing. The easiest way would be to
do so after N slot_deform_tuple() calls or such; another way would be to
do it based on cost estimates.


0013  WIP: ExprEval: Make threaded dispatch use a separate field.

Boring preliminary patch. Increases memory usage a bit, needs to be
thought through more.


0014  Heavily-WIP: JITed expression evaluation.

This is the most-interesting bit performance wise. A few common types of
expressions are JITed: scalar value accesses, function calls, boolean
expressions, aggregate references.

This can be enabled using the new jit_expressions GUC.

Even for the supported expression types I've taken some shortcuts
(e.g. strict functions aren't actually strict).

The performance benefits are quite noticeable. For TPCH ExecEvalExpr()
(which is where 0008 moved all of expression evaluation/projection) goes
from being the top profile entry, to barely noticeable, with the JITed
function usually not showing up in the top five entries anymore.

After the patch it becomes very clear that our function call
infrastructure is a serious bottleneck. Passing all the arguments via
memory, and, even worse, forcing isnull/values to be on separate
cachelines, has significant performance implications.  It also becomes
quite noticeable that nodeAgg's transition function invocation doesn't
go through ExecEvalExpr() but does that itself - which leads to constant
mispredictions if several transition values exist.
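For context: every V1 call goes through a FunctionCallInfoData roughly
like this (abridged from fmgr.h). With the default FUNC_MAX_ARGS of 100,
arg[] alone spans 800 bytes, so arg[i] and argnull[i] are guaranteed to
sit on different cachelines:

    typedef struct FunctionCallInfoData
    {
        FmgrInfo   *flinfo;         /* lookup info used for this call */
        fmNodePtr   context;        /* context info of the call site */
        fmNodePtr   resultinfo;     /* pass or return extra result info */
        Oid         fncollation;    /* collation for function to use */
        bool        isnull;         /* callee sets true if result is NULL */
        short       nargs;          /* # arguments actually passed */
        Datum       arg[FUNC_MAX_ARGS];      /* argument values ... */
        bool        argnull[FUNC_MAX_ARGS];  /* ... null flags, far away */
    } FunctionCallInfoData;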

While the JIT code is relatively verbose, it turns out to not actually
be that hard to write after some startup pains. All the JITing of
expressions that exists so far was basically written in ~10 hours.

This also needs some heuristics about when JITing is
appropriate. Compiling an expression that's only executed once is never
going to be faster than doing the interpretation (it at least needs a
writable allocation for the code, and then a remap to make that code
read-only and executable).  A trace based approach (everything executed
at least a thousand times) or cost based (all queries costing more than
100000 should be JITed) could make sense.
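A trace based trigger could be as simple as the following hypothetical
sketch (exec_count, llvm_compile_expr() and interpret_expr() are
invented names, not actual API):

    typedef struct ExprJitState ExprJitState;
    typedef Datum (*ExprEvalFn) (ExprJitState *state,
                                 ExprContext *econtext, bool *isnull);

    struct ExprJitState
    {
        uint64      exec_count;     /* how often this expression ran */
        ExprEvalFn  jit_func;       /* set once compiled, else NULL */
    };

    static Datum
    exec_expr_maybe_jit(ExprJitState *state, ExprContext *econtext,
                        bool *isnull)
    {
        /* compile once the expression has proven itself hot */
        if (state->jit_func == NULL && ++state->exec_count >= 1000)
            state->jit_func = llvm_compile_expr(state);     /* invented */

        if (state->jit_func != NULL)
            return state->jit_func(state, econtext, isnull);
        return interpret_expr(state, econtext, isnull);     /* invented */
    }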


It's worthwhile to note that at the moment this is a per-query-execution
JIT, not something that can trivially be cached for prepared
statements. That'll need further infrastructure.


0015  Super-Heavily-WIP: LLVM perf integration.

This very very very preliminary patch (including some copy-pasted GPL
code!) creates /tmp/perf-<pid>.map files, which allows perf to show
useful symbols for profile hits in JITed expressions.  I plan to push
this towards LLVM, so this isn't something PG will have to do, but it's
helpful for evaluation.


I eventually plan to start separate threads about some of the parts in
here, but I think the overall picture needs some discussion first.


Q: Why LLVM and not a hand-rolled JIT?
A: Because hand-rolling a JIT is probably hard to scale to multiple
   maintainers, and multiple platforms. I started down the path of doing
   a hand-rolled x86 JIT, and that'd also be doable (faster compilation,
   slower execution basically); but I doubt we'd end up having that on
   different architectures and platforms. Not to speak of things like
   proper debugger and profiler integration.  I'm not entirely convinced
   that that's the right path. It might also be a transitional step,
   towards doing our completely own JIT. But I think it's a sensible
   step.
 

Q: Why LLVM and not $jit-toolkit?
A: Because all the other JIT stuff I looked at was either really
   unportable (mostly x86 linux only), inconveniently licensed (like
   e.g. gcc's jit library) or nearly unmaintained (luajit's stuff for
   example).  I might have missed something, but ISTM that atm the
   choice is between hand-rolling and using LLVM.
 

Q: Does this actually inline functions from the backend?
A: No. That probably is something desirable in the future, but to me
   that seems like it should be a separate step. The current one's big
   enough. It also further increases compilation times, so quite
   possibly we only want to do so based on another set of heuristics.
 

Q: ?

Comments? Questions?

Regards,

Andres

[1] https://archives.postgresql.org/message-id/20161030073655.rfa6nvbyk4w2kkpk%40alap3.anarazel.de
[2] https://www.postgresql.org/message-id/20160523005327.v2tr7obytitxcnna@alap3.anarazel.de



Re: WIP: Faster Expression Processing and Tuple Deforming (including JIT)

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> I'm posting a quite massive series of WIP patches here, to get some
> feedback.

I guess the $64 question that has to be addressed here is whether we're
prepared to accept LLVM as a run-time dependency.  There are some reasons
why we might not be:

* The sheer mass of the dependency.  What's the installed footprint of
LLVM, versus a Postgres server?  How hard is it to install from source?

* How will we answer people who say they can't accept having a compiler
installed on their production boxes for security reasons?

* Are there any currently-interesting platforms that LLVM doesn't work
for?  (I'm worried about RISC-V as much as legacy systems.)


I concur with your feeling that hand-rolled JIT is right out.  But
I'm not sure that whatever performance gain we might get in this
direction is worth the costs.
        regards, tom lane



Re: WIP: Faster Expression Processing and Tuple Deforming (including JIT)

From
Robert Haas
Date:
On Tue, Dec 6, 2016 at 1:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
>> I'm posting a quite massive series of WIP patches here, to get some
>> feedback.
>
> I guess the $64 question that has to be addressed here is whether we're
> prepared to accept LLVM as a run-time dependency.  There are some reasons
> why we might not be:
>
> * The sheer mass of the dependency.  What's the installed footprint of
> LLVM, versus a Postgres server?  How hard is it to install from source?
>
> * How will we answer people who say they can't accept having a compiler
> installed on their production boxes for security reasons?
>
> * Are there any currently-interesting platforms that LLVM doesn't work
> for?  (I'm worried about RISC-V as much as legacy systems.)

I think anything that requires LLVM -- or, for that matter, anything
that does JIT by any means -- has got to be optional.  But I don't
think --with-llvm as a compile option is inherently problematic.
Also, I think this is probably a direction we need to go.  I've heard
at least one and maybe several PGCon presentations about people JITing
tuple deformation and getting big speedups, and I'd like to finally
hear one from somebody who intends to integrate that into PostgreSQL.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP: Faster Expression Processing and Tuple Deforming (including JIT)

From
Andres Freund
Date:
Hi,


On 2016-12-06 13:56:28 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I'm posting a quite massive series of WIP patches here, to get some
> > feedback.
> 
> I guess the $64 question that has to be addressed here is whether we're
> prepared to accept LLVM as a run-time dependency.  There are some reasons
> why we might not be:

Indeed.  It'd only be a soft dependency obviously.


> * The sheer mass of the dependency.  What's the installed footprint of
> LLVM, versus a Postgres server?  How hard is it to install from source?

Worked for me first try, but I'm perhaps not the best person to judge.
It does take a while to compile though (~20min on my laptop).


> * How will we answer people who say they can't accept having a compiler
> installed on their production boxes for security reasons?

I think they'll just not enable --with-llvm in that case, and get
inferior performance. Note that installing llvm does not imply
installing a full blown C compiler (although I think that's largely
moot, you can get pretty much the same things done with just compiling
LLVM IR).


> * Are there any currently-interesting platforms that LLVM doesn't work
> for?  (I'm worried about RISC-V as much as legacy systems.)

LLVM itself I don't think is a problem, it seems to target a wide range
of platforms. The platforms that don't support JIT compiling might be a
bit larger, since that involves more than just generating code.


> I concur with your feeling that hand-rolled JIT is right out.  But
> I'm not sure that whatever performance gain we might get in this
> direction is worth the costs.

Well, I'm not impartial, but I don't think we do our users a service by
leaving significant speedups untackled, and after spending a *LOT* of
time on this, I don't see much other choice than JITing. Note that
nearly everything performance sensitive is moving towards doing JITing
in some form or another.


Greetings,

Andres Freund



Re: WIP: Faster Expression Processing and Tuple Deforming (including JIT)

From
Andres Freund
Date:
On 2016-12-06 14:04:09 -0500, Robert Haas wrote:
> I've heard at least one and maybe several PGCon presentations about
> people JITing tuple deformation and getting big speedups, and I'd like
> to finally hear one from somebody who intends to integrate that into
> PostgreSQL.

I certainly want to.



Re: WIP: Faster Expression Processing and Tuple Deforming (including JIT)

From
Andres Freund
Date:
On 2016-12-06 11:10:59 -0800, Andres Freund wrote:
> > * Are there any currently-interesting platforms that LLVM doesn't work
> > for?  (I'm worried about RISC-V as much as legacy systems.)
> 
> LLVM itself I don't think is a problem, it seems to target a wide range
> of platforms. The platforms that don't support JIT compiling might be a
> bit larger, since that involves more than just generating code.

The OS specific part is handling the executable format. The JIT we'd be
using (MCJIT) has support for ELF, MachO, and COFF. The architecture
specific bits seem to be there for x86, arm (little endian, be), aarch64
(arm 64 bits be/le again), mips, ppc64.

Somebody is working on RISC-V support for llvm (i.e. it appears to be
working, but is not merged) - but given it's not integrated into gcc
either, I'm not seeing that being an argument.

Greetings,

Andres Freund



Re: WIP: Faster Expression Processing and Tuple Deforming (including JIT)

From
Robert Haas
Date:
On Tue, Dec 6, 2016 at 2:10 PM, Andres Freund <andres@anarazel.de> wrote:
>> * The sheer mass of the dependency.  What's the installed footprint of
>> LLVM, versus a Postgres server?  How hard is it to install from source?
>
> Worked for me first try, but I'm perhaps not the best person to judge.
> It does take a while to compile though (~20min on my laptop).

Presumably this is going to need to be something that a user can get
via yum install <blah> or apt-get install <blah> on common systems.

I wonder how feasible it would be to make this a run-time dependency
rather than a compile option.  That's probably overcomplicating
things, but...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: WIP: Faster Expression Processing and Tuple Deforming (including JIT)

From
Andres Freund
Date:
On 2016-12-06 15:13:21 -0500, Robert Haas wrote:
> Presumably this is going to need to be something that a user can get
> via yum install <blah> or apt-get install <blah> on common systems.

Right. apt-get install llvm-dev (or llvm-3.9-dev or such if you want to
install a specific version), does the trick here.

It's a bit easier to develop with a hand compiled version, because then
LLVM adds a boatload of asserts to its IR builder, which catches a fair
amount of mistakes. Nothing you'd run in production though (just like
you don't use a cassert build...).


> I wonder how feasible it would be to make this a run-time dependency
> rather than a compile option.  That's probably overcomplicating
> things, but...

I don't think that's feasible at all unfortunately - the compiler IR
(which then is JITed by LLVM) is generated via another C API.  We could
rebuild that one, but that'd be a lot of work.

Andres



Re: WIP: Faster Expression Processing and Tuple Deforming (including JIT)

From
Nico Williams
Date:
On Tue, Dec 06, 2016 at 01:56:28PM -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I'm posting a quite massive series of WIP patches here, to get some
> > feedback.
> 
> I guess the $64 question that has to be addressed here is whether we're
> prepared to accept LLVM as a run-time dependency.  There are some reasons
> why we might not be:
> 
> * The sheer mass of the dependency.  What's the installed footprint of
> LLVM, versus a Postgres server?  How hard is it to install from source?

As long as it's optional, does this matter?

A bigger concern might be interface stability.  IIRC the LLVM C/C++
interfaces are not very stable, but bitcode is.

> * How will we answer people who say they can't accept having a compiler
> installed on their production boxes for security reasons?

You don't need the front-ends (e.g., clang) installed in order to JIT.

> * Are there any currently-interesting platforms that LLVM doesn't work
> for?  (I'm worried about RISC-V as much as legacy systems.)

The *BSDs support more platforms than LLVM does, that's for sure.
(NetBSD supports four more, IIRC, including ia64.) But the patches make
LLVM optional anyways, so this should be a non-issue.

> I concur with your feeling that hand-rolled JIT is right out.  But

Yeah, that way lies maintenance madness.

> I'm not sure that whatever performance gain we might get in this
> direction is worth the costs.

Byte-/bit-coding query plans then JITting them is very likely to improve
performance significantly.  Whether you want the maintenance overhead is
another story.

Sometimes byte-coding + interpretation yields a significant improvement
by reducing cache pressure on the icache and the size of the program to
be interpreted.  Having the option to JIT or not JIT might be useful.

Nico
-- 



Re: WIP: Faster Expression Processing and Tuple Deforming (including JIT)

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2016-12-06 13:56:28 -0500, Tom Lane wrote:
>> I guess the $64 question that has to be addressed here is whether we're
>> prepared to accept LLVM as a run-time dependency.  There are some reasons
>> why we might not be:

> Indeed.  It'd only be a soft dependency obviously.

Oh, so we'd need to maintain both the LLVM and the traditional expression
execution code?  That seems like a bit of a pain, but maybe we can live
with it.

>> * How will we answer people who say they can't accept having a compiler
>> installed on their production boxes for security reasons?

> I think they'll just not enable --with-llvm in that case, and get
> inferior performance. Note that installing llvm does not imply
> installing a full blown C compiler (although I think that's largely
> moot, you can get pretty much the same things done with just compiling
> LLVM IR).

I'm not entirely thrilled with the idea of this being a configure-time
decision, because that forces packagers to decide for their entire
audience whether it's okay to depend on LLVM.  That would be an untenable
position to put e.g. Red Hat's packagers in: either they screw the people
who want performance or they screw the people who want security.

I think it'd be all right if we can build this so that the direct
dependency on LLVM is confined to a separately-packageable extension.
That way, a packager can produce a core postgresql-server package
that does not require LLVM, plus a postgresql-llvm package that does,
and the "no compiler please" crowd simply doesn't install the latter
package.

The alternative would be to produce two independent builds of the
server, which I suppose might be acceptable but it sure seems like
a kluge, or at least something that simply wouldn't get done by
most vendors.
        regards, tom lane



Re: WIP: Faster Expression Processing and Tuple Deforming (including JIT)

From
Andres Freund
Date:
Hi,

On 2016-12-06 14:19:21 -0600, Nico Williams wrote:
> A bigger concern might be interface stability.  IIRC the LLVM C/C++
> interfaces are not very stable, but bitcode is.

The C API is a lot more stable than the C++ bit, that's the primary
reason I ended up using it, despite the C++ docs being better.


> > I concur with your feeling that hand-rolled JIT is right out.  But
> 
> Yeah, that way lies maintenance madness.

I'm not quite that sure about that. I had a lot of fun doing some
hand-rolled x86 JITing. Not that that is a ward against me being mad.  But
more seriously: Manually doing a JIT gives you a lot faster compilation
times, which makes JIT applicable in a lot more situations.


> > I'm not sure that whatever performance gain we might get in this
> > direction is worth the costs.
> 
> Byte-/bit-coding query plans then JITting them is very likely to improve
> performance significantly.

Note that what I'm proposing is a far cry away from that - this converts
two significant subsystems (performance wise two, size wise one) to be
JITable, but far from all of the executor.  I think there's some more low
hanging fruit (particularly aggregate transition functions), but
converting everything seems to hit the wrong spot in the
benefit/effort/maintainability triangle.

- Andres



Re: WIP: Faster Expression Processing and Tuple Deforming (including JIT)

From
Nico Williams
Date:
On Tue, Dec 06, 2016 at 12:27:51PM -0800, Andres Freund wrote:
> On 2016-12-06 14:19:21 -0600, Nico Williams wrote:
> > A bigger concern might be interface stability.  IIRC the LLVM C/C++
> > interfaces are not very stable, but bitcode is.
> 
> The C API is a lot more stable than the C++ bit, that's the primary
> reason I ended up using it, despite the C++ docs being better.

Ah.

> > > I concur with your feeling that hand-rolled JIT is right out.  But
> > 
> > Yeah, that way lies maintenance madness.
> 
> I'm not quite that sure about that. I had a lot of fun doing some
> hand-rolled x86 JITing. Not that that is a ward against me being mad.  But
> more seriously: Manually doing a JIT gives you a lot faster compilation
> times, which makes JIT applicable in a lot more situations.

What I meant is that each time there are new ISA extensions, or
differences in how relevant/significant different implementations of the
same ISA implement certain instructions, and/or every time you want to
add a new architecture... someone has to do a lot of very low-level
work.

> > > I'm not sure that whatever performance gain we might get in this
> > > direction is worth the costs.
> > 
> > Byte-/bit-coding query plans then JITting them is very likely to improve
> > performance significantly.
> 
> Note that what I'm proposing is a far cry away from that - this converts
> two significant subsystems (performance wise two, size wise one) to be
> JITable, but far from all of the executor.  I think there's some more low

Yes, I know.

> hanging fruit (particularly aggregate transition functions), but
> converting everything seems to hit the wrong spot in the
> benefit/effort/maintainability triangle.

Maybe?  At least with the infrastructure in place for it someone might
try it and see.

Nico
-- 



Re: WIP: Faster Expression Processing and Tuple Deforming (including JIT)

From
Andres Freund
Date:
On 2016-12-06 15:25:44 -0500, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2016-12-06 13:56:28 -0500, Tom Lane wrote:
> >> I guess the $64 question that has to be addressed here is whether we're
> >> prepared to accept LLVM as a run-time dependency.  There are some reasons
> >> why we might not be:
>
> > Indeed.  It'd only be a soft dependency obviously.
>
> Oh, so we'd need to maintain both the LLVM and the traditional expression
> execution code?  That seems like a bit of a pain, but maybe we can live
> with it.

Yea, that's why I converted the "traditional" expression evaluation into
a different format first - that way the duplication is a lot
lower. E.g. scalar var eval looks like:
    EEO_CASE(EEO_INNER_VAR):
        {
            int attnum = op->d.var.attnum;

            Assert(op->d.var.attnum >= 0);
            *op->resnull = innerslot->tts_isnull[attnum];
            *op->resvalue = innerslot->tts_values[attnum];
            EEO_DISPATCH(op);
        }

in normal evaluation and like

        case EEO_INNER_VAR:
            {
                LLVMValueRef value, isnull;
                LLVMValueRef v_attnum;

                v_attnum = LLVMConstInt(LLVMInt32Type(), op->d.var.attnum, false);
                value = LLVMBuildLoad(builder, LLVMBuildGEP(builder, v_innervalues, &v_attnum, 1, ""), "");
                isnull = LLVMBuildLoad(builder, LLVMBuildGEP(builder, v_innernulls, &v_attnum, 1, ""), "");
                LLVMBuildStore(builder, value, v_resvaluep);
                LLVMBuildStore(builder, isnull, v_resnullp);

                LLVMBuildBr(builder, opblocks[i + 1]);
                break;
            }

for JITed evaluation.


> I'm not entirely thrilled with the idea of this being a configure-time
> decision, because that forces packagers to decide for their entire
> audience whether it's okay to depend on LLVM.  That would be an untenable
> position to put e.g. Red Hat's packagers in: either they screw the people
> who want performance or they screw the people who want security.

Hm. I've a bit of a hard time buying the security argument here. Having
LLVM (not clang!) installed doesn't really change the picture that
much. In either case you can install binaries, and you're very likely
already using some program that does JIT internally. And postgres itself
gives you plenty of ways to execute arbitrary code as superuser.

The argument for not installing a C compiler seems to be that it makes
it less convenient to build an executable. I doubt that having a C(++)
library for code generation is convenient enough to change the picture
there.

> I think it'd be all right if we can build this so that the direct
> dependency on LLVM is confined to a separately-packageable extension.
> That way, a packager can produce a core postgresql-server package
> that does not require LLVM, plus a postgresql-llvm package that does,
> and the "no compiler please" crowd simply doesn't install the latter
> package.

That should be possible, but I'm not sure it's worth the effort. The JIT
infrastructure will need resowner integration and such. We can obviously
split things so that part is independent of LLVM, but I'm unconvinced
that the benefit is large enough.


> The alternative would be to produce two independent builds of the
> server, which I suppose might be acceptable but it sure seems like
> a kluge, or at least something that simply wouldn't get done by
> most vendors.

Hm. We could make that a make target ourselves ;)

Regards,

Andres



Re: WIP: Faster Expression Processing and Tuple Deforming (including JIT)

From
Andres Freund
Date:
On 2016-12-06 14:35:43 -0600, Nico Williams wrote:
> On Tue, Dec 06, 2016 at 12:27:51PM -0800, Andres Freund wrote:
> > On 2016-12-06 14:19:21 -0600, Nico Williams wrote:
> > > > I concur with your feeling that hand-rolled JIT is right out.  But
> > > 
> > > Yeah, that way lies maintenance madness.
> > 
> > I'm not quite that sure about that. I had a lot of fun doing some
> > hand-rolled x86 JITing. Not that that is a ward against me being mad.  But
> > more seriously: Manually doing a JIT gives you a lot faster compilation
> > times, which makes JIT applicable in a lot more situations.
> 
> What I meant is that each time there are new ISA extensions, or
> differences in how relevant/significant different implementations of the
> same ISA implement certain instructions, and/or every time you want to
> add a new architecture... someone has to do a lot of very low-level
> work.

Yea, that's why I didn't pursue this path further. I *personally* think
it'd be perfectly fine to only support JITing on linux x86_64 and
aarch64 for now. And those I'd be willing to work on. But since I know
that's not project policy...

- Andres



Re: WIP: Faster Expression Processing and Tuple Deforming (including JIT)

From
Nico Williams
Date:
On Tue, Dec 06, 2016 at 12:36:41PM -0800, Andres Freund wrote:
> On 2016-12-06 15:25:44 -0500, Tom Lane wrote:
> > I'm not entirely thrilled with the idea of this being a configure-time
> > decision, because that forces packagers to decide for their entire
> > audience whether it's okay to depend on LLVM.  That would be an untenable
> > position to put e.g. Red Hat's packagers in: either they screw the people
> > who want performance or they screw the people who want security.

There's no security issue.  The dependency is on LLVM libraries, not
LLVM front-ends (e.g., clang(1)).

I don't think there's a real issue as to distros/packagers/OS vendors.
They already have to package LLVM, and they already package LLVM
libraries separately from LLVM front-ends.

> The argument for not install a c compiler seems to be that it makes it
> less convenient to build an executable. I doubt that having a C(++)
> library for code generation is convenient enough to change the picture
> there.

The security argument goes back to the days of the Morris worm, which
depended on having developer tools (specifically in that case, ld(1),
the link-editor).  But JIT via LLVM won't give hackers a way to generate
or link arbitrary object code.

Nico
-- 



Re: WIP: Faster Expression Processing and Tuple Deforming (including JIT)

From
Peter Geoghegan
Date:
On Mon, Dec 5, 2016 at 7:49 PM, Andres Freund <andres@anarazel.de> wrote:
> I tried to address 2) by changing the C implementation. That brings some
> measurable speedups, but it's not huge. A bigger speedup is making
> slot_getattr, slot_getsomeattrs, slot_getallattrs very trivial wrappers;
> but it's still not huge.  Finally I turned to just-in-time (JIT)
> compiling the code for tuple deforming. That doesn't save the cost of
> 1), but it gets rid of most of 2) (from ~15% to ~3% in TPCH-Q01).  The
> first part is done in 0008, the JITing in 0012.

A more complete motivating example would be nice. For example, it
would be nice to see the overall speedup for some particular TPC-H
query.


-- 
Peter Geoghegan



Re: WIP: Faster Expression Processing and Tuple Deforming (including JIT)

From
Andres Freund
Date:
On 2016-12-06 13:27:14 -0800, Peter Geoghegan wrote:
> On Mon, Dec 5, 2016 at 7:49 PM, Andres Freund <andres@anarazel.de> wrote:
> > I tried to address 2) by changing the C implementation. That brings some
> > measurable speedups, but it's not huge. A bigger speedup is making
> > slot_getattr, slot_getsomeattrs, slot_getallattrs very trivial wrappers;
> > but it's still not huge.  Finally I turned to just-in-time (JIT)
> > compiling the code for tuple deforming. That doesn't save the cost of
> > 1), but it gets rid of most of 2) (from ~15% to ~3% in TPCH-Q01).  The
> > first part is done in 0008, the JITing in 0012.
>
> A more complete motivating example would be nice. For example, it
> would be nice to see the overall speedup for some particular TPC-H
> query.

Well, it's a bit WIP-y for that - not all TPCH queries run JITed yet, as
I've not done that for enough expression types... And you run quickly
into other bottlenecks.

But here we go for TPCH (scale 10) Q01:
master:
Time: 33885.381 ms

profile:
 16.29%  postgres  postgres          [.] slot_getattr
 12.85%  postgres  postgres          [.] ExecMakeFunctionResultNoSets
 10.85%  postgres  postgres          [.] advance_aggregates
  6.91%  postgres  postgres          [.] slot_deform_tuple
  6.70%  postgres  postgres          [.] advance_transition_function
  4.59%  postgres  postgres          [.] ExecProject
  4.25%  postgres  postgres          [.] float8_accum
  3.69%  postgres  postgres          [.] tuplehash_insert
  2.39%  postgres  postgres          [.] float8pl
  2.20%  postgres  postgres          [.] bpchareq
  2.03%  postgres  postgres          [.] check_stack_depth

(note that all expression evaluated things are distributed among many
functions)

dev (no jiting):
Time: 30343.532 ms

profile:
 16.57%  postgres  postgres          [.] slot_deform_tuple
 13.39%  postgres  postgres          [.] ExecEvalExpr
  8.64%  postgres  postgres          [.] advance_aggregates
  8.58%  postgres  postgres          [.] advance_transition_function
  5.83%  postgres  postgres          [.] float8_accum
  5.14%  postgres  postgres          [.] tuplehash_insert
  3.89%  postgres  postgres          [.] float8pl
  3.60%  postgres  postgres          [.] slot_getattr
  2.66%  postgres  postgres          [.] bpchareq
  2.56%  postgres  postgres          [.] heap_getnext

dev (jiting):
SET jit_tuple_deforming = on;
SET jit_expressions = true;

Time: 24439.803 ms

profile:
 11.11%  postgres  postgres             [.] slot_deform_tuple
 10.87%  postgres  postgres             [.] advance_aggregates
  9.74%  postgres  postgres             [.] advance_transition_function
  6.53%  postgres  postgres             [.] float8_accum
  5.25%  postgres  postgres             [.] tuplehash_insert
  4.31%  postgres  perf-10698.map       [.] deform0
  3.68%  postgres  perf-10698.map       [.] evalexpr6
  3.53%  postgres  postgres             [.] slot_getattr
  3.41%  postgres  postgres             [.] float8pl
  2.84%  postgres  postgres             [.] bpchareq

(note how expression eval went from 13.39% to roughly 4%)

The slot_deform_cost here is primarily cache misses. If you do the
"memory order" iteration, it drops significantly.

The JIT generated code still leaves a lot on the table, i.e. this is
definitely not the best we can do.  We also deform half the tuple twice,
because I've not yet added support for starting to deform in the middle
of a tuple.

Independent of new expression evaluation and/or JITing, if you make
advance_aggregates and advance_transition_function inline functions (or
you do profiling accounting for children), you'll notice that ExecAgg()
+ advance_aggregates + advance_transition_function themselves take up
about 20% cpu-time.  That's *not* including the hashtable management,
the actual transition functions, and such themselves.


If you have queries where tuple deforming is a bigger proportion of the
load, or where expression evaluation (including projection) is a larger
part (any NULLs e.g.), you can get a lot bigger wins, even without
actually optimizing the generated code (which I've not yet done).

Just btw: float8_accum really should use an internal aggregation type
instead of using a postgres array...
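Hypothetically, such an internal transition type could be as simple as
the struct below (float8_accum currently keeps { N, sum(X), sum(X*X) }
in a float8[3] array, i.e. pays array deconstruction/construction on
every transition call; the struct itself is invented):

    /* hypothetical internal transition state, avoiding the float8[3] array */
    typedef struct Float8AccumState
    {
        int64       N;          /* number of input rows */
        float8      sumX;       /* sum of inputs */
        float8      sumX2;      /* sum of squares of inputs */
    } Float8AccumState;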


Andres



Re: WIP: Faster Expression Processing and Tuple Deforming (including JIT)

From
Craig Ringer
Date:
On 7 December 2016 at 04:13, Robert Haas <robertmhaas@gmail.com> wrote:

> I wonder how feasible it would be to make this a run-time dependency
> rather than a compile option.

Or something that's compiled with the server, but produces a separate
.so that's the only thing that links to LLVM. So packagers can avoid a
dependency on LLVM for postgres.

I suspect it wouldn't be worth the complexity, the added indirection
necessary, etc. If you're using packages then pulling in LLVM isn't a
big deal. If you're not, then don't use --with-llvm .

-- 
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: WIP: Faster Expression Processing and Tuple Deforming (including JIT)

From
Craig Ringer
Date:
On 7 December 2016 at 14:39, Craig Ringer <craig@2ndquadrant.com> wrote:
> On 7 December 2016 at 04:13, Robert Haas <robertmhaas@gmail.com> wrote:
>
>> I wonder how feasible it would be to make this a run-time dependency
>> rather than a compile option.
>
> Or something that's compiled with the server, but produces a separate
> .so that's the only thing that links to LLVM. So packagers can avoid a
> dependency on LLVM for postgres.

Ahem, next time I'll finish the thread first. Nevermind.

-- 
Craig Ringer                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



On Tue, Dec 6, 2016 at 9:19 AM, Andres Freund <andres@anarazel.de> wrote:
> 0009  WIP: Add minimal keytest implementation.
>
> More or less experimental patch that tries to implement simple
> expressions of the form OpExpr(ScalarVar, Const) as a single expression
> evaluation step.  The benefits probably aren't big enough if we do end
> up doing JITing of expressions.

Seems like we are trying to achieve the same thing with the 'heap scan
key push down' patch [1] as well. But I think with this patch you are
covering OpExpr(ScalarVar, Const) for all the cases, whereas with [1] we
are currently only doing it for seqscan, and we are trying to make that
generic for other nodes as well.

So do you see any advantage of continuing [1]?  Is there something
extra we can achieve with [1] that we cannot get with this patch?

[1] https://www.postgresql.org/message-id/CAFiTN-takT6Z4s3tGDwyC9bhYf%2B1gumpvW5bo_fpeNUy%2BrL-kg%40mail.gmail.com

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Andres,


> dev (no jiting):
> Time: 30343.532 ms


> dev (jiting):
> SET jit_tuple_deforming = on;
> SET jit_expressions = true;

> Time: 24439.803 ms

FYI, ~20% improvement for TPCH Q1 is consistent with what we find when we only JIT expressions.

Cheers,
-cktan


Re: [HACKERS] WIP: Faster Expression Processing and Tuple Deforming (including JIT)

From
Andres Freund
Date:
Hi,

On 2016-12-12 18:11:13 -0800, CK Tan wrote:
> Andres,

> > dev (no jiting):
> > Time: 30343.532 ms
> 
> > dev (jiting):
> > SET jit_tuple_deforming = on;
> > SET jit_expressions = true;
> >
> > Time: 24439.803 ms
> 
> FYI, ~20% improvement for TPCH Q1 is consistent with what we find when we
> only jit expression.

For Q1 I think the bigger win is JITing the transition function
invocation in advance_aggregates/transition_function - that's IIRC where
the biggest bottleneck lies.

If you have any details about your JITing experience that you're willing
to talk about ...

Regards,

Andres



On Mon, Dec 12, 2016 at 6:14 PM, Andres Freund <andres@anarazel.de> wrote:
>
>
> For Q1 I think the bigger win is JITing the transition function
> invocation in advance_aggregates/transition_function - that's IIRC where
> the biggest bottleneck lies.
>

Yeah, we bundle the agg core into our expr work... no point otherwise since we do
it for OLAP.

As for experience, I think you have found out for yourself. There is a lot that
can be done and heuristics are involved in many places to decide whether
to jit fully, partially, or not at all.  But it looks like you have a solid basis now
to proceed and explore the beyond :-)

Send me private email if you have a particular question.

Regards,
-cktan

Re: [HACKERS] WIP: Faster Expression Processing and Tuple Deforming (including JIT)

From
Bruce Momjian
Date:
On Tue, Dec  6, 2016 at 11:10:59AM -0800, Andres Freund wrote:
> > I concur with your feeling that hand-rolled JIT is right out.  But
> > I'm not sure that whatever performance gain we might get in this
> > direction is worth the costs.
> 
> Well, I'm not impartial, but I don't think we do our users a service by
> leaving significant speedups untackled, and after spending a *LOT* of
> time on this, I don't see much other choice than JITing. Note that
> nearly everything performance sensitive is moving towards doing JITing
> in some form or another.

Agreed, we don't really have a choice.  After all the optimizations we
have done to so many subsystems, our executor is relatively slow and is
a major drag on our system, specifically for long-running queries.  The
base problem with the executor is the state machines at so many levels,
and we really can't optimize that while keeping a reasonable maintenance
burden.

This is where JIT and LLVM help.  I outlined two external projects that
were researching this in this blog entry:
http://momjian.us/main/blogs/pgblog/2016.html#April_1_2016

I am excited to now be seeing WIP code.

-- 
Bruce Momjian  <bruce@momjian.us>        http://momjian.us
EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



[HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:
Hi,

Here's an updated version of this patchset (now that SRFs are
implemented "properly").  This is a large patchset, with later patches
being more experimental than earlier ones.

0001: Add expression dependencies on composite type / whole row components.
  This patch later allows removing some execution time checks (in
  ExecEvalScalarVar / ExecEvalFieldSelect), which'd be inconvenient to
  do when JITed.

0002: Make get_last_attnums more generic.
  What it says.

0003: autoconf test for computed goto support.
  The faster expression evaluation implementation uses computed gotos
  when available. So we need to check whether the current compiler
  supports those.

0004: WIP: Faster expression processing and targetlist projection.
  The most important patch of the series. Changes expression evaluation
  from being tree-walk based as done in execQual.c, into being a series
  of instructions.  Then adds an opcode dispatch based interpreter for
  that mini language.  That allows to later add different
  implementations.  Projection is now a special case for expression
  evaluation.

0005: WIP: Add configure infrastructure to enable LLVM.
0006: WIP: Beginning of a LLVM JIT infrastructure.
  Needs work.

0007: WIP: JITed expression evaluation.
  This finally performs runtime compilation of expressions.


I think up to 0004 things are getting pretty close to being ready.
There's some minor FIXMEs that need to addressed (primarily around
CaseTestExpr/CoerceToDomainValue when invoked outside of plans), and
some more comment work.

The biggest remaining piece of work in the whole series is 0006. Right
now compiled expressions can be leaked, which is obviously not ok. So I
need to write ResourceOwner integration.

0007 covers nearly all types of expressions/queries (function usage
isn't supported right now, and there's two missing calls to
MakeExpandedObjectReadOnly).  There's a lot more that could be done for
efficiency (generating smarter LLVM IR basically), but we already see
quite significant performance wins.


I have later patches that
- centralizes tuple deforming logic into slot_deform_logic from
  slot_getattr, slot_getsomeattrs, slot_getallattrs and such
- add JITing of tuple deforming
- add JITing of aggregation transitions
  this works by converting transition functions into expression opcodes,
  and then adding JIT support for those new instructions
  (EEO_AGG_FILTER / STRICT_INPUT_CHECK / INIT_TRANS / STRICT_TRANS_CHECK
  / PLAIN_TRANS / ORDERED_TRANS_DATUM / ORDERED_TRANS_TUPLE)

but the above 7 patches already are large enough imo.


For now 0004 is I think the patch that really could use a look by
somebody other than me - it's quite a radical departure from how things
currently look in master, and it's a fairly large patch.  I don't really
see a good way to break it down into useful parts :( (the projection
changes are the only thing that I can see usefully being done
separately).

Regards,

Andres


Re: [HACKERS] WIP: Faster Expression Processing and Tuple Deforming (including JIT)

From
Markus Nullmeier
Date:
On 12/06/16 21:40, Andres Freund wrote:
> On 2016-12-06 14:35:43 -0600, Nico Williams wrote:
>> On Tue, Dec 06, 2016 at 12:27:51PM -0800, Andres Freund wrote:
>>> On 2016-12-06 14:19:21 -0600, Nico Williams wrote:
>>>>> I concur with your feeling that hand-rolled JIT is right out.  But
>>>>
>>>> Yeah, that way lies maintenance madness.
>>>
>>> I'm not quite that sure about that. I had a lot of fun doing some
>>> hand-rolled x86 JITing. Not that that is a ward against me being mad.  But
>>> more seriously: Manually doing a JIT gives you a lot faster compilation
>>> times, which makes JIT applicable in a lot more situations.
>>
>> What I meant is that each time there are new ISA extensions, or
>> differences in how relevant/significant different implementations of the
>> same ISA implement certain instructions, and/or every time you want to
>> add a new architecture... someone has to do a lot of very low-level
>> work.
> 
> Yea, that's why I didn't pursue this path further. I *personally* think
> it'd be perfectly fine to only support JITing on linux x86_64 and
> aarch64 for now. And those I'd be willing to work on. But since I know
> that's not project policy...

Well, if this thread of thought about hand-crafted JIT should be taken
up again by someone at some point in time, I guess it could be possible
to reuse tools that are already out there, such as "DynASM"
( http://luajit.org/dynasm_features.html ) from the LuaJIT project
(currently offers x86-64, x86-32, ARM-32, PPC-32, and MIPS-32).
Maybe one could even recycle parts of the LuaJIT project itself
( http://luajit.org/luajit.html , http://article.gmane.org/gmane.comp.lang.lua.general/58908 ).

-- 
Markus Nullmeier            http://www.g-vo.org
German Astrophysical Virtual Observatory (GAVO)




Re: [HACKERS] WIP: Faster Expression Processing and Tuple Deforming (including JIT)

From
Andres Freund
Date:
Hi,

On 2017-02-10 18:18:13 +0100, Markus Nullmeier wrote:
> Well, if this thread of thought about hand-crafted JIT should be taken
> up again by someone at some point in time, I guess it could be possible
> to reuse tools that are already out there, such as "DynASM"
> ( http://luajit.org/dynasm_features.html ) from the LuaJIT project
> (currently offers x86-64, x86-32, ARM-32, PPC-32, and MIPS-32).
> Maybe one could even recycle parts of the LuaJIT project itself
> ( http://luajit.org/luajit.html ,
>   http://article.gmane.org/gmane.comp.lang.lua.general/58908 ).

FWIW, I'd looked at dynasm/luajit.

One big reason to go for LLVM is that it has nearly all the
infrastructure to make backend-functions/operators inlineable.
Especially for some of the arithmetic operations and such, that'd be
quite useful performance-wise.  With LLVM you can just use clang on C to
generate the IR, do some work to boil down the IR modules to the
relevant functions (i.e. remove non sql-callable functions), for which
LLVM has infrastructure, and then inline the functions that way.  That's
a lot harder to do with nearly everything else (save gcc's jit library,
but the licensing and stability situation makes that unattractive).

Greetings,

Andres Freund



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:
Hi,

Attached is an updated version of the patchset, but more importantly
some benchmark numbers.

I ran tpch scale 5, on my laptop with
-c shared_buffers=20GB # all data fits into memory
-c max_parallel_workers_per_gather=0 # reduce variability
-c work_mem=4GB # other suggestions welcome
-c huge_pages=on #reduces variability

My benchmarking script first prewarms the whole database, then runs the
tpch queries in sequence, repeated three times, and compares the shortest
execution times:

master q01 min: 16176.843 dev-llvm-off min: 13153.179 [diff -22.99]
master q02 min: 1280.691 dev-llvm-off min: 1263.032 [diff -1.40]
master q03 min: 7330.737 dev-llvm-off min: 7144.982 [diff -2.60]
master q04 min: 1014.347 dev-llvm-off min: 991.008 [diff -2.36]
master q05 min: 5490.103 dev-llvm-off min: 5439.878 [diff -0.92]
master q06 min: 1916.45 dev-llvm-off min: 1818.839 [diff -5.37]
master q07 min: 5282.129 dev-llvm-off min: 5222.879 [diff -1.13]
master q08 min: 1655.824 dev-llvm-off min: 1532.1 [diff -8.08]
master q09 min: 7009.372 dev-llvm-off min: 6724.515 [diff -4.24]
master q10 min: 6017.01 dev-llvm-off min: 5848.337 [diff -2.88]
master q11 min: 316.724 dev-llvm-off min: 292.51 [diff -8.28]
master q12 min: 4829.502 dev-llvm-off min: 4698.14 [diff -2.80]
master q13 min: 8679.991 dev-llvm-off min: 8427.614 [diff -2.99]
master q14 min: 814.109 dev-llvm-off min: 774.805 [diff -5.07]
master q15 min: 1957.248 dev-llvm-off min: 1841.377 [diff -6.29]
master q16 min: 1976.544 dev-llvm-off min: 1936.932 [diff -2.05]
master q17 min: 559.664 dev-llvm-off min: 446.199 [diff -25.43]
master q18 min: 16565.722 dev-llvm-off min: 15475.372 [diff -7.05]
master q19 min: 385.222 dev-llvm-off min: 310.398 [diff -24.11]
master q20 min: 1717.015 dev-llvm-off min: 1389.064 [diff -23.61]
master q22 min: 753.017 dev-llvm-off min: 637.152 [diff -18.18]

(note that there's a fair amount of per-run variance, I've seen one or
two of the small differences go the other way in previous runs)

It's clearly visible that the performance gain heavily depends on the
type of query.  Which makes sense - expression evaluation isn't a
bottleneck everywhere.  The actual gains in expression evaluation are
larger than the maximum here, but bottlenecks shift.

Besides cleanups the important changes in this version of the patches
are:
- I've added a fastpath for the interpretation of very simple
  expressions (non-sysvar, single column Vars and Consts); before that,
  performance regressed slightly for cases that evaluated a *lot* of
  Vars/Consts. The only case where I could actually reproduce that is in
  large hash-joins where the to-be-hashed value is extracted on its own.
  (See the sketch after this list.)
- I moved the invocation of expression evaluation back to a
  callback. That's better for predicting branches both with JITing and
  when using fastpath functions.
- removed unused EXEC_EVALDEBUG code
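For illustration, such a fastpath could look roughly like the following
hypothetical sketch (the function name and exact step layout are
invented, not the patch's):

    /* hypothetical fastpath for an expression that is a single inner Var */
    static Datum
    exec_just_inner_var(ExprState *state, ExprContext *econtext, bool *isnull)
    {
        int             attnum = state->steps[0].d.var.attnum;
        TupleTableSlot *slot = econtext->ecxt_innertuple;

        /* no dispatch loop at all - just fetch the column */
        *isnull = slot->tts_isnull[attnum];
        return slot->tts_values[attnum];
    }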

- Andres


Re: [HACKERS] WIP: Faster Expression Processing v4

From
Robert Haas
Date:
On Mon, Mar 6, 2017 at 10:01 PM, Andres Freund <andres@anarazel.de> wrote:
> My benchmarking script first prewarms the whole database, then runs the
> tpch queries in sequence, repeated three times, and compares the shortes
> execution time:

Those numbers are stupendous.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Peter Eisentraut
Date:
I looked over
0001-Add-expression-dependencies-on-composite-type-whole-.patch.  That
seems useful independent of the things you have following.

Even though it is presented more as a preparatory change, it appears to
have noticeable user-facing impact, which we should analyze.  For
example, in the regression test, the command
   alter table tt14t drop column f3;

is now rejected, rightly, but the command
   alter table tt14t drop column f3 cascade;

ends up dropping the whole view, which might be a bit surprising.  We
don't support dropping columns from views, so there might be no
alternative, but I wonder what else is changing because of this.  I
think that might deserve a bit more analysis and perhaps some test cases.


--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -206,6 +206,9 @@ static bool object_address_present_add_flags(const ObjectAddress *object,
 static bool stack_address_present_add_flags(const ObjectAddress *object,
                                             int flags,
                                             ObjectAddressStack *stack);
+static void add_type_addresses(Oid typeid, bool include_components, ObjectAddresses *addrs);
+static void add_type_component_addresses(Oid typeid, ObjectAddresses *addrs);
+static void add_class_component_addresses(Oid relid, ObjectAddresses *addrs);
 static void DeleteInitPrivs(const ObjectAddress *object);

For some reason, the function add_object_address() is singular, and
these new functions are plural.  Clearly, they take a plural argument, so
your version seems more correct, but I wonder if we should keep that
consistent.


+                        * e.g. not to the right thing for column-less tables.  Note that

Small typo there.  (to -> do)

@@ -1639,6 +1641,9 @@ find_expr_references_walker(Node *node,
                add_object_address(OCLASS_PROC, funcexpr->funcid, 0,
                                   context->addrs);
+               /* dependency on type itself already exists via function */
+               add_type_component_addresses(funcexpr->funcresulttype,
+                                            context->addrs);
+
                /* fall through to examine arguments */
        }
        else if (IsA(node, OpExpr))

Why shouldn't the function itself also depend on the components of its
return type?

How are argument types handled?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:
Hi,

On 2017-03-07 18:46:31 -0500, Peter Eisentraut wrote:
> I looked over
> 0001-Add-expression-dependencies-on-composite-type-whole-.patch.  That
> seems useful independent of the things you have following.
> 
> Even though it is presented more as a preparatory change, it appears to
> have noticeable user-facing impact, which we should analyze.

Indeed.


> For
> example, in the regression test, the command
> 
>     alter table tt14t drop column f3;
> 
> is now rejected, rightly, but the command
> 
>     alter table tt14t drop column f3 cascade;
> 
> ends up dropping the whole view, which might be a bit surprising.  We
> don't support dropping columns from views, so there might be no
> alternative

It seems strictly better than silently breaking the view.


> , but I wonder what else is changing because of this.  I
> think that might deserve a bit more analysis and perhaps some test cases.

There are already some tests. More is probably good - if you have
specific cases in mind...


> --- a/src/backend/catalog/dependency.c
> +++ b/src/backend/catalog/dependency.c
> @@ -206,6 +206,9 @@ static bool object_address_present_add_flags(const
> ObjectAddress *object,
>  static bool stack_address_present_add_flags(const ObjectAddress *object,
>                                                                 int flags,
> 
> ObjectAddressStack *stack);
> +static void add_type_addresses(Oid typeid, bool include_components,
> ObjectAddresses *addrs);
> +static void add_type_component_addresses(Oid typeid, ObjectAddresses
> *addrs);
> +static void add_class_component_addresses(Oid relid, ObjectAddresses
> *addrs);
>  static void DeleteInitPrivs(const ObjectAddress *object);
> 
> For some reason, the function add_object_address() is singular, and
> these new functions are plural.  Clearly, they take a plural argument, so
> your version seems more correct, but I wonder if we should keep that
> consistent.

I named them plural, because add_object_address only adds a single new
entry, but add_type_addresses etc potentially add multiple ones.  So I
think plural/singular as used here is actually consistent?


> +                        * e.g. not to the right thing for column-less
> tables.  Note that
> 
> Small typo there.  (to -> do)

Oops.


> @@ -1639,6 +1641,9 @@ find_expr_references_walker(Node *node,
> 
>                 add_object_address(OCLASS_PROC, funcexpr->funcid, 0,
>                                                    context->addrs);
> +               /* dependency on type itself already exists via function */
> +               add_type_component_addresses(funcexpr->funcresulttype,
> context->addrs);
> +
>                 /* fall through to examine arguments */
>         }
>         else if (IsA(node, OpExpr))
> 
> Why shouldn't the function itself also depend on the components of its
> return type?

Because that'd make it impossible to change the return type components -
if the return type is baked into the view that's necessary, but for a
"freestanding function" it's not.  If you e.g. have a function that just
returns a table's rows, it'd certainly be annoying if that'd prevent you
from dropping columns.
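
To spell out that scenario (a sketch, not taken from the regression
tests):

    create table t (a int, b int);
    create function f() returns setof t language sql as 'select * from t';
    alter table t drop column b;  -- shouldn't start failing just because f() exists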


> How are argument types handled?

We fall through to

    return expression_tree_walker(node, find_expr_references_walker,
                                  (void *) context);

which'll add a dependency if necessary.  If it's e.g. a table column, a
function return type or such, we'll add a dependency there.


Greetings,

Andres Freund



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Peter Eisentraut
Date:
On 3/7/17 19:14, Andres Freund wrote:
>> Why shouldn't the function itself also depend on the components of its
>> return type?
> Because that'd make it impossible to change the return type components -
> if the return type is baked into the view that's necessary, but for a
> "freestanding function" it's not.  If you e.g. have a function that just
> returns a table's rows, it'd certainly be annoying if that'd prevent you
> from dropping columns.

I think functions breaking when the return type is fiddled with is
actually a not-uncommon problem in practice.  It would be nice if that
could be addressed.  Not necessarily in this patch, but I would like to
keep the options open.  Comments from others would be welcome on this.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Peter Eisentraut
Date:
I have also looked at the 0002 and 0003 patches, and they seem OK, but
they are obviously not of much use without 0004.  What is your ambition
for getting 0004 reviewed and committed for PG10?

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:
On 2017-03-10 08:51:25 -0500, Peter Eisentraut wrote:
> On 3/7/17 19:14, Andres Freund wrote:
> >> Why shouldn't the function itself also depend on the components of its
> >> return type?
> > Because that'd make it impossible to change the return type components -
> > if the return type is baked into the view that's necessary, but for a
> > "freestanding function" it's not.  If you e.g. have a function that just
> > returns a table's rows, it'd certainly be annoying if that'd prevent you
> > from dropping columns.
> 
> I think functions breaking when the return type is fiddled with is
> actually a not-uncommon problem in practice.  It would be nice if that
> could be addressed.

What problem are you thinking of exactly, and what'd be the solution?
I'd guess that people wouldn't like being unable to change columns in a
table if some function returned the type.

One rather extreme way would be to have functions return a "different"
type that's initially identical to the table/composite type.  If the
composite type then changes, we'd transparently map between the actual
return type and the one callers expect.  But that'd be a fair bit of
effort, and presumably also quite confusing for users, who might
e.g. expect a new column to show up.


> Not necessarily in this patch, but I would like to keep the options
> open.

Yea, seems worth thinking about.  I don't think the patch closes down
options?

Greetings,

Andres Freund



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:
On 2017-03-10 09:00:14 -0500, Peter Eisentraut wrote:
> I have also looked at the 0002 and 0003 patches, and they seem OK, but
> they are obviously not of much use without 0004.  What is your ambition
> for getting 0004 reviewed and committed for PG10?

I'd really like to get it in.  The performance improvements on their own
are significant, and it provides the basis for significantly larger
improvements again (JIT) - those followup improvements are large patches
again though, so I'd rather not do all of that next cycle.

My next step (over the weekend) is to add tests to execQual.c to get it
a good chunk closer to 100% test coverage, and then do the same for the
new implementation (there are probably very few additional tests needed
after the conversion).  Given all tests pass before/after, and there's a
lot of them, I think we can have a reasonable confidence of a low bug
density.

Large parts of the changes are fairly mechanical; the interesting bits
probably are:

- There are a lot fewer "full blown" node types for expression evaluation
  anymore.  In most cases there's now only a top-level ExprState node
  that's nodeTag()able.  The actual information to execute an instruction
  now is in ExprState->steps[i].
- Because of the previous point, it now isn't legal anymore to call
  ExecInitExpr() on a list of expressions - ExecInitExprList() is a
  convenience wrapper around manually iterating over the list (gets rid
  of ugly casts, too); see the sketch below.
- Because they behave differently on an actual expression evaluation
  basis, quals / check constraints now have to be initialized with
  ExecInitQual()/ExecInitCheck().  That way the shortcut behaviour and
  such can be retained, while also being faster.
- PlanState->targetlist is gone.  Because expressions aren't initialized
  on their own anymore, but as part of ExecProject, there's no need for
  it.
- The separation between ExecInitExprRec() and
  ExecEvalExpr()/ExecInterpExpr() is different - we try to do as much as
  possible during initialization.
- I retained some ugly hackering around
  innermost_caseval|null/domainval|null - I had started down a path of
  removing the use of those in random parts of the system, but I ended
  up not going for it.
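
As a rough sketch of what the initialization change means for callers
(signatures from memory, so details may differ):

    /* before: lists of expressions were initialized with casts like */
    scanstate->ps.qual = (List *)
        ExecInitExpr((Expr *) node->plan.qual, (PlanState *) scanstate);

    /* after: lists go through the wrapper, quals get their own entry point */
    List *exprstates = ExecInitExprList(node->args, (PlanState *) scanstate);
    scanstate->ps.qual = ExecInitQual(node->plan.qual, (PlanState *) scanstate);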

Greetings,

Andres Freund



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Peter Eisentraut
Date:
On 3/10/17 14:24, Andres Freund wrote:
> What problem are you thinking of exactly, and what'd be the solution?
> I'd guess that people wouldn't like being unable to change columns in a
> table if some function returned the type.

Well, that's pretty much the problem.

For example:

create type t1 as (a int, b int);
create function f1() returns setof t1 language plpgsql
  as $$ begin return query values (1, 2); end $$;
alter type t1 drop attribute b;
select * from f1();  -- fail

The dependency system should arguably guard against that.  Yes, it would
make some things more cumbersome, but, well, it already does that.

>> Not necessarily in this patch, but I would like to keep the options
>> open.
> 
> Yea, seems worth thinking about.  I don't think the patch closes down
> options?

nope

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Peter Eisentraut
Date:
On 3/10/17 14:40, Andres Freund wrote:
> I'd really like to get it in.  The performance improvements on their own
> are significant, and it provides the basis for significantly larger
> improvements again (JIT) - those followup improvements are large patches
> again though, so I'd rather not do all of that next cycle.
> 
> My next step (over the weekend) is to add tests to execQual.c to get it
> a good chunk closer to 100% test coverage, and then do the same for the
> new implementation (there are probably very few additional tests needed
> after the conversion).  Given all tests pass before/after, and there's a
> lot of them, I think we can have a reasonable confidence of a low bug
> density.

That seems like a plan, but now would probably be a good time for some
other hackers to take note of this proposal.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Bruce Momjian
Date:
On Fri, Mar 10, 2017 at 07:15:59PM -0500, Peter Eisentraut wrote:
> On 3/10/17 14:40, Andres Freund wrote:
> > I'd really like to get it in.  The performance improvements on their own
> > are significant, and it provides the basis for significantly larger
> > improvements again (JIT) - those followup improvements are large patches
> > again though, so I'd rather not do all of that next cycle.
> > 
> > My next step (over the weekend) is to add tests to execQual.c to get it
> > a good chunk closer to 100% test coverage, and then do the same for the
> > new implementation (there are probably very few additional tests needed
> > after the conversion).  Given all tests pass before/after, and there's a
> > lot of them, I think we can have a reasonable confidence of a low bug
> > density.
> 
> That seems like a plan, but now would probably be a good time for some
> other hackers to take note of this proposal.

Well, the executor is long overdue for improvement, so I would love to
have this in 10.0.  I am not sure what additional polishing will happen
if we punt it for 11.0.  I think the only downside of having it in 10.0
is that it will not have lived in the source tree for as long a time
between commit and PG release, but if it is tested, that seems fine.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Robert Haas
Date:
On Fri, Mar 10, 2017 at 7:42 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Fri, Mar 10, 2017 at 07:15:59PM -0500, Peter Eisentraut wrote:
>> On 3/10/17 14:40, Andres Freund wrote:
>> > I'd really like to get it in.  The performance improvements on their own
>> > are significant, and it provides the basis for significantly larger
>> > improvements again (JIT) - those followup improvements are large patches
>> > again though, so I'd rather not do all of that next cycle.
>> >
>> > My next step (over the weekend) is to add tests to execQual.c to get it
>> > a good chunk closer to 100% test coverage, and then do the same for the
>> > new implementation (there are probably very few additional tests needed
>> > after the conversion).  Given all tests pass before/after, and there's a
>> > lot of them, I think we can have a reasonable confidence of a low bug
>> > density.
>>
>> That seems like a plan, but now would probably be a good time for some
>> other hackers to take note of this proposal.
>
> Well, the executor is long overdue for improvement, so I would love to
> have this in 10.0.  I am not sure what additional polishing will happen
> if we punt it for 11.0.  I think the only downside of having it in 10.0
> is that it will not have lived in the source tree for as long a time
> between commit and PG release, but if it is tested, that seems fine.

Yeah.  I think we can't rule out the possibility that this patch is
full of bugs, but (1) a lot of the change is pretty mechanical and (2)
none of this touches the on-disk format or the catalog contents, so it
doesn't break anybody's existing install if we have to patch it.  On
the other hand, (3) it's a big patch and (4) if expression evaluation
doesn't work, PostgreSQL is pretty much unusable.

So my feeling is:

1. If Andres commits this and it turns out to be seriously buggy, he
should be prepared to revert it without a big fight.  We can try again
for v11.

2. If Andres commits this and it turns out to be mildly buggy, he
should be prepared to address bugs very proactively and very promptly.

3. If Andres wants to commit this for v10, it should be done RSN.
Feature freeze isn't technically until the end of March, but giant
potentially-destabilizing patches that land on March 31st are likely to
lead to outcomes that nobody wants.

Like, I think, everyone else, I'd really like to have these speedups
and I think they will benefit everyone who uses PostgreSQL.  However,
I think that we cannot afford to have a repeat of the simplehash thing
where serious bugs sat around for months without really getting
properly addressed.  If that happens with this, I will be extremely
unhappy, and I'm fairly sure I won't be alone.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Tomas Vondra
Date:
Hi,

On 03/07/2017 04:01 AM, Andres Freund wrote:
> Hi,
>
> Attached is an updated version of the patchset, but more importantly
> some benchmark numbers.
>

I wanted to do a bit of testing and benchmarking on this, but 0004 seems 
to be a bit broken. The patch does not apply anymore - there are some 
conflicts in execQual.c, but I think I fixed those. But then I ran into 
a bunch of compile-time errors, because some of the executor nodes still 
reference bits that were moved elsewhere.

E.g. there's no targetlist in PlanState anymore, yet nodeGatherMerge and 
nodeTableFuncscan do this:
    gm_state->ps.targetlist = (List *)
        ExecInitExpr((Expr *) node->plan.targetlist,
                     (PlanState *) gm_state);

Some of the nodes also assign to ps.qual values that are (List *), but 
that field was switched to ExprState. That needs fixing, I guess.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:
Hi,

On 2017-03-12 05:40:51 +0100, Tomas Vondra wrote:
> I wanted to do a bit of testing and benchmarking on this, but 0004 seems to
> be a bit broken.

Well, "broken" in the sense that it's already outdated, because of other
stuff that got merged.


> The patch does not apply anymore - there are some conflicts
> in execQual.c, but I think I fixed those. But then I ran into a bunch of
> compile-time errors, because some of the executor nodes still reference bits
> that were moved elsewhere.

Updated patch attached.  Note that this patch has two changes I've not
yet evaluated performance-wise.


Questions:
- New code calls UpdateDomainConstraintRef at the beginning of
  expression evaluation, not for each row.  That allows generating
  "proper programs" for constraint checking, which is quite useful for
  performance.  But it's a behaviour change.
- I have *not* changed that, but I'm less than convinced of the pattern
  of "runtime" get_cached_rowtype calls.

Changes:
- comments
- fix minor bug in BooleanTest brought to light thanks to the new
  regression tests (No, IS TRUE isn't the same as IS NOT FALSE)
- fix (theoretical?) issue with get_cached_rowtype() being called once
  for deforming and once for forming a tuple for FieldStores.
- fixed/implemented an optimization XXX around bool AND/OR because the
  solution actually makes the code easier.

What's basically missing here is:
- pgindent run and minimizing resulting damage
- A higher level explanation of how the interpreted expression
  evaluation works.  This patch implements a "fairly standard" approach,
  but it's imo sufficiently involved to warrant an explanation.
- There's one remaining FIXME that needs to be fixed.
- I sometimes see a minor performance regression (~1.2%) in tpch-h's Q17
  that might or might not be code layout related.
- a continued pass to ensure all worthwhile comments from execQual.c are
  retained.

Could be delayed:
- add regression tests for track_functions - I've manually verified it
  works, and we previously didn't have tests either.

Greetings,

Andres Freund


Re: [HACKERS] WIP: Faster Expression Processing v4

From
Konstantin Knizhnik
Date:


On 13.03.2017 11:03, Andres Freund wrote:
> Hi,
>
> On 2017-03-12 05:40:51 +0100, Tomas Vondra wrote:
> > I wanted to do a bit of testing and benchmarking on this, but 0004 seems to
> > be a bit broken.
>
> Well, "broken" in the sense that it's already outdated, because of other
> stuff that got merged.
>
> > The patch does not apply anymore - there are some conflicts
> > in execQual.c, but I think I fixed those. But then I ran into a bunch of
> > compile-time errors, because some of the executor nodes still reference bits
> > that were moved elsewhere.
>
> Updated patch attached.  Note that this patch has two changes I've not
> yet evaluated performance-wise.


I got the following results on my system with an Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz and 16GB RAM,
TPC-H Q1/Q6 scale 10, shared_buffers=8GB, pg_prewarm on the lineitem table projection:


                Q1        Q6
Master          7503 ms   1171 ms
Your patch      6420 ms   1034 ms
VOPS             396 ms    249 ms
VOPS + patch     367 ms    233 ms


-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] WIP: Faster Expression Processing v4

From
Tomas Vondra
Date:
On 03/13/2017 09:03 AM, Andres Freund wrote:
> Hi,
>
> On 2017-03-12 05:40:51 +0100, Tomas Vondra wrote:
>> I wanted to do a bit of testing and benchmarking on this, but 0004 seems to
>> be a bit broken.
>
> Well, "broken" in the sense that it's already outdated, because of other
> stuff that got merged.
>

Yes, that's what I meant. Sorry for not being quite clear.

>
>> The patch does not apply anymore - there are some conflicts
>> in execQual.c, but I think I fixed those. But then I ran into a bunch of
>> compile-time errors, because some of the executor nodes still reference bits
>> that were moved elsewhere.
>
> Updated patch attached.  Note that this patch has two changes I've not
> yet evaluated performance-wise.
>

And which are those?

I plan to do a bit of testing / benchmarking on this later this week, 
depending on other testing already happening on my machine.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:
Hi,

On 2017-03-13 01:03:51 -0700, Andres Freund wrote:
> What's basically missing here is:
> - pgindent run and minimizing resulting damage

Running into a bit of an issue here - pgindent mangles something like
    EEO_SWITCH (op->opcode)
    {
        EEO_CASE(EEO_DONE):
            goto out;

        EEO_CASE(EEO_INNER_FETCHSOME):
            /* XXX: worthwhile to check tts_nvalid inline first? */
            slot_getsomeattrs(innerslot, op->d.fetch.last_var);
            EEO_DISPATCH(op);

        EEO_CASE(EEO_OUTER_FETCHSOME):
            slot_getsomeattrs(outerslot, op->d.fetch.last_var);
            EEO_DISPATCH(op);

        EEO_CASE(EEO_SCAN_FETCHSOME):
            slot_getsomeattrs(scanslot, op->d.fetch.last_var);
            EEO_DISPATCH(op);

        EEO_CASE(EEO_INNER_VAR):
            {
                int         attnum = op->d.var.attnum;

                /*
                 * Can't assert tts_nvalid, as wholerow var evaluation or such
                 * could have materialized the slot - but the contents are
                 * still valid :/
                 */
                Assert(op->d.var.attnum >= 0);
                *op->resnull = innerslot->tts_isnull[attnum];
                *op->resvalue = innerslot->tts_values[attnum];
                EEO_DISPATCH(op);
            }

into
    EEO_SWITCH(op->opcode)
    {
EEO_CASE(EEO_DONE):
        goto out;

EEO_CASE(EEO_INNER_FETCHSOME):
        /* XXX: worthwhile to check tts_nvalid inline first? */
        slot_getsomeattrs(innerslot, op->d.fetch.last_var);
        EEO_DISPATCH(op);

EEO_CASE(EEO_OUTER_FETCHSOME):
        slot_getsomeattrs(outerslot, op->d.fetch.last_var);
        EEO_DISPATCH(op);

EEO_CASE(EEO_SCAN_FETCHSOME):
        slot_getsomeattrs(scanslot, op->d.fetch.last_var);
        EEO_DISPATCH(op);

EEO_CASE(EEO_INNER_VAR):
        {
            int         attnum = op->d.var.attnum;

            /*
             * Can't assert tts_nvalid, as wholerow var evaluation or such
             * could have materialized the slot - but the contents are still
             * valid :/
             */
            Assert(op->d.var.attnum >= 0);
            *op->resnull = innerslot->tts_isnull[attnum];
            *op->resvalue = innerslot->tts_values[attnum];
            EEO_DISPATCH(op);
        }


which is a bit annoying.  (the EEO_CASE is either a jump label or a case
statement, depending on computed goto availability).

It seems we could either:
1) live with the damage
2) disable pgindent
3) move the : inside EEO_CASE's definition, and only use {} blocks.

I'm inclined to go for 3).
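
Concretely, 3) would look something like this (sketch, names
approximate):

    #ifdef EEO_USE_COMPUTED_GOTO
    #define EEO_CASE(name)  CASE_##name:
    #else
    #define EEO_CASE(name)  case name:
    #endif

        EEO_CASE(EEO_DONE)
        {
            goto out;
        }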

Opinions?

- Andres



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:
Hi,

On 2017-03-13 22:37:51 +0100, Tomas Vondra wrote:
> On 03/13/2017 09:03 AM, Andres Freund wrote:
> > Hi,
> >
> > On 2017-03-12 05:40:51 +0100, Tomas Vondra wrote:
> > > I wanted to do a bit of testing and benchmarking on this, but 0004 seems to
> > > be a bit broken.
> >
> > Well, "broken" in the sense that it's already outdated, because of other
> > stuff that got merged.
> >
>
> Yes, that's what I meant. Sorry for not being quite clear.

And the patch already has conflicts again.  Updated version, with a good
chunk of other changes, attached.

> >
> > > The patch does not apply anymore - there are some conflicts
> > > in execQual.c, but I think I fixed those. But then I ran into a bunch of
> > > compile-time errors, because some of the executor nodes still reference bits
> > > that were moved elsewhere.
> >
> > Updated patch attached.  Note that this patch has two changes I've not
> > yet evaluated performance-wise.
> >
>
> And which are those?

Primarily around minor performance regressions for the evaluation of
very simple (i.e. single Var/Const) expressions.  I did a fair amount of
work hunting these down.


Besides that, this version has:
- pgindented most of the affected pieces (i.e. all significant new code
  has been reindented, not all touched code)
- Transported missing comments from execQual.c - afaics all relevant
  ones have been moved over
- Replaced ExprState->cb with evalfunc - that's what it was named
  previously, and I see no reason to diverge.
- significantly expanded header comments, especially in
  execInterpExpr.c, explaining the basic approach of expression
  evaluation.  Feedback here would be greatly appreciated - I've spent
  so much time neck deep in this that I don't quite know how much
  detail to go into.
- Initial polished commit messages and such.


My current plan is to not do very much on this tomorrow, do another full
pass on Wednesday, and push it, unless there's protest till then.


> I plan to do a bit of testing / benchmarking on this later this week,
> depending on other testing already happening on my machine.

Cool.


I've performed a bunch more benchmarks, and the results have been pretty
favorable.  Due to some more work hunting down minor regressions I can't
make any tpc-h query perform worse than on master, with either
best/average of three executions.  I also did a bunch of runs with
parallelism, and the results there are also pretty good - even better
than non-parallel sometimes.

These results include parallelism, but were done *before* rebasing ontop
of Noah's changes.  There was also, during all runs, a browser and emacs
running, so it's certainly possible that there's some skew.

-par is with parallelism enabled (set to 8), the variants without -par
are without parallelism.  Otherwise same setup as two mails up.

master-par q01 min: 4629.643 master min: 15108.813 [diff +69.36] dev min: 12629.0 [diff +63.34] dev-par min: 3458.899 [diff -33.85]
master-par q02 min: 1048.511 master min: 1196.596 [diff +12.38] dev min: 1182.652 [diff +11.34] dev-par min: 1044.876 [diff -0.35]
master-par q03 min: 4651.492 master min: 6949.525 [diff +33.07] dev min: 6441.467 [diff +27.79] dev-par min: 3928.919 [diff -18.39]
master-par q04 min: 698.037 master min: 981.081 [diff +28.85] dev min: 947.506 [diff +26.33] dev-par min: 341.855 [diff -104.19]
master-par q05 min: 871.528 master min: 5119.728 [diff +82.98] dev min: 4979.498 [diff +82.50] dev-par min: 702.3 [diff -24.10]
master-par q06 min: 1107.307 master min: 1807.433 [diff +38.74] dev min: 1694.513 [diff +34.65] dev-par min: 738.037 [diff -50.03]
master-par q07 min: 4867.412 master min: 4745.141 [diff -2.58] dev min: 4682.068 [diff -3.96] dev-par min: 4207.644 [diff -15.68]
master-par q08 min: 432.159 master min: 1465.219 [diff +70.51] dev min: 1416.691 [diff +69.50] dev-par min: 418.331 [diff -3.31]
master-par q09 min: 3965.76 master min: 6441.099 [diff +38.43] dev min: 6220.095 [diff +36.24] dev-par min: 3875.673 [diff -2.32]
master-par q10 min: 1129.351 master min: 5818.318 [diff +80.59] dev min: 5406.125 [diff +79.11] dev-par min: 1107.759 [diff -1.95]
master-par q11 min: 247.598 master min: 291.061 [diff +14.93] dev min: 277.18 [diff +10.67] dev-par min: 241.983 [diff -2.32]
master-par q12 min: 1532.855 master min: 4552.034 [diff +66.33] dev min: 4431.584 [diff +65.41] dev-par min: 1395.703 [diff -9.83]
master-par q13 min: 8638.503 master min: 8009.64 [diff -7.85] dev min: 7891.661 [diff -9.46] dev-par min: 8314.437 [diff -3.90]
master-par q14 min: 723.878 master min: 764.35 [diff +5.29] dev min: 730.526 [diff +0.91] dev-par min: 717.66 [diff -0.87]
master-par q15 min: 1991.915 master min: 1809.162 [diff -10.10] dev min: 1716.983 [diff -16.01] dev-par min: 1879.927 [diff -5.96]
master-par q16 min: 1596.643 master min: 1843.553 [diff +13.39] dev min: 1728.467 [diff +7.63] dev-par min: 1429.103 [diff -11.72]
master-par q17 min: 354.357 master min: 418.761 [diff +15.38] dev min: 414.958 [diff +14.60] dev-par min: 322.471 [diff -9.89]
master-par q18 min: 11707.479 master min: 14216.567 [diff +17.65] dev min: 14206.893 [diff +17.59] dev-par min: 11642.863 [diff -0.55]
master-par q19 min: 116.819 master min: 325.531 [diff +64.11] dev min: 287.497 [diff +59.37] dev-par min: 102.909 [diff -13.52]
master-par q20 min: 430.85 master min: 500.92 [diff +13.99] dev min: 478.536 [diff +9.96] dev-par min: 410.548 [diff -4.95]
master-par q22 min: 460.712 master min: 626.207 [diff +26.43] dev min: 595.708 [diff +22.66] dev-par min: 434.851 [diff -5.95]

These seem like pretty neat results.  Sorry for not cleaning up the
representation - too tired.

Greetings,

Andres Freund


Re: [HACKERS] WIP: Faster Expression Processing v4

From
Alvaro Herrera
Date:
Andres Freund wrote:

>     EEO_SWITCH(op->opcode)
>     {
> EEO_CASE(EEO_DONE):
>         goto out;

Oh my.

> which is a bit annoying.  (the EEO_CASE is either a jump label or a case
> statement, depending on computed goto availability).
> 
> It seems we could either:
> 1) live with the damage
> 2) disable pgindent
> 3) move the : inside EEO_CASE's definition, and only use {} blocks.

I think 3) is nasty because the result doesn't look like normal C.  If
the EEO_CASE()s are potentially jump labels, then the indentation is
correct.  Why not accept it?  It looks odd, but that gives a better hint
to the reader about what's going on.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Alvaro Herrera
Date:
diff --git a/src/backend/executor/execUtils.c b/src/backend/executor/execUtils.c
index 3d6a3801c0..d205101b89 100644
--- a/src/backend/executor/execUtils.c
+++ b/src/backend/executor/execUtils.c
@@ -47,7 +47,14 @@
 #include "utils/rel.h"

-static bool get_last_attnums(Node *node, ProjectionInfo *projInfo);
+typedef struct LastLastAttnumInfo


LastLast?


Patch 0003 is huge.  It would be good to have someone at least read it
before pushing, but I don't think anyone other than you has done so.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Heikki Linnakangas
Date:
On 03/14/2017 08:53 AM, Andres Freund wrote:
> Besides that, this version has:
> - pgindented most of the affected pieces (i.e. all significant new code
>   has been reindented, not all touched code)

I think you'll need to add all the inner structs of ExprEvalStep to 
typedefs.list to indent them right.

> My current plan is to not do very much on this tomorrow, do another full
> pass on Wednesday, and push it, unless there's protest till then.

I looked at patch 0004. Some comments:

* EEO_DISPATCH_DIRECT(op) takes 'op' as an argument, but it really 
assumes that 'op' has already been set to point to the jump target. I 
find that a bit weird. I guess the idea is that you always pass the 
Program Counter variable as the 'op' argument. For consistency, it would 
be good if EEO_SWITCH() also took just 'op' as the argument, rather than 
op->opcode. But I think it would be clearer if they both just assumed 
that there's a variable called 'op' that points to the current 
instruction.

* All the callers of EEO_DISPATCH_DIRECT(op) set 'op' just prior to 
calling EEO_DISPATCH_DIRECT(op). How about having a macro EEO_JUMP(<step 
number>), to encapsulate setting 'op' and jumping to it in one operation?
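
Something like (an untested sketch):

    #define EEO_JUMP(stepno) \
        do { \
            op = &state->steps[(stepno)]; \
            EEO_DISPATCH_DIRECT(op); \
        } while (0)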

* ExecEvalStepOp() seems relatively expensive, with the linear scan over 
all the opcodes, if called on an ExprState that already has 
EEO_FLAG_JUMP_THREADED set. All the callers use it to check if the 
opcode is a particular one, so you could check if the opcode matches 
that particular one, instead of scanning the dispatch table to find what 
it is.

* But does ExecEvalStepOp() ever actually get called on an expression with 
EEO_FLAG_JUMP_THREADED already set? It's only used in 
ExecInstantiateInterpretedExpr(), and it's called only once on each 
ExprState. How about just adding an Assert that EEO_FLAG_JUMP_THREADED 
is not yet set? Or drop the EEO_FLAG_JUMP_THREADED flag altogether, and 
assert that evalfunc != ExecInterpExpr.

* How tight are we on space in the ExprEvalStep union? Currently, the 
jump-threading preparation replaces the opcodes with the goto labels, 
but it would be really nice to keep the original opcodes, for debugging 
purposes if nothing else.

* execInterpExpr.c is quite a mouthful. How about exprInterpreter.c?


Attached is a patch, on top of your 
0004-Faster-expression-evaluation-and-targetlist-projecti.patch, with 
some minor tweaks and comments here and there (search for HEIKKI). It's 
mostly the same stuff I listed above, implemented in a quick & dirty 
way. I hope it's helpful.

- Heikki



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:
Hi,


On 2017-03-14 16:58:54 +0200, Heikki Linnakangas wrote:
> On 03/14/2017 08:53 AM, Andres Freund wrote:
> > Besides that, this version has:
> > - pgindented most of the affected pieces (i.e. all significant new code
> >   has been reindented, not all touched code)
> 
> I think you'll need to add all the inner structs ExprEvalStep typedefs.list
> to indent them right.

Yea, I saw that.  Since they're not named atm, that'd probably not work
well.  I'd be inclined to just live with it :/


> I looked at patch 0004.

Thanks!


> * EEO_DISPATCH_DIRECT(op) takes 'op' as an argument, but it really assumes
> that 'op' has already been set to point to the jump target. I find that a
> bit weird. I guess the idea is that you always pass the Program Counter
> variable as the 'op' argument. For consistency, it would be good if
> EEO_SWITCH() also took just 'op' as the argument, rather than op->opcode.
> But I think it would be clearer if they both just assumed that there's a
> variable called 'op' that points to the current instruction.

Hm.  I dislike magic variable names. I think I prefer your idea of
passing the op entirely to EEO_SWITCH.


> * All the callers of EEO_DISPATCH_DIRECT(op) set 'op' just prior to calling
> EEO_DISPATCH_DIRECT(op). How about having a macro EEO_JUMP(<step number>),
> to encapsulate setting 'op' and jumping to it in one operation?

Ok.


> * ExecEvalStepOp() seems relatively expensive, with the linear scan over all
> the opcodes, if called on an ExprState that already has
> EEO_FLAG_JUMP_THREADED set. All the callers use it to check if the opcode is
> a particular one, so you could check if the opcode matches that particular
> one, instead of scanning the dispatch table to find what it is.

It should be fairly cheap at that place, unless the same expression is
instantiated twice for some reason.  I have been wondering about adding
a second table ptr->op that can be binary searched for the operation to
make it cheaper.


> * But does ExecEvalStepOp() ever actually get called on an expression with
> EEO_FLAG_JUMP_THREADED already set? It's only used in
> ExecInstantiateInterpretedExpr(), and it's called only once on each
> ExprState. How about just adding an Assert that EEO_FLAG_JUMP_THREADED is
> not yet set? Or drop the EEO_FLAG_JUMP_THREADED flag altogether, and assert
> that evalfunc != ExecInterpExpr.

The primary use of ExecEvalStepOp() is really debugging, so I'd like to
avoid adding that restriction.  I guess we can add a second function for
this if needed.


> * How tight are we on space in the ExprEvalStep union? Currently, the
> jump-threading preparation replaces the opcodes with the goto labels, but it
> would be really nice to keep the original opcodes, for debugging purposes if
> nothing else.

There's nothing left to spare :(.  For debugging I've just used
ExecEvalStepOp() - it's inconvenient to directly print the struct
anyway, since gdb prints the whole union...


> * execInterpExpr.c is quite a mouthful. How about exprInterpreter.c?



> Attached is a patch, on top of your
> 0004-Faster-expression-evaluation-and-targetlist-projecti.patch, with some
> minor tweaks and comments here and there (search for HEIKKI). It's mostly
> the same stuff I listed above, implemented in a quick & dirty way. I hope
> it's helpful.

Thanks!



>  typedef enum ExprEvalOp
>  {
> +    /*
> +     * HEIKKI: How about renaming these to EEOP_* ? I think it'd be a
> +     * good mnemonic to remember that these are opcodes, and just one
> +     * extra letter.
> +     */

I don't mind the change.


> +    /* HEIKKI: Is it safe to assume that sizeof(size_t) >= sizeof(void *) ? */

Yes, seems very likely - it's supposed to hold any potential sizeof()
return value.  You could end up on platforms where that's not true, if
it used tagged pointers or such.  I guess we could instead use a union,
but that doesn't seem that much of a benefit.


Greetings,

Andres Freund



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:
On 2017-03-14 08:28:02 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> 
> >     EEO_SWITCH(op->opcode)
> >     {
> > EEO_CASE(EEO_DONE):
> >         goto out;
> 
> Oh my.
> 
> > which is a bit annoying.  (the EEO_CASE is either a jump label or a case
> > statement, depending on computed goto availability).
> > 
> > It seems we could either:
> > 1) live with the damage
> > 2) disable pgindent
> > 3) move the : inside EEO_CASE's definition, and only use {} blocks.
> 
> I think 3) is nasty because the result doesn't look like normal C.

We have a bunch of such constructs already however. foreach(),
PG_TRY().  That means 3) has the advantage that our editors and pgindent
can already deal with it.


> the EEO_CASE()s are potentially jump labels, then the indentation is
> correct.  Why not accept it?  It looks odd, but that gives a better hint
> to the reader about what's going on.

Seems likely that every editor would indent them differently :(


I'm leaning towards 3) for the reasons above and after trying out the
different indentations, but I don't have a strong opinion.


Greetings,

Andres Freund



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:
On 2017-03-14 08:44:24 -0300, Alvaro Herrera wrote:
> Patch 0003 is huge.

I suspect you mean 0004?  If so - yes :(.  I unfortunately don't think
there's a useful way to split it in smaller chunks - I originally moved
ops over one-by-one, but that required a lot of duplicated structs and
such...


> It would be good to have someone at least read it before pushing, but
> I don't think anyone other than you has done so.

I'd love for somebody else
to look through it, I tried asking multiple times...  Seems like
threatening a commit is the best way to get it :P


- Andres



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2017-03-14 08:44:24 -0300, Alvaro Herrera wrote:
>> It would be good to have someone at least read it before pushing, but
>> I don't think anyone other than you has done so.

> I'd love for somebody else
> to look through it, I tried asking multiple times...  Seems like
> threatening a commit is the best way to get it :P

I'm willing to look, but the last few messages make it sound like you're
not all that close to a finished version.
        regards, tom lane



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Heikki Linnakangas
Date:
On 03/14/2017 07:31 PM, Andres Freund wrote:
> On 2017-03-14 16:58:54 +0200, Heikki Linnakangas wrote:
>> * How tight are we on space in the ExprEvalStep union? Currently, the
>> jump-threading preparation replaces the opcodes with the goto labels, but it
>> would be really nice to keep the original opcodes, for debugging purposes if
>> nothing else.
>
> There's nothing left to spare :(.  For debugging I've just used
> ExecEvalStepOp() - it's inconvenient to directly print the struct
> anyway, since gdb prints the whole union...

This comment above the union is not accurate:

>     /*
>      * Data for an operation. Inline stored data is faster to access, but also
>      * bloats the size of all instructions. The union should be kept below 48
>      * bytes (so the entire struct is below 64bytes, a single cacheline on
>      * common systems).
>      */

On my x86-64 system, the whole struct is 64 bytes, but the union is only 
40 bytes. The fields before the union occupy 24 bytes. IOW, the union 
should be kept below 40 bytes, not 48.
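
Perhaps a compile-time check would keep that honest, something like:

    StaticAssertStmt(sizeof(ExprEvalStep) <= 64,
                     "size of ExprEvalStep exceeds 64 bytes");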

Hmm. Could we make the instructions variable size? It would allow 
packing the small instructions even more tightly, and we wouldn't need 
to obsess over a particular maximum size for more complicated instructions.

A compromise might be to have the fixed size, but have "continuation" 
opcodes for the more complicated instructions. An XmlExpr instruction, 
for example, could have an extra instruction after the primary one, just 
to hold more fields. When evaluating it, we would just increment the 
Program Counter by two instead of one, to skip over the extra instruction.
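
In interpreter terms, roughly (a sketch; the three-argument
ExecEvalXmlExpr() is hypothetical):

    EEO_CASE(EEO_XMLEXPR):
        {
            /* op + 1 is the extra "continuation" step with overflow fields */
            ExecEvalXmlExpr(state, op, op + 1);
            op += 2;            /* advance over both steps */
            EEO_DISPATCH(op);
        }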

For reference, here are the sizes of all the structs in the union:

40: xmlexpr
40: scalararrayop
40: minmax
40: iocoerce
40: convert_rowtype
32: wholerow
32: rowcompare_step
32: row
32: func
32: fieldstore
32: domaincheck
32: boolexpr
32: arrayexpr
32: arraycoerce
24: casewhen
24: casethen
24: arrayref_checksubscript
16: grouping_func
16: fieldselect
16: constval
16: casetest
16: assign_var
16: arrayref
8: window_func
8: subplan
8: sqlvaluefunction
8: param
8: nulltest_row
8: assign_tmp
8: alternative_subplan
8: aggref
4: var
4: rowcompare_final
4: qualexpr
4: fetch
4: coalesce

- Heikki




Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:
Hi,

On 2017-03-14 20:28:51 +0200, Heikki Linnakangas wrote:
> On 03/14/2017 07:31 PM, Andres Freund wrote:
> > On 2017-03-14 16:58:54 +0200, Heikki Linnakangas wrote:
> > > * How tight are we on space in the ExprEvalStep union? Currently, the
> > > jump-threading preparation replaces the opcodes with the goto labels, but it
> > > would be really nice to keep the original opcodes, for debugging purposes if
> > > nothing else.
> > 
> > There's nothing left to spare :(.  For debugging I've just used
> > ExecEvalStepOp() - it's inconvenient to directly print the struct
> > anyway, since gdb prints the whole union...
> 
> This comment above the union is not accurate:
> 
> >     /*
> >      * Data for an operation. Inline stored data is faster to access, but also
> >      * bloats the size of all instructions. The union should be kept below 48
> >      * bytes (so the entire struct is below 64bytes, a single cacheline on
> >      * common systems).
> >      */
> 
> On my x86-64 system, the whole struct is 64 bytes, but the union is only 40
> bytes. The fields before the union occupy 24 bytes. IOW, the union should be
> kept below 40 bytes, not 48.

Point.  Will update.


> Hmm. Could we make the instructions variable size? It would allow packing
> the small instructions even more tightly, and we wouldn't need to obsess over
> a particular maximum size for more complicated instructions.

That makes jumps a lot more complicated.  I'd experimented with it and
given it up as "not worth it".  If we were to try to do so, we'd also
not want to store the pointer and enum variants both, since it'd again
reduce the density.


> A compromise might be to have the fixed size, but have "continuation"
> opcodes for the more complicated instructions. An XmlExpr instruction, for
> example, could have an extra instruction after the primary one, just to hold
> more fields. When evaluating it, we would just increment the Program Counter
> by two instead of one, to skip over the extra instruction.

I think for cases where 40 bytes becomes the limit, it's usually better
to have out-of-line state.  We could do this, but we'd also waste a
bunch of space in the array (namely the 24 byte Op "header").

Greetings,

Andres Freund



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Douglas Doole
Date:
Andres, sorry I haven't had a chance to look at this great stuff you've been doing. I've wanted to get to it, but work keeps getting in the way. ;-)

I do have one observation based on my experiments with your first version of the code. In my tests, I found that expression init becomes a lot more expensive in this new model. (That's neither a surprise, nor a concern.) In particular, the function ExprEvalPushStep() is quite hot. In my code I made the following changes:

  * Declare ExprEvalPushStep() "inline".
  * Remove the "if (es->steps_alloc == 0)" condition from ExprEvalPushStep().
  * In ExecInitExpr(), add:
       state->steps_alloc = 16;
       state->steps = palloc(sizeof(ExprEvalStep) * es->steps_alloc);

I found that this cut the cost of initializing the expression by about 20%. (Of course, that was on version 1 of your code, so the benefit may well be different now.)
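
Putting those three changes together, the push path looked roughly like
this (against v1 of the patch, so field names may have drifted since):

    static inline void
    ExprEvalPushStep(ExprState *es, const ExprEvalStep *s)
    {
        /* steps_alloc starts at 16 now, so no zero-size special case */
        if (es->steps_len == es->steps_alloc)
        {
            es->steps_alloc *= 2;
            es->steps = (ExprEvalStep *)
                repalloc(es->steps, sizeof(ExprEvalStep) * es->steps_alloc);
        }
        memcpy(&es->steps[es->steps_len++], s, sizeof(ExprEvalStep));
    }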

On Tue, Mar 14, 2017 at 11:51 AM Andres Freund <andres@anarazel.de> wrote:
> > Hmm. Could we make the instructions variable size? It would allow packing
> > the small instructions even more tightly, and we wouldn't need to obsess
> > over a particular maximum size for more complicated instructions.
>
> That makes jumps a lot more complicated.  I'd experimented with it and
> given it up as "not worth it".

Back when I was at IBM, I spent a lot of time doing stuff like this. If you want to commit with the fixed size arrays, I'm happy to volunteer to look at packing it tighter as a follow-on piece of work. (It was already on my list of things to try anyhow.)
 
> If we were to try to do so, we'd also
> not want to store the pointer and enum variants both, since it'd again
> reduce the density.
 
From my experience, it's worth the small loss in density to carry around both the pointer and the enum - it makes debugging so much easier.

- Doug
Salesforce

Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:
Hi,

On 2017-03-14 22:03:45 +0000, Douglas Doole wrote:
> I do have one observation based on my experiments with your first version
> of the code. In my tests, I found that expression init becomes a lot more
> expensive in this new model. (That's neither a surprise, nor a
> concern.)

I suspect that's to a good degree because back then it'd often end up
trying the "new" stuff, but then fall back to the old machinery due to
unsupported operations. Which isn't a concern anymore, because now
there's full coverage.

Otherwise the costs aren't, according to my measurements, higher than
before, excepting that some stuff that happened at execution time is now
happening during initialization.


> In particular, the function ExprEvalPushStep() is quite hot. In my
> code I made the following changes:
> 
>   * Declare ExprEvalPushStep() "inline".
>   * Remove the "if (es->steps_alloc == 0)" condition
> from ExprEvalPushStep().
>   * In ExecInitExpr(), add:
>        state->steps_alloc = 16;
>        state->steps = palloc(sizeof(ExprEvalStep) * es->steps_alloc);
> 
> I found that this cut the cost of initializing the expression by about 20%.
> (Of course, that was on version 1 of your code, so the benefit may well be
> different now.)


Hm.  Right now ExprState's are allocated in several places - but we
could easily move that to a central place.  I have a bit of a hard time
seeing that that branch during *initialization* time is that expensive,
especially given that previously we allocated a lot of memory separately
too.

> On Tue, Mar 14, 2017 at 11:51 AM Andres Freund <andres@anarazel.de> wrote:
> 
> > > Hmm. Could we make the instructions variable size? It would allow packing
> > > the small instructions even more tightly, and we wouldn't need to obsess
> > > over a particular maximum size for more complicated instructions.
> >
> > That makes jumps a lot more complicated.  I'd experimented with it and
> > given it up as "not worth it".

> Back when I was at IBM, I spent a lot of time doing stuff like this. If you
> want to commit with the fixed size arrays, I'm happy to volunteer to look
> at packing it tighter as a follow-on piece of work. (It was already on my
> list of things to try anyhow.)

Yea, I think experimenting with that is a good idea. I just think this
is complicated enough, and I don't want to add a whole lot more
complexity for very debatable benefit.


> > If we were to try to do so, we'd also
> > not want to store the pointer and enum variants both, since it'd again
> > reduce the density.

> From my experience, it's worth the small loss in density to carry around
> both the pointer and the enum - it makes debugging so much easier.

I found it annoying enough to print the instruction itself, due to the
union - so printing the opcode using a function isn't too bad...

Greetings,

Andres Freund



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Douglas Doole
Date:
On Tue, Mar 14, 2017 at 3:16 PM Andres Freund <andres@anarazel.de> wrote:
> Hm.  Right now ExprState's are allocated in several places - but we
> could easily move that to a central place.  I have a bit of a hard time
> seeing that that branch during *initialization* time is that expensive,
> especially given that previously we allocated a lot of memory separately
> too.

I didn't make any comparisons of the cost of the new init against the old init with this change in particular - I just saw that it made the new init faster. I also didn't play around to determine if the savings was found in removing the branch misprediction or inlining or both.

I certainly wouldn't hold up your commit for this, but it's something that might be worth a second look once the dust has settled.

Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:
Hi,

On 2017-03-14 16:58:54 +0200, Heikki Linnakangas wrote:
> * execInterpExpr.c is quite a mouthful. How about exprInterpreter.c?

I applied most of your changes in the attached version.

Notes:
* I quite like how the EEO_JUMP/EEO_NEXT stuff looks now / after your
  change, that's clearly better.
* I reverted the removal of EEO_SWITCH() - pgindent goes a bit bonkers
  if it can see the switch(), but not the case: statements.
* I re-added ExecEvalStepOp() - for one it's hugely useful during
  debugging, for another I'm actually using it in a follow patch (So JIT
  compilation can run after the interpreter has been initialized).
  Added code guaranteeing that we can't run
  ExecInstantiateInterpretedExpr() twice.
* Renamed EEO_* opcode enum members to EEOP_* as suggested
* I've not yet renamed execInterpExpr.c - I don't like execInterpreter.c
  because it doesn't reference expressions and execExprInterpreter.c
  seems a bit long - but I can live with both (preferring the
  latter). Not that surprisingly I can also live with execInterpExpr.c ;)
* Addressed your size_t sizing concern by using, as is proper anyway,
  intptr_t.
* I tried to clarify the comment in execExpr.c's header that you marked
  as hard to understand.

Greetings,

Andres Freund


Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:
On 2017-03-14 14:19:18 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-03-14 08:44:24 -0300, Alvaro Herrera wrote:
> >> It would be good to have someone at least read it before pushing, but
> >> I don't think anyone other than you has done so.
> 
> > I'd love for somebody else
> > to look through it, I tried asking multiple times...  Seems like
> > threatening a commit is the best way to get it :P
> 
> I'm willing to look

Cool.


> but the last few messages make it sound like you're not all that close
> to a finished version.

Hm, doesn't seem that far off to me - the latest few rounds of changes
have been largely polishing (naming, fighting w/ pgindent, language
polishing in comments, etc), otherwise I've basically just added a few
lines of optimization after doing more profiling.

Greetings,

Andres Freund



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> [ new patch versions ]

About to leave, but I had time to read 0003:

It seems bizarre that you chose to spell the new configure symbol as
HAVE__COMPUTED_GOTO rather than HAVE_COMPUTED_GOTO, especially so
when the comment for PGAC_C_COMPUTED_GOTO alleges it's the latter.
Also, being a neatnik, I would insert both the definition and the
call of that macro between PGAC_C_BUILTIN_UNREACHABLE and
PGAC_C_VA_ARGS, keeping it more or less in alphabetical order.
        regards, tom lane



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:
Hi,

On 2017-03-14 19:34:12 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > [ new patch versions ]
> 
> About to leave, but I had time to read 0003:
> 
> It seems bizarre that you chose to spell the new configure symbol as
> HAVE__COMPUTED_GOTO rather than HAVE_COMPUTED_GOTO

I went back-and-forth about this a number of times.  We have a bunch of
symbols defined with HAVE__ as a prefix (and some with HAVE_GCC__) - and
more of the nearby code seems to use __ rather than _.  I don't really
know why we started doing that, but it's far from new..

Any idea why we introduce __ stuff?


> Also, being a neatnik, I would insert both the definition and the
> call of that macro between PGAC_C_BUILTIN_UNREACHABLE and
> PGAC_C_VA_ARGS, keeping it more or less in alphabetical order.

That works for me.

Thanks,

Andres



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Peter Eisentraut
Date:
On 3/14/17 19:40, Andres Freund wrote:
> Any idea why we introduce __ stuff?

Because the symbols start with an underscore:

/* Define to 1 if your compiler understands _Static_assert. */
#undef HAVE__STATIC_ASSERT

There is apparently some inconsistency when symbols start with more than
one underscore.

But we don't need to introduce underscores that are not there.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:
On 2017-03-14 23:10:25 -0400, Peter Eisentraut wrote:
> On 3/14/17 19:40, Andres Freund wrote:
> > Any idea why we introduce __ stuff?
> 
> Because the symbols start with an underscore:
> 
> /* Define to 1 if your compiler understands _Static_assert. */
> #undef HAVE__STATIC_ASSERT

Oh, I guess that makes some sense.  But we do so very inconsistently,
given we only keep the leading underscores not the trailing ones (see
HAVE__VA_ARGS, HAVE_FUNCNAME__FUNC).


> But we don't need to introduce underscores that are not there.

Indeed.  Don't want to repost for just this, but consider it adapted
(and moved as requested by Tom).

- Andres



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2017-03-14 19:34:12 -0400, Tom Lane wrote:
>> It seems bizarre that you chose to spell the new configure symbol as
>> HAVE__COMPUTED_GOTO rather than HAVE_COMPUTED_GOTO

> I went back-and-forth about this a number of times.  We have a bunch of
> symbols defined with HAVE__ as a prefix (and some with HAVE_GCC__) - and
> more of the nearby code seems to use __ rather than _.  I don't really
> know why we started doing that, but it's far from new..

> Any idea why we introduce __ stuff?

The nearby stuff is describing features that have a specific name that
includes a leading underscore or two, like __builtin_unreachable().
That doesn't seem to apply here, so I wouldn't add extra underscores.
        regards, tom lane



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> [ latest patches ]

I looked through 0002-Make-get_last_attnums-more-generic.patch.
Although it seems relatively unobjectionable on its own, I'm not
convinced that it's really useful to try to split it out like this.
I see that 0004 removes the only call of ExecGetLastAttnums (the
one in ExecBuildProjectionInfo) and then adds a single call in
ExecInitExprSlots which is in execExpr.c.  To me, the only reason
ExecGetLastAttnums nee get_last_attnums is in execUtils.c in the first
place is that it is a private subroutine of ExecBuildProjectionInfo.
After these changes, it might as well be a private subroutine of
ExecInitExprSlots.  I'm suspicious of turning it into a globally
accessible function as you've done here, because I doubt that it is of
global use --- in particular, the fact that it doesn't deal specially
with INDEX_VAR Vars seems rather specific to this one use-case.

So for my money, you should drop 0002 altogether and just have 0004
remove get_last_attnums() from execUtils.c and stick it into
execExpr.c.  I suppose you still need the LastAttnumInfo API change
so as to decouple it from ProjectionInfo, but that's minor.
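(For reference, the API in question looks roughly like this in the
posted patch - the exact signature here is my reconstruction and may
differ:

typedef struct LastAttnumInfo
{
    AttrNumber  last_inner;     /* highest inner-tuple attnum used */
    AttrNumber  last_outer;     /* highest outer-tuple attnum used */
    AttrNumber  last_scan;      /* highest scan-tuple attnum used */
} LastAttnumInfo;

extern void ExecGetLastAttnums(Node *node, LastAttnumInfo *info);

so callers no longer need a ProjectionInfo.)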
        regards, tom lane



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:

On March 15, 2017 12:33:28 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Andres Freund <andres@anarazel.de> writes:
>> [ latest patches ]
>
>I looked through 0002-Make-get_last_attnums-more-generic.patch.

>So for my money, you should drop 0002 altogether and just have 0004
>remove get_last_attnums() from execUtils.c and stick it into
>execExpr.c.  I suppose you still need the LastAttnumInfo API change
>so as to decouple it from ProjectionInfo, but that's minor.

I think it's quite useful in other places too, we do repeated slot-getattrs
in a bunch of places in the executor, and it's noticeable performance-wise.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On March 15, 2017 12:33:28 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> So for my money, you should drop 0002 altogether and just have 0004
>> remove get_last_attnums() from execUtils.c and stick it into
>> execExpr.c.  I suppose you still need the LastAttnumInfo API change
>> so as to decouple it from ProjectionInfo, but that's minor.

> I think it's quite useful in other places too, we do repeated slot-getattrs
> in a bunch of places in the executor, and it's noticeable performance-wise.

Color me dubious.  Which specific other places have you got in mind, and
do they have expression trees at hand that would tell them which columns
they really need to pull out?
        regards, tom lane



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:
On 2017-03-15 15:41:22 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On March 15, 2017 12:33:28 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> So for my money, you should drop 0002 altogether and just have 0004
> >> remove get_last_attnums() from execUtils.c and stick it into
> >> execExpr.c.  I suppose you still need the LastAttnumInfo API change
> >> so as to decouple it from ProjectionInfo, but that's minor.
> 
> > I think it's quite useful in other places too, we do repeated slot-getattrs
> > in a bunch of places in the executor, and it's noticeable performance-wise.
> 
> Color me dubious.  Which specific other places have you got in mind, and
> do they have expression trees at hand that would tell them which columns
> they really need to pull out?

I was thinking of execGrouping.c's execTuplesMatch(),
TupleHashTableHash() (and unequal, but doubt that matters
performancewise).  There's also nodeHash.c's ExecHashGetHashValue(), but I
think that'd possibly be better fixed differently.

Greetings,

Andres Freund



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2017-03-15 15:41:22 -0400, Tom Lane wrote:
>> Color me dubious.  Which specific other places have you got in mind, and
>> do they have expression trees at hand that would tell them which columns
>> they really need to pull out?

> I was thinking of execGrouping.c's execTuplesMatch(),
> TupleHashTableHash() (and unequal, but doubt that matters
> performancewise).  There's also nodeHash.c's ExecHashGetHashValue(), but I
> think that'd possibly be better fixed differently.

The execGrouping.c functions don't have access to an expression tree
instructing them which columns to pull out of the tuple, so I fail to see
how get_last_attnums() would be of any use to them.  As for
ExecHashGetHashValue, it's most likely going to be working from virtual
tuples passed up to the join, which won't benefit from predetermination
of the last column to be accessed.  The tuple-deconstruction would have
happened while projecting in the scan node below.
        regards, tom lane



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:
On 2017-03-15 16:07:14 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-03-15 15:41:22 -0400, Tom Lane wrote:
> >> Color me dubious.  Which specific other places have you got in mind, and
> >> do they have expression trees at hand that would tell them which columns
> >> they really need to pull out?
>
> > I was thinking of execGrouping.c's execTuplesMatch(),
> > TupleHashTableHash() (and unequal, but doubt that matters
> > performancewise).  There's also nodeHash.c's ExecHashGetHashValue(), but I
> > think that'd possibly be better fixed differently.
>
> The execGrouping.c functions don't have access to an expression tree
> instructing them which columns to pull out of the tuple, so I fail to see
> how get_last_attnums() would be of any use to them.

I presume most of the callers do.  We'd have to change the API somewhat,
unless we just have a small loop in execTuplesMatch() determining the
biggest column index (which might be worthwhile / acceptable).
TupleHashTableHash() should be able to have that pre-computed in
BuildTupleHashTable().  Might be more viable to go that way.
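
The small-loop variant would be something like this (untested sketch,
reusing execTuplesMatch()'s existing parameter names):

    AttrNumber  maxatt = 0;
    int         i;

    for (i = 0; i < numCols; i++)
        maxatt = Max(maxatt, matchColIdx[i]);

    slot_getsomeattrs(slot1, maxatt);
    slot_getsomeattrs(slot2, maxatt);

    /* ... then compare slot1->tts_values[matchColIdx[i] - 1] etc.
     * directly, instead of going through slot_getattr() each time ... */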


> As for ExecHashGetHashValue, it's most likely going to be working from
> virtual tuples passed up to the join, which won't benefit from
> predetermination of the last column to be accessed.  The
> tuple-deconstruction would have happened while projecting in the scan
> node below.

I think the physical tuple stuff commonly thwarts that argument?  On
master for tpch's Q5 you can e.g. see the following profile (master):

+   29.38%  postgres  postgres          [.] ExecScanHashBucket
+   16.72%  postgres  postgres          [.] slot_getattr
+    5.51%  postgres  postgres          [.] heap_getnext
-    5.50%  postgres  postgres          [.] slot_deform_tuple
   - 98.07% slot_deform_tuple
      - 85.98% slot_getattr
         - 96.59% ExecHashGetHashValue
            - ExecHashJoin
               - ExecProcNode
                  + 85.12% ExecHashJoin
                  + 14.88% MultiExecHash
         + 3.41% ExecMakeFunctionResultNoSets
      + 14.02% slot_getsomeattrs
   + 1.58% ExecEvalScalarVarFast
 

I.e. nearly all calls to slot_deform_tuple come from slot_getattr in
ExecHashGetHashValue().  And nearly all the time in slot_getattr is
spent on code only executed for actual tuples:
       │              if (tuple == NULL)           /* internal error */
  0.18 │        test   %rax,%rax
       │      ↓ je     223
       │              /*
       │               * (We have to check this separately because of various inheritance and
       │               * table-alteration scenarios: the tuple could be either longer or shorter
       │               * than the tupdesc.)
       │               */
       │              tup = tuple->t_data;
  0.47 │        mov    0x10(%rax),%rsi
       │              if (attnum > HeapTupleHeaderGetNatts(tup))
 75.42 │        movzwl 0x12(%rsi),%eax
  0.70 │        and    $0x7ff,%eax
  0.47 │        cmp    %eax,%ebx
       │      ↓ jg     e8
 

- Andres



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2017-03-15 16:07:14 -0400, Tom Lane wrote:
>> As for ExecHashGetHashValue, it's most likely going to be working from
>> virtual tuples passed up to the join, which won't benefit from
>> predetermination of the last column to be accessed.  The
>> tuple-deconstruction would have happened while projecting in the scan
>> node below.

> I think the physical tuple stuff commonly thwarts that argument?  On
> master for tpch's Q5 you can e.g. see the following profile (master):

Hmmm ... I think you're mistaken in fingering the physical-tuple
optimization per se, but maybe skipping ExecProject at the scan level
would cause this result?

I've thought for some time that it was dumb to have the executor
reverse-engineering this info at plan startup anyway.  We could make the
planner mark each table scan node with the highest column number that the
plan will access, and use that to drive a slot_getsomeattrs call in
advance of any access to tuple contents.
        regards, tom lane



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:
On 2017-03-15 17:33:46 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-03-15 16:07:14 -0400, Tom Lane wrote:
> >> As for ExecHashGetHashValue, it's most likely going to be working from
> >> virtual tuples passed up to the join, which won't benefit from
> >> predetermination of the last column to be accessed.  The
> >> tuple-deconstruction would have happened while projecting in the scan
> >> node below.
> 
> > I think the physical tuple stuff commonly thwarts that argument?  On
> > master for tpch's Q5 you can e.g. see the following profile (master):
> 
> Hmmm ... I think you're mistaken in fingering the physical-tuple
> optimization per se, but maybe skipping ExecProject at the scan level
> would cause this result?

I think those are often related (i.e. we replace a smaller targetlist
with a "physical" one, which then allows skipping ExecProject()).


> I've thought for some time that it was dumb to have the executor
> reverse-engineering this info at plan startup anyway.

Yea, it'd be good if this (and some similar tasks like building interim
tuple descriptors) could be moved to the planner.  But:

> We could make the planner mark each table scan node with the highest
> column number that the plan will access, and use that to drive a
> slot_getsomeattrs call in advance of any access to tuple contents.

probably isn't sufficient - we build non-virtual tuples in a good number
of places (sorts, tuplestore using stuff like nodeMaterial, nodeHash.c
output, ...).  I suspect it'd have measurable negative consequences if
we removed the deforming logic for all expressions/projections above
such nodes.  I guess we could just do such a logic for every Plan node?

- Andres



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2017-03-15 17:33:46 -0400, Tom Lane wrote:
>> We could make the planner mark each table scan node with the highest
>> column number that the plan will access, and use that to drive a
>> slot_getsomeattrs call in advance of any access to tuple contents.

> probably isn't sufficient - we build non-virtual tuples in a good number
> of places (sorts, tuplestore using stuff like nodeMaterial, nodeHash.c
> output, ...).  I suspect it'd have measurable negative consequences if
> we removed the deforming logic for all expressions/projections above
> such nodes.  I guess we could just do such a logic for every Plan node?

[ scratches head... ]  What deforming logic do you think I'm proposing
removing?
        regards, tom lane



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:
On 2017-03-15 18:16:57 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-03-15 17:33:46 -0400, Tom Lane wrote:
> >> We could make the planner mark each table scan node with the highest
> >> column number that the plan will access, and use that to drive a
> >> slot_getsomeattrs call in advance of any access to tuple contents.
> 
> > probably isn't sufficient - we build non-virtual tuples in a good number
> > of places (sorts, tuplestore using stuff like nodeMaterial, nodeHash.c
> > output, ...).  I suspect it'd have measurable negative consequences if
> > we removed the deforming logic for all expressions/projections above
> > such nodes.  I guess we could just do such a logic for every Plan node?
> 
> [ scratches head... ]  What deforming logic do you think I'm proposing
> removing?

I thought you were suggesting that we don't do the get_last_attnums (and
inlined version in the isSimpleVar case) at execution time anymore,
instead relying on logic in the planner to know how much to deform ahead
of time.  Then we'd do slot_getsomeattrs in the appropriate places.  But
I understood you suggesting to do so only in scan nodes - which doesn't
seem sufficient, due to the use of materialized / minimal tuples in
other types of nodes.  Did I misunderstand?

- Andres



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2017-03-15 18:16:57 -0400, Tom Lane wrote:
>> [ scratches head... ]  What deforming logic do you think I'm proposing
>> removing?

> I thought you were suggesting that we don't do the get_last_attnums (and
> inlined version in the isSimpleVar case) at execution time anymore,
> instead relying on logic in the planner to know how much to deform ahead
> of time.  Then we'd do slot_getsomeattrs in the appropriate places.  But
> I understood you suggesting to do so only in scan nodes - which doesn't
> seem sufficient, due to the use of materialized / minimal tuples in
> other types of nodes.  Did I misunderstand?

We would need to do it anywhere that we might be projecting from a
materialized tuple, I suppose.  From the planner's standpoint it would be
about as easy to do this for all plan nodes as only selected ones.

Anyway the core point here seems to be that skipping ExecProject misses a
bet because the underlying tuple doesn't get disassembled.  A quick-hack
way of seeing if this helps might be to do slot_getallattrs in the places
where we skip ExecProject.  I'm not sure we'd want that as a permanent
solution, because it's possible that it extracts columns not actually
needed, but it should at least move the hotspot in your example case.
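
(Roughly, in the scan node's result path - a sketch, not a worked-out
patch, and the placement is hypothetical:

    /* tuple wasn't disassembled by ExecProject; do it ourselves */
    if (node->ps.ps_ProjInfo == NULL)
        slot_getallattrs(slot);

or so.)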
        regards, tom lane



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> [ new patches ]

I've started to look at 0004, and the first conclusion I've come to
is that it's *direly* short of documentation.  To the point that I'd
vote against committing it if something isn't done about that.  As
an example, it's quite unclear how ExprEvalSteps acquire correct
resnull/resvalue pointers, and the algorithm for that seems nontrivial.
It doesn't help any that the arguments of ExecInitExprRec are entirely
undocumented.

I think it would be worth creating a README file giving an overview
of how all of this patch is supposed to work.  You also need to do a
whole lot more work on the function-level comments.

A specific thing I noticed in the particular area of
what-gets-returned-where is this bit in EEOP_SCALARARRAYOP setup:

+                /*
+                 * Evaluate array argument into our return value, overwrite
+                 * with comparison results afterwards.
+                 */
+                ExecInitExprRec((Expr *) lsecond(opexpr->args), parent, state,
+                                resv, resnull);

That scares me quite a bit, because it smells exactly like the sort of
premature optimization that bit us on the rear in CVE-2016-5423 (cf commit
f0c7b789a).  What's it really buying us to overwrite the return value
early rather than storing into the fcinfo's second argument slot?
(The memory of that CVE is part of what's prompting me to demand a clear
explanation of the algorithm for deciding where steps return their
results.  Even if this particular code is safe, somebody is going to do
something unsafe in future if there's not a documented rule to follow.)

Another thing that ties into the do-I-understand-this-at-all question
is this bit:

+        EEO_CASE(EEOP_BOOL_AND_STEP_FIRST)
+        {
+            *op->d.boolexpr.anynull = false;
+
+            /*
+             * Fallthrough (can't be last - ANDs have two arguments at least).
+             */
+        }
+
+        EEO_CASE(EEOP_BOOL_AND_STEP)

It seems like this is missing an "op++;" before falling through.  If it
isn't, because really BOOL_AND_STEP_FIRST is defined as clearing anynull
and then also doing a regular BOOL_AND_STEP, then the comment seems rather
off point.  It should be more like "Fall through to do regular AND step
processing as well".  The place where the comment would be on point
is probably over here:

+                        case AND_EXPR:
+                            /*
+                             * ANDs have at least two arguments, so that
+                             * no step needs to be both FIRST and LAST.
+                             */
+                            Assert(list_length(boolexpr->args) >= 2);
+
+                            if (off == 0)
+                                scratch.opcode = EEOP_BOOL_AND_STEP_FIRST;
+                            else if (off + 1 == nargs)
+                                scratch.opcode = EEOP_BOOL_AND_STEP_LAST;
+                            else
+                                scratch.opcode = EEOP_BOOL_AND_STEP;
+                            break;

although I think the Assert ought to be examining nargs not
list_length(boolexpr->args) so that it has some visible connection to the
code after it.  (This all applies to OR processing as well, of course.)

BTW, it sure seems like ExecInitExprRec and related code ought to set
off all sorts of valgrind complaints?  It's not being careful at all
to ensure that all fields of the "scratch" record get initialized before
we memcpy it to someplace.
        regards, tom lane



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:
On 2017-03-15 18:48:28 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-03-15 18:16:57 -0400, Tom Lane wrote:
> >> [ scratches head... ]  What deforming logic do you think I'm proposing
> >> removing?
> 
> > I thought you were suggesting that we don't do the get_last_attnums (and
> > inlined version in the isSimpleVar case) at execution time anymore,
> > instead relying on logic in the planner to know how much to deform ahead
> > of time.  Then we'd do slot_getsomeattrs in the appropriate places.  But
> > I understood you suggesting to do so only in scan nodes - which doesn't
> > seem sufficient, due to the use of materialized / minimal tuples in
> > other types of nodes.  Did I misunderstand?
> 
> We would need to do it anywhere that we might be projecting from a
> materialized tuple, I suppose.  From the planner's standpoint it would be
> about as easy to do this for all plan nodes as only selected ones.

Yea.


> Anyway the core point here seems to be that skipping ExecProject misses a
> bet because the underlying tuple doesn't get disassembled.  A quick-hack
> way of seeing if this helps might be to do slot_getallattrs in the places
> where we skip ExecProject.
> I'm not sure we'd want that as a permanent
> solution, because it's possible that it extracts columns not actually
> needed, but it should at least move the hotspot in your example case.

Hm, that could hurt pretty badly if we actually only access the first
few columns.

I wonder if we shouldn't essentially get rid of all slot_getattr()
calls, and replace them with one slot_getsomeattrs() and then direct
tts_values/nulls access.  Where we compute the column number is then
essentially a separate discussion.
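
I.e. roughly this transformation (sketch; "maxattno" is assumed to be
precomputed somewhere, which is the separate discussion):

    /* now: every access re-checks tts_nvalid, may deform incrementally */
    datum = slot_getattr(slot, attno, &isnull);

    /* instead: deform once up to the largest needed column, then
     * access the arrays directly */
    slot_getsomeattrs(slot, maxattno);
    datum  = slot->tts_values[attno - 1];
    isnull = slot->tts_isnull[attno - 1];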


Looking around, most of them seem to access multiple columns, and several
of them probably can't conveniently be done via the planner:

src/backend/catalog/partition.c
1635:                   datum = slot_getattr(slot, keycol, &isNull);

accesses all the partitioning columns and has a convenient location to
compute the column to deform to (RelationGetPartitionDispatchInfo).


src/backend/catalog/index.c
1797:                   iDatum = slot_getattr(slot, keycol, &isNull);

accesses all the indexed columns and has a convenient location to
compute the column to deform to (BuildIndexInfo).


src/backend/utils/adt/orderedsetaggs.c
1196:           Datum           d = slot_getattr(slot, nargs + 1, &isnull);
1359:           Datum           d = slot_getattr(slot, nargs + 1, &isnull);

no benefit.


src/backend/executor/nodeMergeAppend.c
246:            datum1 = slot_getattr(s1, attno, &isNull1);
247:            datum2 = slot_getattr(s2, attno, &isNull2);

accesses all columns in the sort key, convenient place to prepare
(ExecInitMergeAppend).


src/backend/executor/execQual.c
594:     * caught inside slot_getattr).  What we have to check for here is the
600:     * Note: we allow a reference to a dropped attribute.  slot_getattr will
637:    return slot_getattr(slot, attnum, isNull);
676:    return slot_getattr(slot, attnum, isNull);
1681:                           return slot_getattr(fcache->funcResultSlot, 1, isNull);

Already converted in proposed expression evaluation patch.


src/backend/executor/nodeSubplan.c
367:                    dvalue = slot_getattr(slot, 1, &disnull);
395:                    prmdata->value = slot_getattr(slot, col, &(prmdata->isnull));
565:                    prmdata->value = slot_getattr(slot, col,
1015:                   dvalue = slot_getattr(slot, 1, &disnull);

Accesses all params, convenient location to compute column number
(ExecInitSubPlan).


src/backend/executor/execCurrent.c
186:            /* Use slot_getattr to catch any possible mistakes */
188:                    DatumGetObjectId(slot_getattr(scanstate->ss_ScanTupleSlot,
193:                    DatumGetPointer(slot_getattr(scanstate->ss_ScanTupleSlot,

Accesses only system columns.


src/backend/executor/nodeSetOp.c
107:    flag = DatumGetInt32(slot_getattr(inputslot,

Not an actual column.


src/backend/executor/nodeGatherMerge.c
678:            datum1 = slot_getattr(s1, attno, &isNull1);
679:            datum2 = slot_getattr(s2, attno, &isNull2);

Same as mergeAppend (huh, why did we copy this code), i.e. beneficial.


src/backend/executor/functions.c
970:            value = slot_getattr(slot, 1, &(fcinfo->isnull));

no benefit.


src/backend/executor/execJunk.c
253:    return slot_getattr(slot, attno, isNull);

no benefit.


src/backend/executor/nodeNestloop.c
136:                            prm->value = slot_getattr(outerTupleSlot,

accesses all future param values, convenient place to compute column
number.


src/backend/executor/execGrouping.c
100:            attr1 = slot_getattr(slot1, att, &isNull1);
102:            attr2 = slot_getattr(slot2, att, &isNull2);
170:            attr1 = slot_getattr(slot1, att, &isNull1);
175:            attr2 = slot_getattr(slot2, att, &isNull2);
501:            attr = slot_getattr(slot, att, &isNull);

accesses all grouped-on values, but only some have a convenient place to
compute column number.


src/backend/access/common/printtup.c
545:            attr = slot_getattr(slot, i + 1, &isnull);

Debug routine.


contrib/postgres_fdw/postgres_fdw.c
3213:                   value = slot_getattr(slot, attnum, &isnull);

Computes all parameter values, convenient-ish place to compute column
number.

- Andres



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:
Hi,

On 2017-03-15 20:09:03 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > [ new patches ]
>
> I've started to look at 0004, and the first conclusion I've come to
> is that it's *direly* short of documentation.  To the point that I'd
> vote against committing it if something isn't done about that.

Yea, I asked for input about what's hard to understand and what's not -
I stared at this for a *lot* of time, and it all kinda looks easy-ish
now.  I'm more than willing to expand on the below, and other pieces.


> As  an example, it's quite unclear how ExprEvalSteps acquire correct
> resnull/resvalue pointers, and the algorithm for that seems nontrivial.
> It doesn't help any that the arguments of ExecInitExprRec are entirely
> undocumented.

Generally whatever wants the result of a (sub-)expression passes in the
desired resvalue/resnull.  E.g. when doing a function call the
individual arguments are each prepared for evaluation using
ExecInitExprRec() and resvalue/resnull are pointing into fcinfo's
arg/nulls[i].
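
In code that looks roughly like this (simplified; the surrounding
variable setup is paraphrased, not quoted from the patch):

    int         argno = 0;
    ListCell   *lc;

    foreach(lc, func->args)
    {
        Expr       *arg = (Expr *) lfirst(lc);

        /* each argument evaluates directly into the fcinfo slots */
        ExecInitExprRec(arg, parent, state,
                        &fcinfo->arg[argno],        /* resvalue */
                        &fcinfo->argnull[argno]);   /* resnull */
        argno++;
    }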


> I think it would be worth creating a README file giving an overview
> of how all of this patch is supposed to work.  You also need to do a
> whole lot more work on the function-level comments.

Ok.


> A specific thing I noticed in the particular area of
> what-gets-returned-where is this bit in EEOP_SCALARARRAYOP setup:
>
> +                /*
> +                 * Evaluate array argument into our return value, overwrite
> +                 * with comparison results afterwards.
> +                 */
> +                ExecInitExprRec((Expr *) lsecond(opexpr->args), parent, state,
> +                                resv, resnull);
>
> That scares me quite a bit, because it smells exactly like the sort of
> premature optimization that bit us on the rear in CVE-2016-5423 (cf commit
> f0c7b789a).

I don't think there's a danger similar to f0c7b789a here, because the
"caller" (i.e. the node that needs the expression's result) expects
resvalue/null to be overwritten. It'll e.g. be the value "slot" of one
arm (is there a better name for one part of a boolean expression?) of a
boolean expression.


> What's it really buying us to overwrite the return value
> early rather than storing into the fcinfo's second argument slot?

That'd work just as well.


> (The memory of that CVE is part of what's prompting me to demand a clear
> explanation of the algorithm for deciding where steps return their
> results.  Even if this particular code is safe, somebody is going to do
> something unsafe in future if there's not a documented rule to follow.)

I don't think there's a danger here, but I think you more generally have
a point.


> Another thing that ties into the do-I-understand-this-at-all question
> is this bit:
>
> +        EEO_CASE(EEOP_BOOL_AND_STEP_FIRST)
> +        {
> +            *op->d.boolexpr.anynull = false;
> +
> +            /*
> +             * Fallthrough (can't be last - ANDs have two arguments at least).
> +             */
> +        }
> +
> +        EEO_CASE(EEOP_BOOL_AND_STEP)
>
> It seems like this is missing an "op++;" before falling through.  If it
> isn't, because really BOOL_AND_STEP_FIRST is defined as clearing anynull
> and then also doing a regular BOOL_AND_STEP, then the comment seems rather
> off point.

It's intended to fall through this way, i.e. the difference between
_FIRST and not is just that only the former clears anynull.  What the
comment is about, admittedly too cryptically, is that the _LAST step
that then evaluates anynull cannot be the same step as
EEOP_BOOL_AND_STEP_FIRST, because bool AND/OR always has at least two
"arms".  Will expand / move.


> BTW, it sure seems like ExecInitExprRec and related code ought to set
> off all sorts of valgrind complaints?  It's not being careful at all
> to ensure that all fields of the "scratch" record get initialized before
> we memcpy it to someplace.

It worked not long ago - valgrind's replacement memcpy() doesn't trigger
undefined memory warnings, it just copies the "definedness" of each byte
(or bit?).  But your point gives me an idea: It seems like a good idea
to VALGRIND_MAKE_MEM_UNDEFINED() the "scratch" step at some convenient
places, so the definedness of individual operations is more useful.
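
Concretely, something like (placement hypothetical - wherever "scratch"
gets reused for a new step):

    /* reset definedness so valgrind tracks each op separately */
    VALGRIND_MAKE_MEM_UNDEFINED(&scratch, sizeof(scratch));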


Thanks for the look!


- Andres



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andreas Karlsson
Date:
Hi,

I got a test failure with this version of the patch in postgres_fdw.
It looks to me like it was caused by a typo in the source code, which is
fixed in the attached patch.

After applying this patch check-world passes.

Andreas


Attachment

Re: [HACKERS] WIP: Faster Expression Processing v4

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2017-03-15 20:09:03 -0400, Tom Lane wrote:
>> That scares me quite a bit, because it smells exactly like the sort of
>> premature optimization that bit us on the rear in CVE-2016-5423 (cf commit
>> f0c7b789a).

> I don't think there's a danger similar to f0c7b789a here, because the
> "caller" (i.e. the node that needs the expression's result) expects
> resvalue/null to be overwritten.

Yeah, that's what I thought when I wrote the broken code in ExecEvalCase,
too.  It was wrong.  Basically you've got to be sure that no aliasing
can occur, and I think the only way to be safe about that is to have a
very clear rule about where results are allowed to get returned to,
preferably one that doesn't ever re-use the same target.  (I think the
repeated use of the same subexpression result address for the arms of
an AND or OR is okay, but it would be a good idea to have a clear
statement of why.)

The thing that actually made the ExecEvalCase code into a bug was that
we were using ExprContext-level fields to store the current caseValue,
allowing aliasing to occur across nested CASEs.  I think that in
this implementation, it ought to be possible to get rid of
ExprContext.caseValue_datum et al altogether, in favor of some storage
location that's private to each CASE expression.  I'm a bit disappointed
that that doesn't seem to have happened.

Eventually, I would also like to find a way to remove the restriction
imposed by the other part of f0c7b789a, ie that we can't inline a SQL
function when that would result in intermixing two levels of CASE
expression.  An implementation along the lines of what I've sketched
above could handle that easily enough, as long as we could identify
which nested level of CASE a particular CaseTestExpr belongs to.
I don't know how to do that offhand, but it seems like it ought to be
soluble if we put a bit of thought into it.
        regards, tom lane



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Tom Lane
Date:
I wrote:
> Andres Freund <andres@anarazel.de> writes:
>> I don't think there's a danger similar to f0c7b789a here, because the
>> "caller" (i.e. the node that needs the expression's result) expects
>> resvalue/null to be overwritten.

> Yeah, that's what I thought when I wrote the broken code in ExecEvalCase,
> too.  It was wrong.

Along the same line, I notice that you've got some expr step types
overwriting their own input, the various flavors of EEOP_BOOLTEST for
example.  Maybe that's all right but it doesn't really give me a warm
feeling, especially when other single-argument operations like
EEOP_BOOL_NOT_STEP are done differently.  Again, I think a clear
explanation of the design is essential to allow people to reason about
whether this sort of trickery is safe.
        regards, tom lane



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> [ latest patches ]

I looked through 0001 (the composite-type-dependencies one).  Although
I agree that it'd be good to tighten things up in that area, I do not
think we want this specific patch: it tightens things too much.  Consider
this variant of the existing test case in create_view.sql:

create table tt (f1 int, f2 int, f3 int);
create function ttf() returns setof tt as 'select * from tt' language sql;
create view vv as select f3 from ttf();
alter table tt drop column f3;

Current code allows the above (and causes the view to return nulls
afterwards).  The 0001 patch would forbid the DROP, which is fine,
but it would also forbid dropping either of the other two table
columns, which I think is insupportable.

Also, given the above table and function, consider

create view vvv as select ttf();

This view returns a 3-column composite type.  Now do

alter table tt add column f4 int;

Now the view returns a 4-column composite type.  But at this point the
patch will let you drop the f4 column, but not any of the earlier three.
That's just weird.

So I'm unhappy with the specific decisions made in 0001.  I think what we
really want there, probably, is for find_expr_references_walker to do more
than nothing with Vars referencing non-RELATION RTEs.

Having said all that, I think that 0001 is contributing very little to the
goals of this patch set.  Andres stated that he wanted it so as to drop
some of the one-time checks that execQual.c currently does for Vars, but
I'm not really convinced that we could do that safely even with these
additional dependencies in place.  Moreover, I really doubt that there's
a horrible performance cost from including something like
    if (unlikely(op->first_execution))
        out_of_line_checking_subroutine(...);

in the execution code for Vars.  And that certainly isn't adding any
complexity for JIT compilation that we don't face anyway for other
execution step types.

So my recommendation is to drop 0001 and include the same one-time
checks that execQual.c currently has as out-of-line one-time checks
in the new code.  We can revisit that later, but time grows short for
v10.  I would much rather have a solid version of 0004 and not 0001,
than not have anything for v10 because we spent too much time dealing
with adding new dependencies.
        regards, tom lane



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:
Hi,

On 2017-03-17 11:36:30 -0400, Tom Lane wrote:
> Having said all that, I think that 0001 is contributing very little to the
> goals of this patch set.  Andres stated that he wanted it so as to drop
> some of the one-time checks that execQual.c currently does for Vars, but
> I'm not really convinced that we could do that safely even with these
> additional dependencies in place.  Moreover, I really doubt that there's
> a horrible performance cost from including something like
> 
>     if (unlikely(op->first_execution))
>         out_of_line_checking_subroutine(...);

> in the execution code for Vars.

But it actually does (well, for some relatively small value of horrible).
The issue is that op->first_execution is actually badly predictable,
because it will be set back/forth between executions of different
expressions (in the same plantree).  Obviously you'll not notice if you
have a Var and then some expensive stuff, but it's noticeable for
cheap-ish expressions (say e.g. a single Var).  So the branch prediction
often doesn't handle this gracefully - it also just expands the set of
to-be-tracked jumps.

If we had a decent way to actually check this during ExecInitExpr() (et
al), the whole discussion would be different - I'd be all for expanding
the set of such checks.  But I couldn't find a decent way to get there
- when expressions are initialized we don't even get an ExprContext (not
to speak of valid slots), nor is parent->plan very helpful.


That said, it seems this is something that has to wait for a later
release; I'm putting back in similar logic to what was there before (not
a branch, but changing the opcode to a non-checking variant).
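
I.e. roughly like this (sketch - opcode and helper names follow the
current patch, details may differ):

    EEO_CASE(EEOP_INNER_VAR_FIRST)
    {
        TupleTableSlot *innerslot = econtext->ecxt_innertuple;

        /* one-time check that the slot is compatible with the Var */
        CheckVarSlotCompatibility(innerslot, op->d.var.attnum + 1,
                                  op->d.var.vartype);

        /* switch to the non-checking opcode for later executions */
        op->opcode = EEO_OPCODE(EEOP_INNER_VAR);

        /* FALL THROUGH to EEOP_INNER_VAR */
    }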


> And that certainly isn't adding any
> complexity for JIT compilation that we don't face anyway for other
> execution step types.

Obviously an if (op->first_execution) check isn't an issue; it's
actually only doing something on the first time through that's not
easily possible.



> So my recommendation is to drop 0001 and include the same one-time
> checks that execQual.c currently has as out-of-line one-time checks
> in the new code.  We can revisit that later, but time grows short for
> v10.  I would much rather have a solid version of 0004 and not 0001,
> than not have anything for v10 because we spent too much time dealing
> with adding new dependencies.

Doing that (+README).

Greetings,

Andres Freund



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> That said, it seems this is something that has to wait for a later
> release, I'm putting back in similar logic as there was before (not a
> branch, but change the opcode to a non-checking variant).

Yeah, I was wondering if changing the opcode would be preferable to
a first-time flag.

>> So my recommendation is to drop 0001 and include the same one-time
>> checks that execQual.c currently has as out-of-line one-time checks
>> in the new code.  We can revisit that later, but time grows short for
>> v10.  I would much rather have a solid version of 0004 and not 0001,
>> than not have anything for v10 because we spent too much time dealing
>> with adding new dependencies.

> Doing that (+README).

OK.  I believe that we can get this committed after the documentation
problems are sorted.  I noticed a lot of small things that bugged me,
mostly sloppy comments, but I think that the most efficient way to
handle those is for me to make an editorial pass on your next version.
        regards, tom lane



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:
Hi,


On 2017-03-16 11:15:16 -0400, Tom Lane wrote:
> The thing that actually made the ExecEvalCase code into a bug was that
> we were using ExprContext-level fields to store the current caseValue,
> allowing aliasing to occur across nested CASEs.  I think that in
> this implementation, it ought to be possible to get rid of
> ExprContext.caseValue_datum et al altogether, in favor of some storage
> location that's private to each CASE expression.  I'm a bit disappointed
> that that doesn't seem to have happened.

The patch actually does so - during ExecInitExprRec the "relevant"
case/domain testval is stored in ExprState->innermost_*, those pointers
are then stored directly in the relevant steps for
CaseTest/CoerceToDomainValue evaluation.  Unfortunately
CaseTest/CoerceToDomainValue are reused outside of domain / case
expressions in a bunch of places (plpgsql uses CaseTest for casts
evaluation, validateDomainConstraint/domain_check_input evaluate domain
constraints without a CoerceToDomain node).  I.e.
ExprContext.caseValue_datum etc. aren't used for normal expressions
anymore, just for the ad-hoc hackery in a bunch of places.
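
The compile-time pattern is roughly this (lightly simplified from the
patch; exact variable names may differ):

    /* save the outer CASE's test-value location */
    Datum      *save_innermost_caseval = state->innermost_caseval;
    bool       *save_innermost_casenull = state->innermost_casenull;

    /* point at storage private to this CASE expression */
    state->innermost_caseval = caseval;
    state->innermost_casenull = casenull;

    ExecInitExprRec(whenExpr, parent, state, resv, resnull);

    /* restore, so a surrounding CASE isn't aliased */
    state->innermost_caseval = save_innermost_caseval;
    state->innermost_casenull = save_innermost_casenull;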

I'd like to get rid of those usages, but that'd recurse into rewriting
plpgsql casts and other random pieces of code into a different approach
- something I'd like to avoid doing at the same as this already large
patch.


I've been pondering if we can't entirely get rid of CaseTest etc, the
amount of hackery required seems not like a good thing. One way I'd
prototyped was to replace them with PARAM_EXEC nodes - then the whole
issue of them potentially having different values at different parts of
an expression vanishes because the aliasing is removed.



> Eventually, I would also like to find a way to remove the restriction
> imposed by the other part of f0c7b789a, ie that we can't inline a SQL
> function when that would result in intermixing two levels of CASE
> expression.  An implementation along the lines of what I've sketched
> above could handle that easily enough, as long as we could identify
> which nested level of CASE a particular CaseTestExpr belongs to.
> I don't know how to do that offhand, but it seems like it ought to be
> soluble if we put a bit of thought into it.

I haven't thought overly much about this, but I agree, it looks like it
should be doable.

That seems to suggest my PARAM_EXEC idea isn't necessarily perfect -
inlining would cause aliasing again, but it'd also not be hard to fix
that up.

- Andres



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2017-03-16 11:15:16 -0400, Tom Lane wrote:
>> The thing that actually made the ExecEvalCase code into a bug was that
>> we were using ExprContext-level fields to store the current caseValue,
>> allowing aliasing to occur across nested CASEs.  I think that in
>> this implementation, it ought to be possible to get rid of
>> ExprContext.caseValue_datum et al altogether, in favor of some storage
>> location that's private to each CASE expression.  I'm a bit disappointed
>> that that doesn't seem to have happened.

> ...  Unfortunately
> CaseTest/CoerceToDomainValue are reused outside of domain / case
> expressions in a bunch of places (plpgsql uses CaseTest for casts
> evaluation, validateDomainConstraint/domain_check_input evaluate domain
> constraints without a CoerceToDomain node).

Yeah, there's various stuff that did that for expediency.

> I'd like to get rid of those usages, but that'd recurse into rewriting
> plpgsql casts and other random pieces of code into a different approach
> - something I'd like to avoid doing at the same as this already large
> patch.

Agreed, maybe we should just plan to clean that up later.

> I've been pondering if we can't entirely get rid of CaseTest etc, the
> amount of hackery required seems not like a good thing. One way I'd
> prototyped was to replace them with PARAM_EXEC nodes - then the whole
> issue of them potentially having different values at different parts of
> an expression vanishes because the aliasing is removed.

Yes, replacing all of that with Param slots had occurred to me too.
We might want to keep the special parse node types for convenience in
reverse-listing, but having them act just like PARAM_EXEC for execution
purposes seems promising.

>> Eventually, I would also like to find a way to remove the restriction
>> imposed by the other part of f0c7b789a, ie that we can't inline a SQL
>> function when that would result in intermixing two levels of CASE
>> expression.

> That seems to suggest my PARAM_EXEC idea isn't necessarily perfect -
> inlining would cause aliasing again, but it'd also not be hard to fix
> that up.

Right, inlining would probably require some parameter-renumbering to avoid
aliasing.  But at least we'd have a clear framework for how to handle it,
whereas the way things are now, it's just impossible.
        regards, tom lane



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:
Hi,


On 2017-03-19 23:55:50 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > I've been pondering if we can't entirely get rid of CaseTest etc, the
> > amount of hackery required seems not like a good thing. One way I'd
> > prototyped was to replace them with PARAM_EXEC nodes - then the whole
> > issue of them potentially having different values at different parts of
> > an expression vanishes because the aliasing is removed.
> 
> Yes, replacing all of that with Param slots had occurred to me too.
> We might want to keep the special parse node types for convenience in
> reverse-listing, but having them act just like PARAM_EXEC for execution
> purposes seems promising.

As long as that special parse-time node is part of the same value
numbering, that makes sense (could just make it a subtype of Param,
ala PARAM_CASE).  I don't think we actually do anything useful in
ruleutils etc with either CaseTest or CoerceToDomainValue.

- Andres



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:
Hi,

On 2017-03-15 20:09:03 -0400, Tom Lane wrote:
> I think it would be worth creating a README file giving an overview
> of how all of this patch is supposed to work.  You also need to do a
> whole lot more work on the function-level comments.

I tried to improve upon both fronts.  I've added the higher level
explanation to executor/README, but I don't feel very strong about that.

I'm not quite sure it's exactly what you wanted, however - the above ask
could also be understood to have more of a motivational angle,
describing why and what exactly is changed?  I'm also still not sure how
understandable it is for anybody who hasn't had their head in this for a
while...


I've also, as discussed nearby, re-added code akin to the checks in
ExecEvalScalarVar, with the discussed adjustment of the opcodes for
later executions.  Now that I've done that, I'm not sure I like that
approach that much - another alternative would be to change an
ExprState's evalfunc to ExecCheckAndExecute() after initialization.
That'd have the advantage to work nicely for JIT.  Either way, that can
trivially be changed later.

Additionally I added a regression test for the nearly entirely untested
nodeTidscan.c, after I'd broken it previously without noticing (thanks
Andreas).

I started a run through valgrind, without complaints up to
create_function_2. Can't run all of it (and contrib) right now without
prematurely running out of power on a plane.


Did I understand correctly that you'd rather just merge
ExecGetLastAttnums into execExpr.c, instead of making it globally
available?

Greetings,

Andres Freund


Attachment

Re: [HACKERS] WIP: Faster Expression Processing v4

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2017-03-15 20:09:03 -0400, Tom Lane wrote:
>> I think it would be worth creating a README file giving an overview
>> of how all of this patch is supposed to work.  You also need to do a
>> whole lot more work on the function-level comments.

> I tried to improve upon both fronts.  I've added the higher level
> explanation to executor/README, but I don't feel very strong about that.

> I'm not quite sure it's exactly what you wanted, however - the above ask
> could also be understood to have more of a motivational angle,
> describing why and what exactly is changed?  I'm also still not sure how
> understandable it is for anybody who hasn't had their head in this for a
> while...

Well, I wasn't totally sure what was needed either.  I'm coming to this
relatively fresh, having paid little attention to the thread up to now,
so maybe I'll try to add material to what you wrote as I figure things
out.

> Did I understand correctly that you'd rather just merge
> ExecGetLastAttnums into execExpr.c, instead of making it globally
> available?

Yeah, just moving it over seems like the thing to do for now.  We can
expose it later if there proves to be an actual reason to do that,
but as I mentioned, I'm doubtful that there will be one.

I'll start reading these...
        regards, tom lane



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> Additionally I added a regression test for the nearly entirely untested
> nodeTidscan.c, after I'd broken it previously without noticing (thanks
> Andreas).

I went ahead and pushed this part, since it seemed pretty uncontroversial.
I added a bit more stuff to get the LOC measurement up on the planner
side too.
        regards, tom lane



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Tom Lane
Date:
... is there a reason why resultnum for EEOP_ASSIGN_* steps is declared
size_t and not just int?  Since it's an array index, and one that
certainly can't be bigger than AttrNumber, that seems rather confusing.
        regards, tom lane



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:
On 2017-03-20 16:06:27 -0400, Tom Lane wrote:
> ... is there a reason why resultnum for EEOP_ASSIGN_* steps is declared
> size_t and not just int?  Since it's an array index, and one that
> certainly can't be bigger than AttrNumber, that seems rather confusing.

Not that I can see, no.  I guess I might have "overcompensated" when
changing it from AttrNumber - AttrNumber isn't a good idea because that
needs an extra move-zero-extend, because 16bit indexing isn't that well
supported on x86.  But that doesn't mean it should be a 64bit number -
to the contrary actually.

- Andres



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2017-03-20 16:06:27 -0400, Tom Lane wrote:
>> ... is there a reason why resultnum for EEOP_ASSIGN_* steps is declared
>> size_t and not just int?  Since it's an array index, and one that
>> certainly can't be bigger than AttrNumber, that seems rather confusing.

> Not that I can see, no.  I guess I might have "overcompensated" when
> changing it from AttrNumber - AttrNumber isn't a good idea because that
> needs an extra move-zero-extend, because 16bit indexing isn't that well
> supported on x86.  But that doesn't mean it should be a 64bit number -
> to the contrary actually.

OK, will fix in the edits I'm working on.
        regards, tom lane



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Tom Lane
Date:
I've been busily hacking away on this, trying to make things cleaner and
fix a couple of bugs I stumbled across.  (Confusion between ExecQual and
ExecCheck, for instance - we apparently lack regression tests exercising
constraints-returning-null in corner cases such as table rewrite.)  It
will probably be a day or two more before I'm done.
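
(For the record, the semantic difference at issue, as a sketch - not
the exact code: a NULL result means "fail" for a qual, but "pass" for
a constraint check:

static bool
ExecCheck(ExprState *state, ExprContext *econtext)
{
    bool        isnull;
    Datum       ret;

    ret = ExecEvalExprSwitchContext(state, econtext, &isnull);

    /* a NULL constraint result counts as success ... */
    if (isnull)
        return true;

    /* ... whereas ExecQual treats NULL as failure */
    return DatumGetBool(ret);
}

so conflating the two gives wrong answers exactly when a constraint
expression returns NULL.)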

A couple of naming questions:

* I concur with Heikki's dislike for the file name execInterpExpr.c.
Would it make it any better to switch to execExprInterp.c?  I think
that having all this stuff under a common pattern execExpr*.c would
be a good thing (and I notice you've already got one comment referring
to them that way ...)

* execQual.c doesn't seem to have a unifying reason to exist anymore.
It certainly has little to do with evaluating typical qual expressions;
what's left in there seems to be mostly concerned with SRFs.  I feel
like it might be a good idea to rename it, but to what?  execExprUtils.c
perhaps?  Or maybe we should destroy it altogether, shoving the SRF
stuff into nodeFunctionscan.c and moving what little remains into
execUtils.c.

* I do not like the function name ExecInstantiateExpr().  Webster's
defines "instantiate" as "to represent (an abstraction) by a concrete
instance", which does not seem to me to have a lot to do with what this
function actually does.  There's nothing very abstract about its input.
More, the implication of "instantiate" is that you can instantiate any
number of representatives of the same abstraction, but this scribbles
on the input in a one-way fashion.  I think perhaps something like
ExecPrepareExpr or ExecFinishExpr or something along that line would
be better, but nothing is really standing out as le mot juste.

Thoughts?
        regards, tom lane



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:
On 2017-03-22 10:41:06 -0400, Tom Lane wrote:
> I've been busily hacking away on this, trying to make things cleaner and
> fix a couple of bugs I stumbled across.  (Confusion between ExecQual and
> ExecCheck, for instance - we apparently lack regression tests exercising
> constraints-returning-null in corner cases such as table rewrite.)  It
> will probably be a day or two more before I'm done.

Thanks!


> A couple of naming questions:
> 
> * I concur with Heikki's dislike for the file name execInterpExpr.c.
> Would it make it any better to switch to execExprInterp.c?  I think
> that having all this stuff under a common pattern execExpr*.c would
> be a good thing (and I notice you've already got one comment referring
> to them that way ...)

That works for me.


> * execQual.c doesn't seem to have a unifying reason to exist anymore.
> It certainly has little to do with evaluating typical qual expressions;
> what's left in there seems to be mostly concerned with SRFs.  I feel
> like it might be a good idea to rename it, but to what?  execExprUtils.c
> perhaps?  Or maybe we should destroy it altogether, shoving the SRF
> stuff into nodeFunctionscan.c and moving what little remains into
> execUtils.c.

Yea, I was wondering about that too.  What would we do with
GetAttributeByName/Num?


> * I do not like the function name ExecInstantiateExpr().  Webster's
> defines "instantiate" as "to represent (an abstraction) by a concrete
> instance", which does not seem to me to have a lot to do with what this
> function actually does.

It perhaps makes a *bit* more sense if you view it from the POV that,
with the future JIT support (WIP versions of which I posted previously),
it'd actually create a compiled function which'd largely be independent
of the ->steps (except for the non-hotpath functions, which'd still
end up using it).  So one of the above "concrete instances" would be the
interpreted version, another the compiled one.  No, not an entirely
convincing argument.


> There's nothing very abstract about its input.
> More, the implication of "instantiate" is that you can instantiate any
> number of representatives of the same abstraction, but this scribbles
> on the input in a one-way fashion.  I think perhaps something like
> ExecPrepareExpr or ExecFinishExpr or something along that line would
> be better, but nothing is really standing out as le mot juste.

Either of those would work, but they don't strike me as perfect either,
and I can't come up with something better (ExecReadyExprForExec()?).

Greetings,

Andres Freund



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2017-03-22 10:41:06 -0400, Tom Lane wrote:
>> * execQual.c doesn't seem to have a unifying reason to exist anymore.
>> It certainly has little to do with evaluating typical qual expressions;
>> what's left in there seems to be mostly concerned with SRFs.  I feel
>> like it might be a good idea to rename it, but to what?  execExprUtils.c
>> perhaps?  Or maybe we should destroy it altogether, shoving the SRF
>> stuff into nodeFunctionscan.c and moving what little remains into
>> execUtils.c.

> Yea, I was wondering about that too.  What would we do with
> GetAttributeByName/Num?

I was thinking execUtils.c for those.

>> * I do not like the function name ExecInstantiateExpr().  Webster's
>> defines "instantiate" as "to represent (an abstraction) by a concrete
>> instance", which does not seem to me to have a lot to do with what this
>> function actually does.
>> ...

> Either of those work, but they don't strike me as perfect either, but I
> can't come up with something better (ExecReadyExprForExec()?).

Actually, ExecReadyExpr seems kind of nice for this (using "ready" in its
verb sense, "to prepare (someone or something) for an activity or
purpose").
        regards, tom lane



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Tom Lane
Date:
BTW, I'm fairly concerned by what you did in nodeTidscan.c, ie delaying
compile of the TID expressions until TidListCreate.  I think that probably
fails for cases involving, eg, subplans in the expressions; we need
subplans to get linked to the parent node, and this way won't do it till
(maybe) too late.  Barring objection I'm going to rearrange that so that
we still do the compile part at executor start.
        regards, tom lane



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:
On 2017-03-22 13:15:41 -0400, Tom Lane wrote:
> BTW, I'm fairly concerned by what you did in nodeTidscan.c, ie delaying
> compile of the TID expressions until TidListCreate.  I think that probably
> fails for cases involving, eg, subplans in the expressions; we need
> subplans to get linked to the parent node, and this way won't do it till
> (maybe) too late.  Barring objection I'm going to rearrange that so that
> we still do the compile part at executor start.

No objection here.



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Tom Lane
Date:
Looking at some of the coding choices you made here, I see that you're
making a hard assumption that called functions never scribble on their
fcinfo->arg/argnull arrays.  I do not believe that we've ever had such
an assumption before.  Are we comfortable with that?  If so, we'd
better document it.
        regards, tom lane



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:
Hi,

On 2017-03-23 13:14:59 -0400, Tom Lane wrote:
> Looking at some of the coding choices you made here, I see that you're
> making a hard assumption that called functions never scribble on their
> fcinfo->arg/argnull arrays.  I do not believe that we've ever had such
> an assumption before.

I think we did that before, e.g. ExecEvalScalarArrayOp(). Think there's
others too.


> Are we comfortable with that?  If so, we'd better document it.

I think it's ok, but we indeed should document it. I recall a note
somewhere... Can't find it anywhere however, might have misremembered a
note about pass-by-ref arguments.  fmgr/README? A note in
FunctionCallInfoData's definition?

Greetings,

Andres Freund



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Tom Lane
Date:
Stylistic thought ... I am wondering if it wouldn't be a good idea
to replace EEOP_CASE_WHEN_STEP, EEOP_CASE_THEN_STEP, EEOP_COALESCE,
and EEOP_ARRAYREF_CHECKINPUT with instructions defined in a less
usage-dependent way as
    EEOP_JUMP              unconditional jump
    EEOP_JUMP_IF_NULL      jump if step result is null
    EEOP_JUMP_IF_NOT_NULL  jump if step result isn't null
    EEOP_JUMP_IF_NOT_TRUE  jump if step result isn't TRUE

One could imagine later filling out this set with the other BoolTest
condition types, but that seems to be all we need right now.
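
For instance, compiling one CASE WHEN arm might then look roughly like
this (a sketch with assumed field names, not code from the patch):

    /* evaluate the WHEN condition into resv/resnull */
    ExecInitExprRec(whenExpr, parent, state, resv, resnull);

    /* skip the THEN expression unless the condition is TRUE */
    scratch.opcode = EEOP_JUMP_IF_NOT_TRUE;
    scratch.d.jump.jumpdone = -1;       /* target filled in below */
    ExprEvalPushStep(state, &scratch);
    whenstep = state->steps_len - 1;

    /* evaluate the THEN result into the CASE's output variable */
    ExecInitExprRec(thenExpr, parent, state, resv, resnull);

    /* unconditionally jump past the remaining arms */
    scratch.opcode = EEOP_JUMP;
    scratch.d.jump.jumpdone = -1;       /* filled in when CASE is done */
    ExprEvalPushStep(state, &scratch);

    /* the next WHEN arm (or the ELSE) starts here */
    state->steps[whenstep].d.jump.jumpdone = state->steps_len;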

These are basically just renamings of the step types that exist now,
although EEOP_ARRAYREF_CHECKINPUT would have to drop its not-very-
necessary Assert(!op->d.arrayref.state->isassignment).  Well, I guess
I should say that they're renamings of the semantics that I have
for these steps in my working copy; for instance, I got rid of
casewhen.value/casewhen.isnull in favor of letting CASE WHEN expressions
evaluate into the CASE's final output variable.

At least to me, I think the compiling code would be more readable
this way.  I find WHEN_STEP and THEN_STEP a bit odd because they are
emitted after, not before, the expressions you'd think they control.
ARRAYREF_CHECKINPUT is pretty vaguely named, too.
        regards, tom lane



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Tom Lane
Date:
I found a rather nasty bug :-( ... the comment in EEOP_INNER_VAR_FIRST about

+            /*
+             * Can't assert tts_nvalid, as wholerow var evaluation or such
+             * could have materialized the slot - but the contents are still
+             * valid :/
+             */
+            Assert(op->d.var.attnum >= 0);

is actually averting its eyes from a potentially serious problem.  If we
go through ExecEvalWholeRowVar, which calls ExecFetchSlotTupleDatum, and
ExecFetchSlotTuple decides it has to materialize the tuple, then
ExecMaterializeSlot does this:
    /*
     * Drop the pin on the referenced buffer, if there is one.
     */
    if (BufferIsValid(slot->tts_buffer))
        ReleaseBuffer(slot->tts_buffer);

    slot->tts_buffer = InvalidBuffer;

    /*
     * Mark extracted state invalid.  This is important because the slot is
     * not supposed to depend any more on the previous external data; we
     * mustn't leave any dangling pass-by-reference datums in tts_values.
     * However, we have not actually invalidated any such datums, if there
     * happen to be any previously fetched from the slot.  (Note in particular
     * that we have not pfree'd tts_mintuple, if there is one.)
     */
    slot->tts_nvalid = 0;


The problem here is that once we drop the buffer pin, any pointers we may
have into on-disk data are dangling pointers --- we're at risk of some
other backend taking away that shared buffer.  (So I'm afraid that the
commentary there is overly optimistic.)  So even though those pointers
may still be there beyond tts_nvalid, subsequent references to them are
very dangerous.
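
The failure mode is easy to see in miniature.  Here's a self-contained toy
(stand-in structs; "recycling" is simulated by overwriting the buffer
contents rather than by actual shared-buffer replacement):

#include <stdio.h>
#include <string.h>

/* crude stand-ins for a pinned shared buffer and a tuple slot */
typedef struct { char data[16]; int pinned; } BufferStub;
typedef struct { const char *tts_values[1]; int tts_nvalid; } SlotStub;

static void
release_buffer(BufferStub *buf)
{
    buf->pinned = 0;
    /* simulate another backend recycling the page once the pin is gone */
    memcpy(buf->data, "RECYCLED", 9);
}

int
main(void)
{
    BufferStub buf = {"hello", 1};
    SlotStub   slot;

    slot.tts_values[0] = buf.data;  /* datum points into the pinned buffer */
    slot.tts_nvalid = 1;

    release_buffer(&buf);           /* the effect of ExecMaterializeSlot's
                                     * pin drop, for our purposes */
    slot.tts_nvalid = 0;

    /* reading past tts_nvalid now yields whatever replaced the page */
    printf("%s\n", slot.tts_values[0]);     /* "RECYCLED", not "hello" */
    return 0;
}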

I think that it's pretty hard to hit this in practice, maybe impossible,
because the normal case for an "on-disk" tuple is that
TTS_HAS_PHYSICAL_TUPLE is true, so that ExecFetchSlotTuple won't change
the state of the slot.  If we have a virtual tuple that has to be
materialized, then by that very token it won't have a buffer pin to drop.
But I find this fragile as heck, and the aforesaid patch comment surely
isn't adequately documenting the safety considerations.  Also, if there
ever were a live bug here, reproducing it would be damn hard because of
the low probability that a just-unpinned buffer would get replaced any
time soon.  (Hm, I wonder whether the buffer cache needs something
analogous to the syscaches' CLOBBER_CACHE_ALWAYS behavior...)

Besides which, I really really don't like the lack of an "attnum <
tts_nvalid" assertion there; that's just obviously failing to check for
very simple bugs, such as getting the FETCHSOME steps wrong.

So I think that we have got to fix ExecEvalWholeRowVar so that it doesn't
clobber the state of the slot.  Right at the moment, the only way to do
that seems to be to do this instead of ExecFetchSlotTupleDatum:
    tuple = ExecCopySlotTuple(slot);
    dtuple = (HeapTupleHeader)
        DatumGetPointer(heap_copy_tuple_as_datum(tuple,
                                                 slot->tts_tupleDescriptor));
    heap_freetuple(tuple);

That's kind of annoying because of the double copying involved, but
tuptoaster.c doesn't expose any functionality for this except
heap_copy_tuple_as_datum().  I figure we can improve it later --- it looks
like we can refactor heap_copy_tuple_as_datum to expose a function that
reads from Datum/isnull arrays, and then call it on the slot's
tts_values/tts_isnull arrays.  Seems like it might be a good idea to
think a bit harder about the TupleTableSlot APIs, too, and reduce the
number of cases where execTuples exposes destructive changes to the
state of a slot.

Also, while trying to test the above scenario, I realized that the patch
as submitted was being way too cavalier about where it was applying 
CheckVarSlotCompatibility and so on.  The ASSIGN_FOO_VAR steps, for
instance, had no protection at all.  Think I have that all fixed up
though.

I hope to have a fully reviewed patch to pass back to you tomorrow.
Or Saturday at the latest.
        regards, tom lane



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:
Hi,

On 2017-03-23 17:40:55 -0400, Tom Lane wrote:
> Stylistic thought ... I am wondering if it wouldn't be a good idea
> to replace EEOP_CASE_WHEN_STEP, EEOP_CASE_THEN_STEP, EEOP_COALESCE,
> and EEOP_ARRAYREF_CHECKINPUT with instructions defined in a less
> usage-dependent way as
> 
>     EEOP_JUMP        unconditional jump
>     EEOP_JUMP_IF_NULL    jump if step result is null
>     EEOP_JUMP_IF_NOT_NULL    jump if step result isn't null
>     EEOP_JUMP_IF_NOT_TRUE    jump if step result isn't TRUE
> 
> One could imagine later filling out this set with the other BoolTest
> condition types, but that seems to be all we need right now.

Hm, no arguments against, but I'm also not particularly excited about
the change.


> These are basically just renamings of the step types that exist now,
> although EEOP_ARRAYREF_CHECKINPUT would have to drop its not-very-
> necessary Assert(!op->d.arrayref.state->isassignment).

I won't shed a tear about that assert's removal.


> Well, I guess I should say that they're renamings of the semantics
> that I have for these steps in my working copy; for instance, I got
> rid of casewhen.value/casewhen.isnull in favor of letting CASE WHEN
> expressions evaluate into the CASE's final output variable.

That sounds like a sensible change (in the abstract, I obviously haven't
seen your working copy).


Greetings,

Andres Freund



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:
On 2017-03-23 20:36:32 -0400, Tom Lane wrote:
> I found a rather nasty bug :-( ... the comment in EEOP_INNER_VAR_FIRST about
> 
> +            /*
> +             * Can't assert tts_nvalid, as wholerow var evaluation or such
> +             * could have materialized the slot - but the contents are still
> +             * valid :/
> +             */
> +            Assert(op->d.var.attnum >= 0);
> 
> is actually averting its eyes from a potentially serious problem.  If we
> go through ExecEvalWholeRowVar, which calls ExecFetchSlotTupleDatum, and
> ExecFetchSlotTuple decides it has to materialize the tuple, then
> ExecMaterializeSlot does this:
> 
>     /*
>      * Drop the pin on the referenced buffer, if there is one.
>      */
>     if (BufferIsValid(slot->tts_buffer))
>         ReleaseBuffer(slot->tts_buffer);
> 
>     slot->tts_buffer = InvalidBuffer;
> 
>     /*
>      * Mark extracted state invalid.  This is important because the slot is
>      * not supposed to depend any more on the previous external data; we
>      * mustn't leave any dangling pass-by-reference datums in tts_values.
>      * However, we have not actually invalidated any such datums, if there
>      * happen to be any previously fetched from the slot.  (Note in particular
>      * that we have not pfree'd tts_mintuple, if there is one.)
>      */
>     slot->tts_nvalid = 0;
> 
> 
> The problem here is that once we drop the buffer pin, any pointers we may
> have into on-disk data are dangling pointers --- we're at risk of some
> other backend taking away that shared buffer.  (So I'm afraid that the
> commentary there is overly optimistic.)  So even though those pointers
> may still be there beyond tts_nvalid, subsequent references to them are
> very dangerous.

This applies to the code in master as well, no?  An ExecEvalScalarVar()
followed by an ExecEvalWholeRowVar() would have precisely the same
effect?  Do we need to do anything about this in the back-branches,
given how unlikely this is going to be in practice?


> So I think that we have got to fix ExecEvalWholeRowVar so that it doesn't
> clobber the state of the slot.

That seems like a good plan.


> Also, while trying to test the above scenario, I realized that the patch
> as submitted was being way too cavalier about where it was applying 
> CheckVarSlotCompatibility and so on.  The ASSIGN_FOO_VAR steps, for
> instance, had no protection at all.

I don't think that's true - the assign checks had copied the code from
the old ExecBuildProjectionInfo, setting isSimpleVar iff
(!attr->attisdropped && variable->vartype == attr->atttypid) - we can
check that for projections in contrast to normal expressions because we
already know the slot.  The relevant comment for that, from before the
patch, is:

 * inputDesc.  (Note: if there is a type mismatch then ExecEvalScalarVar
 * will probably throw an error at runtime, but we leave that to it.)
 */
 


> I hope to have a fully reviewed patch to pass back to you tomorrow.
> Or Saturday at the latest.

Cool.

Greetings,

Andres Freund



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:
On 2017-03-23 20:36:32 -0400, Tom Lane wrote:
> So I think that we have got to fix ExecEvalWholeRowVar so that it doesn't
> clobber the state of the slot.  Right at the moment, the only way to do
> that seems to be to do this instead of ExecFetchSlotTupleDatum:
> 
>     tuple = ExecCopySlotTuple(slot);
>     dtuple = (HeapTupleHeader)
>         DatumGetPointer(heap_copy_tuple_as_datum(tuple,
>                                                  slot->tts_tupleDescriptor));
>     heap_freetuple(tuple);

Hm.  One disadvantage would be that repeated whole-row references to the
same table would be a bit slower, because we'd repeatedly form a tuple
from a virtual one - but I have a hard time coming up with a scenario
where that'd matter.  I'd suspect that in the end it'd probably even
have a *positive* performance impact, because right now the next scalar
access will have to deform the whole tuple again, and that seems like a
lot more likely scenario.

Greetings,

Andres Freund



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2017-03-23 20:36:32 -0400, Tom Lane wrote:
>> The problem here is that once we drop the buffer pin, any pointers we may
>> have into on-disk data are dangling pointers --- we're at risk of some
>> other backend taking away that shared buffer.  (So I'm afraid that the
>> commentary there is overly optimistic.)  So even though those pointers
>> may still be there beyond tts_nvalid, subsequent references to them are
>> very dangerous.

> This applies to the code in master as well, no?  An ExecEvalScalarVar()
> followed by an ExecEvalWholeRowVar() would have precisely the same
> effect?

Yeah.  The other order would be safe, because ExecEvalScalarVar would do
slot_getattr which would re-extract the value from the newly materialized
tuple.  But there definitely seems to be a hazard for the order you
mentioned.

> Do we need to do anything about this in the back-branches,
> given how unlikely this is going to be in practice?

Probably not.  As I mentioned, I think this may be only theoretical rather
than real, if you believe that buffer pins would only be associated with
slots holding references to regular tuples.  And even if it's not
theoretical, the odds of seeing a failure in the field seem pretty tiny
given that a just-released buffer shouldn't be subject to recycling for
a fair while.  But I don't want to leave it like this going forward.

>> So I think that we have got to fix ExecEvalWholeRowVar so that it doesn't
>> clobber the state of the slot.

> That seems like a good plan.

Yeah.  I have the stopgap code in my working copy, and will look at
refactoring the tuptoaster code for better performance later.

>> Also, while trying to test the above scenario, I realized that the patch
>> as submitted was being way too cavalier about where it was applying 
>> CheckVarSlotCompatibility and so on.  The ASSIGN_FOO_VAR steps, for
>> instance, had no protection at all.

> I don't think that's true - the assign checks had copied the code from
> the old ExecBuildProjectionInfo, setting isSimpleVar iff
> (!attr->attisdropped && variable->vartype == attr->atttypid) - we can
> check that for projections in contrast to normal expressions because we
> already know the slot.

Hmm, I see ... but that only works in the cases where the caller of
ExecBuildProjectionInfo supplied a source slot, and a lot of 'em don't.
As the code stands, we are unable to use ASSIGN_FOO_VAR in quite a lot
of places, including everywhere above the relation scan level.

I'd already put in the infrastructure to add ASSIGN_FOO_VAR_FIRST
step types.  I could take it back out, but I wonder if it wouldn't be
smarter to keep it and remove the restriction in ExecBuildProjectionInfo.
Or maybe we could have ExecBuildProjectionInfo emit either
ASSIGN_FOO_VAR_FIRST or ASSIGN_FOO_VAR depending on whether it can prove
the reference safe.
        regards, tom lane



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:
On 2017-03-23 21:26:19 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-03-23 20:36:32 -0400, Tom Lane wrote:
> >> The problem here is that once we drop the buffer pin, any pointers we may
> >> have into on-disk data are dangling pointers --- we're at risk of some
> >> other backend taking away that shared buffer.  (So I'm afraid that the
> >> commentary there is overly optimistic.)  So even though those pointers
> >> may still be there beyond tts_nvalid, subsequent references to them are
> >> very dangerous.
> 
> > This applies to the code in master as well, no?  An ExecEvalScalarVar()
> > followed by an ExecEvalWholeRowVar() would have precisely the same
> > effect?
> 
> Yeah.  The other order would be safe, because ExecEvalScalarVar would do
> slot_getattr which would re-extract the value from the newly materialized
> tuple.  But there definitely seems to be a hazard for the order you
> mentioned.
> 
> > Do we need to do anything about this in the back-branches,
> > given how unlikely this is going to be in practice?
> 
> Probably not.  As I mentioned, I think this may be only theoretical rather
> than real, if you believe that buffer pins would only be associated with
> slots holding references to regular tuples.  And even if it's not
> theoretical, the odds of seeing a failure in the field seem pretty tiny
> given that a just-released buffer shouldn't be subject to recycling for
> a fair while.  But I don't want to leave it like this going forward.

Ok.


> >> Also, while trying to test the above scenario, I realized that the patch
> >> as submitted was being way too cavalier about where it was applying 
> >> CheckVarSlotCompatibility and so on.  The ASSIGN_FOO_VAR steps, for
> >> instance, had no protection at all.
> 
> > I don't think that's true - the assign checks had copied the code from
> > the old ExecBuildProjectionInfo, setting isSimpleVar iff
> > (!attr->attisdropped && variable->vartype == attr->atttypid) - we can
> > check that for projections in contrast to normal expressions because we
> > already know the slot.
> 
> Hmm, I see ... but that only works in the cases where the caller of
> ExecBuildProjectionInfo supplied a source slot, and a lot of 'em
> don't.

Right, the old and new code comment on that:
 * inputDesc can be NULL, but if it is not, we check to see whether simple
 * Vars in the tlist match the descriptor.  It is important to provide
 * inputDesc for relation-scan plan nodes, as a cross check that the relation
 * hasn't been changed since the plan was made.  At higher levels of a plan,
 * there is no need to recheck.
 

and that seems reasonable to me?  That said, I think we can remove
that assumption by checking just once.


> As the code stands, we are unable to use ASSIGN_FOO_VAR in quite a lot
> of places, including everywhere above the relation scan level.

Hm? If inputDesc isn't given we just, before and after, do:

        if (!inputDesc)
            isSimpleVar = true; /* can't check type, assume OK */
 



> I'd already put in the infrastructure to add ASSIGN_FOO_VAR_FIRST
> step types.  I could take it back out, but I wonder if it wouldn't be
> smarter to keep it and remove the restriction in ExecBuildProjectionInfo.
> Or maybe we could have ExecBuildProjectionInfo emit either
> ASSIGN_FOO_VAR_FIRST or ASSIGN_FOO_VAR depending on whether it can prove
> the reference safe.

I think it's probably ok to just leave the check in, and remove those
comments, and simplify the isSimpleVar stuff to only check if
 IsA(tle->expr, Var) && ((Var *) tle->expr)->varattno > 0)

- Andres



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2017-03-23 21:26:19 -0400, Tom Lane wrote:
>> Hmm, I see ... but that only works in the cases where the caller of
>> ExecBuildProjectionInfo supplied a source slot, and a lot of 'em
>> don't.

> Right, the old and new code comment on that:

>  * inputDesc can be NULL, but if it is not, we check to see whether simple
>  * Vars in the tlist match the descriptor.  It is important to provide
>  * inputDesc for relation-scan plan nodes, as a cross check that the relation
>  * hasn't been changed since the plan was made.  At higher levels of a plan,
>  * there is no need to recheck.

Ah, I'd forgotten the assumption that we only need to check this at scan
level.

>> I'd already put in the infrastructure to add ASSIGN_FOO_VAR_FIRST
>> step types.  I could take it back out, but I wonder if it wouldn't be
>> smarter to keep it and remove the restriction in ExecBuildProjectionInfo.
>> Or maybe we could have ExecBuildProjectionInfo emit either
>> ASSIGN_FOO_VAR_FIRST or ASSIGN_FOO_VAR depending on whether it can prove
>> the reference safe.

> I think it's probably ok to just leave the check in, and remove those
> comments, and simplify the isSimpleVar stuff to only check if
>  IsA(tle->expr, Var) && ((Var *) tle->expr)->varattno > 0)

Not sure.  It's a pretty fair amount of duplicative code, once you finish
dealing with all the ExecJustFoo functions in addition to the main code
paths.  At this point I'm inclined to take it back out and improve the
comments around ExecBuildProjectionInfo.
        regards, tom lane



Re: [HACKERS] WIP: Faster Expression Processing v4

From
Andres Freund
Date:
On 2017-03-23 21:58:03 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-03-23 21:26:19 -0400, Tom Lane wrote:
> >> Hmm, I see ... but that only works in the cases where the caller of
> >> ExecBuildProjectionInfo supplied a source slot, and a lot of 'em
> >> don't.
> 
> > Right, the old and new code comment on that:
> 
> >  * inputDesc can be NULL, but if it is not, we check to see whether simple
> >  * Vars in the tlist match the descriptor.  It is important to provide
> >  * inputDesc for relation-scan plan nodes, as a cross check that the relation
> >  * hasn't been changed since the plan was made.  At higher levels of a plan,
> >  * there is no need to recheck.
> 
> Ah, I'd forgotten the assumption that we only need to check this at scan
> level.
> 
> >> I'd already put in the infrastructure to add ASSIGN_FOO_VAR_FIRST
> >> step types.  I could take it back out, but I wonder if it wouldn't be
> >> smarter to keep it and remove the restriction in ExecBuildProjectionInfo.
> >> Or maybe we could have ExecBuildProjectionInfo emit either
> >> ASSIGN_FOO_VAR_FIRST or ASSIGN_FOO_VAR depending on whether it can prove
> >> the reference safe.
> 
> > I think it's probably ok to just leave the check in, and remove those
> > comments, and simplify the isSimpleVar stuff to only check if
> >  IsA(tle->expr, Var) && ((Var *) tle->expr)->varattno > 0)
> 
> Not sure.  It's a pretty fair amount of duplicative code, once you finish
> dealing with all the ExecJustFoo functions in addition to the main code
> paths.  At this point I'm inclined to take it back out and improve the
> comments around ExecBuildProjectionInfo.

I'm ok with both.

- Andres



Re: WIP: Faster Expression Processing v4

From
Tom Lane
Date:
Another modest proposal:

I'm not really sold on the approach of using EEOP_FETCHSOME opcodes to
trigger initial tupleslot de-forming.  Certainly we want to have a single
slot_getsomeattrs call per source slot, but as-is, we need a separate
traversal over the expression tree just to precompute the max attribute
number needed.  That can't be good for expression compile speed, and
it introduces nontrivial bug risks because the code that does that
is completely decoupled from the code that emits the EEOP_VAR opcodes
(which are what's really relying on the de-forming to have happened).

What do you think about a design like this:

* Drop the FETCHSOME opcodes.

* Add fields to struct ExprState that will hold the maximum inner,
outer, and scan attribute numbers needed.

* ExecInitExpr initializes those fields to zero, and then during
ExecInitExprRec, whenever we generate an EEOP_VAR opcode, we do e.g.
    state->last_inner_attno = Max(state->last_inner_attno,
                                  variable->varattno);

* ExecInitExprSlots, get_last_attnums_walker, etc all go away.

* In the startup segment of ExecInterpExpr, add
    if (state->last_inner_attno > 0)
        slot_getsomeattrs(innerslot, state->last_inner_attno);
    if (state->last_outer_attno > 0)
        slot_getsomeattrs(outerslot, state->last_outer_attno);
    if (state->last_scan_attno > 0)
        slot_getsomeattrs(scanslot, state->last_scan_attno);
 

This would be a little different from the existing code as far as runtime
branch-prediction behavior goes, but it's not apparent to me that it'd be
any worse.  Also, for JIT purposes it'd still be entirely possible to
compile the slot_getsomeattrs calls in-line; you'd just be looking to a
different part of the ExprState struct to find out what to do.
        regards, tom lane



Re: WIP: Faster Expression Processing v4

From
Andres Freund
Date:
Hi,

On 2017-03-24 11:26:27 -0400, Tom Lane wrote:
> Another modest proposal:
> 
> I'm not really sold on the approach of using EEOP_FETCHSOME opcodes to
> trigger initial tupleslot de-forming.  Certainly we want to have a single
> slot_getsomeattrs call per source slot, but as-is, we need a separate
> traversal over the expression tree just to precompute the max attribute
> number needed.  That can't be good for expression compile speed, and
> it introduces nontrivial bug risks because the code that does that
> is completely decoupled from the code that emits the EEOP_VAR opcodes
> (which are what's really relying on the de-forming to have happened).

Hm.  We had the separate traversal for projections for a long while, and
I don't think there've been a lot of changes to the extraction of the
last attribute number.  I'm very doubtful that the cost of traversing
the expression twice is meaningful in comparison to the other costs.


> What do you think about a design like this:
> 
> * Drop the FETCHSOME opcodes.
> 
> * Add fields to struct ExprState that will hold the maximum inner,
> outer, and scan attribute numbers needed.
> 
> * ExecInitExpr initializes those fields to zero, and then during
> ExecInitExprRec, whenever we generate an EEOP_VAR opcode, we do e.g.
> 
>     state->last_inner_attno = Max(state->last_inner_attno,
>                                   variable->varattno);
> 
> * ExecInitExprSlots, get_last_attnums_walker, etc all go away.
> 
> * In the startup segment of ExecInterpExpr, add
> 
>     if (state->last_inner_attno > 0)
>         slot_getsomeattrs(innerslot, state->last_inner_attno);
>     if (state->last_outer_attno > 0)
>         slot_getsomeattrs(outerslot, state->last_outer_attno);
>     if (state->last_scan_attno > 0)
>         slot_getsomeattrs(scanslot, state->last_scan_attno);
> 
> This would be a little different from the existing code as far as runtime
> branch-prediction behavior goes, but it's not apparent to me that it'd be
> any worse.

I'd be surprised if it weren't.

I'm not super strongly against this setup, but I fail to see the benefit
of whacking this around.  I've benchmarked the previous/current setup
fairly extensively, and I'd rather not redo that.  In contrast to the
other changes you've talked about, this definitely is in the hot-path...


> Also, for JIT purposes it'd still be entirely possible to compile the
> slot_getsomeattrs calls in-line; you'd just be looking to a different
> part of the ExprState struct to find out what to do.

Yea, we could do that.

Greetings,

Andres Freund



Re: WIP: Faster Expression Processing v4

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2017-03-24 11:26:27 -0400, Tom Lane wrote:
>> Another modest proposal:
>> 
>> I'm not really sold on the approach of using EEOP_FETCHSOME opcodes to
>> trigger initial tupleslot de-forming.  Certainly we want to have a single
>> slot_getsomeattrs call per source slot, but as-is, we need a separate
>> traversal over the expression tree just to precompute the max attribute
>> number needed.  That can't be good for expression compile speed, and
>> it introduces nontrivial bug risks because the code that does that
>> is completely decoupled from the code that emits the EEOP_VAR opcodes
>> (which are what's really relying on the de-forming to have happened).

> Hm.  We had the separate traversal for projections for a long while, and
> I don't think there've been a lot of changes to the extraction of the
> last attribute number.

That's easily disproven just by looking at the code:
    /*
     * Don't examine the arguments or filters of Aggrefs or WindowFuncs,
     * because those do not represent expressions to be evaluated within the
     * calling expression's econtext.  GroupingFunc arguments are never
     * evaluated at all.
     */
    if (IsA(node, Aggref))
        return false;
    if (IsA(node, WindowFunc))
        return false;
    if (IsA(node, GroupingFunc))
        return false;
    return expression_tree_walker(node, get_last_attnums_walker,
                                  (void *) info);

The WindowFunc exception hasn't been there so long, and the GroupingFunc
one is very new.  And who's to say whether e.g. the recent XMLTABLE patch
got this right at all?  We could easily be extracting columns we don't
need to.

I'm willing to leave this as-is for the moment, but I really think we
should look into changing it (after the main patch is in).
        regards, tom lane



Re: WIP: Faster Expression Processing v4

From
Tom Lane
Date:
Attached is an updated patch.  I believe this is committable, but
since I whacked it around quite a bit, I'm sure you'll want to look
it over first.

Please be sure the commit message notes that function EXECUTE permissions
are now checked at executor startup not first call.  We need to document
that in the v10 release notes as an incompatibility, and I'm sure we'll
forget if the commit log doesn't point it out.

Some loose ends that remain to be looked at, though I do not think any
of these are reasons to postpone commit:

* I'm concerned that we now do not have enough check_stack_depth() calls.
In our off-list discussion I wanted to add one to ExecProcNode, and you
were unhappy about that ... but it'd still be a lot less stack checking
than we do now.

* I still think we should look into removing the EEOP_FETCHSOME op types,
or at least finding some other way to perform the calculation of the last
attnums in the mainline expression compilation path.

* As we discussed, ExecEvalWholeRowVar is now using a pretty inefficient
method for flattening tuples into datums.  I will take a to-do item to
fix this.

* ExecInitCheck is really just ExecInitExpr with a make_ands_explicit
call in front of it.  It turned out that most of the places that were
(or should have been) calling it were doing a make_ands_implicit call
for no other reason than to satisfy its API.  I changed most of those
to just call ExecInitExpr directly.  There are just a couple of call
sites left, and I think probably those are likewise not really that
happy with this API --- but I didn't take the time to chase down where
the expressions were coming from in those cases.  It seems possible
that we could remove ExecInitCheck/ExecPrepareCheck entirely.  I'll
look into this later, too.

* As we discussed, CaseTestValue/DomainValue are pretty messy and need
to be rethought.  That might not get done for v10 though.

* I'm not very happy that execSRF.c has two somewhat different
implementations of largely similar functionality, and
SetExprState.elidedFuncState seems like a wart.  That's mostly a
pre-existing problem of course.  I'm satisfied with leaving it as it is
for now, but eventually some refactoring there would be good.

The attached patch is against HEAD as of last night (commit 457a44487).

            regards, tom lane


Attachment

Re: WIP: Faster Expression Processing v4

From
Tom Lane
Date:
btw ... I just got around to looking at a code coverage report for this
patched version, and that reminded me of something I'd already suspected:
EEOP_INNER_SYSVAR and EEOP_OUTER_SYSVAR seem to be dead code.  That's
unsurprising, because we never try to access a tuple's system columns
above the scan level.  If a query asks for system columns, those get
passed up to upper query levels as ordinary user-side columns.

We could keep the execution support for those opcodes, or we could rip it
out and throw an error in execExpr.c if one would need to be generated.
I'm a little inclined to the latter, because it seems like the plan is
to grow more support for every opcode in the future.  We don't need to
be adding support for unreachable opcodes.
        regards, tom lane



Re: WIP: Faster Expression Processing v4

From
Tom Lane
Date:
More random musing ... have you considered making the jump-target fields
in expressions be relative rather than absolute indexes?  That is,
EEO_JUMP would look like
        op += (stepno); \
        EEO_DISPATCH(); \

instead of
        op = &state->steps[stepno]; \
        EEO_DISPATCH(); \

I have not carried out a full patch to make this work, but just making
that one change and examining the generated assembly code looks promising.
Instead of this
    movslq    40(%r14), %r8
    salq    $6, %r8
    addq    24(%rbx), %r8
    movq    %r8, %r14
    jmp    *(%r8)

we get this
    movslq    40(%r14), %rax
    salq    $6, %rax
    addq    %rax, %r14
    jmp    *(%r14)

which certainly looks like it ought to be faster.  Also, the real reason
I got interested in this at all is that with relative jumps, groups of
steps would be position-independent within the steps array, which would
enable some compile-time tricks that seem impractical with the current
definition.
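
For anyone who wants to experiment, here's a self-contained sketch of the
relative-jump variant (toy opcodes, not the executor's; computed goto is a
GCC/Clang extension):

#include <stdio.h>

typedef enum { OP_ADD, OP_SKIP_IF_POS, OP_DONE } OpCode;

typedef struct
{
    OpCode opcode;
    int    arg;                 /* operand, or relative jump offset */
    void  *label;               /* resolved inside run() */
} Op;

static long
run(Op *program, int nsteps)
{
    static void *const dispatch[] = { &&ADD, &&SKIP_IF_POS, &&DONE };
    long acc = 0;
    Op  *op;
    int  i;

    /* resolve each step's dispatch target once, up front */
    for (i = 0; i < nsteps; i++)
        program[i].label = dispatch[program[i].opcode];

#define EEO_NEXT()  do { op++; goto *op->label; } while (0)
#define EEO_JUMP(n) do { op += (n); goto *op->label; } while (0)

    op = program;
    goto *op->label;

ADD:
    acc += op->arg;
    EEO_NEXT();
SKIP_IF_POS:                    /* relative target: just op += n, no
                                 * &steps[i] indexing */
    if (acc > 0)
        EEO_JUMP(op->arg);
    EEO_NEXT();
DONE:
    return acc;

#undef EEO_NEXT
#undef EEO_JUMP
}

int
main(void)
{
    /* acc += 5; if (acc > 0) skip the next step; acc += 100; done */
    Op prog[] = {
        {OP_ADD, 5, NULL},
        {OP_SKIP_IF_POS, 2, NULL},
        {OP_ADD, 100, NULL},
        {OP_DONE, 0, NULL},
    };

    printf("%ld\n", run(prog, 4));      /* prints 5 */
    return 0;
}

The opcode-to-label resolution happens once before execution starts, which
is roughly how the interpreter readies expressions today; only the jump
arithmetic differs.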

BTW, now that I've spent a bit of time looking at the generated assembly
code, I'm kind of disinclined to believe any arguments about how we have
better control over branch prediction with the jump-threading
implementation.  At least with current gcc (6.3.1 on Fedora 25) at -O2,
what I see is multiple places jumping to the same indirect jump
instruction :-(.  It's not a total disaster: as best I can tell, all the
uses of EEO_JUMP remain distinct.  But gcc has chosen to implement about
40 of the 71 uses of EEO_NEXT by jumping to the same couple of
instructions that increment the "op" register and then do an indirect
jump :-(.

So it seems that we're at the mercy of gcc's whims as to which instruction
dispatches will be distinguishable to the hardware; which casts a very
dark shadow over any benchmarking-based arguments that X is better than Y
for branch prediction purposes.  Compiler version differences are likely
to matter a lot more than anything we do.
        regards, tom lane



Re: WIP: Faster Expression Processing v4

From
Andres Freund
Date:
Hi,

On 2017-03-24 17:16:15 -0400, Tom Lane wrote:
> Attached is an updated patch.  I believe this is committable, but
> since I whacked it around quite a bit, I'm sure you'll want to look
> it over first.

Indeed.

Points:

- It's going to take a while for me to get used to execExprInterp.c - I
  constantly try to switch to a nonexistent buffer ;)
 

- Like your approach with /* Check that current tupdesc doesn't have more
  fields than we allocated */ in ExecEvalFieldStoreDeForm().

- I like EEO_FLAG_IS_QUAL.

- I think at some point (not sure whether in your or in my revision)
  ScalarArrayOpExpr lost its permission check. Added that.

- Added note that we knowingly don't perform permission checks for input/output funcs.

- Both pre/post patch, CoerceViaIO doesn't invoke
  InvokeFunctionExecuteHook - I'm still unclear what that hook is
  useful for, so ...
 

- The !caseExpr->defresult result branch is currently unreachable (and its
  equivalent was before the patch) because transformCaseExpr() generates a
  default expression.  I'm inclined to replace the dead code with an
  assertion. Any reason not to do that?
 

- I see you kept determination of the set of constraints to be checked for
  domains at initialization time. Made note that that a) might change
  b) could change behaviour.
 


> Please be sure the commit message notes that function EXECUTE permissions
> are now checked at executor startup not first call.  We need to document
> that in the v10 release notes as an incompatibility, and I'm sure we'll
> forget if the commit log doesn't point it out.

Done.


> * As we discussed, ExecEvalWholeRowVar is now using a pretty inefficient
> method for flattening tuples into datums.  I will take a to-do item to
> fix this.

Thanks.


> * ExecInitCheck is really just ExecInitExpr with a make_ands_explicit
> call in front of it.  It turned out that most of the places that were
> (or should have been) calling it were doing a make_ands_implicit call
> for no other reason than to satisfy its API.  I changed most of those
> to just call ExecInitExpr directly.  There are just a couple of call
> sites left, and I think probably those are likewise not really that
> happy with this API --- but I didn't take the time to chase down where
> the expressions were coming from in those cases.  It seems possible
> that we could remove ExecInitCheck/ExecPrepareCheck entirely.  I'll
> look into this later, too.

Thanks^2.


> * As we discussed, CaseTestValue/DomainValue are pretty messy and need
> to be rethought.  That might not get done for v10 though.

Yea, I'm disinclined to tackle this just now, there's enough other stuff
pending - but I'm willing to tackle it for v11 unless you beat me to it.


> * I'm not very happy that execSRF.c has two somewhat different
> implementations of largely similar functionality, and
> SetExprState.elidedFuncState seems like a wart.  That's mostly a
> pre-existing problem of course.  I'm satisfied with leaving it as it is
> for now, but eventually some refactoring there would be good.

Yea, that's not pretty.  Given the historical difference in behaviour
between SRF and ROWS FROM (the latter always materializes, the former
only when the set function does so itself), I'm not sure they can easily
be simplified.

I'm not really sure how to get rid of the issue underlying
elidedFuncState?  I mean we could move that case into nodeFunctionscan,
above ExecMakeTableFunctionResult's level, but that doesn't really seem
like an improvement.


I think there's some argument to be made that we should move all of
execSRF.c's logic to their respective callsites.  There's not really
much shared code: ExecEvalFuncArgs(), tupledesc_match(), init_sexpr() -
and at least the latter would even become a bit simpler if we didn't
support both callers.


Thanks a lot for your work on the patch!

And pushed.

- Andres



Re: WIP: Faster Expression Processing v4

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> - The !caseExpr->defresult result branch is currently unreachable (and
>   its equivalent was before the patch) because transformCaseExpr()
>   generates a default expression.  I'm inclined to replace the dead code
>   with an assertion. Any reason not to do that?

Ah-hah.  I'd noted that that code wasn't being reached when I did some
code coverage checks this morning, but I hadn't found the cause yet.
Yeah, replacing the if-test with an assert seems fine.

> I think there's some argument to be made that we should move all of
> execSRF.c's logic to their respective callsites.  There's not really
> much shared code: ExecEvalFuncArgs(), tupledesc_match(), init_sexpr() -
> and at least the latter would even become a bit simpler if we didn't
> support both callers.

Possibly.  All that code could stand to be rethought, probably, now that
its mission is just to handle the SRF case.  But at least all the cruft is
in one place for the moment.  And I don't think it's anything we have to
fix before v10, it's just cosmetic.

> Thanks a lot for your work on the patch!

You're welcome!
        regards, tom lane



Re: WIP: Faster Expression Processing v4

From
Andres Freund
Date:
On 2017-03-25 12:22:15 -0400, Tom Lane wrote:
> More random musing ... have you considered making the jump-target fields
> in expressions be relative rather than absolute indexes?  That is,
> EEO_JUMP would look like
> 
>         op += (stepno); \
>         EEO_DISPATCH(); \
> 
> instead of
> 
>         op = &state->steps[stepno]; \
>         EEO_DISPATCH(); \
> 
> I have not carried out a full patch to make this work, but just making
> that one change and examining the generated assembly code looks promising.
> Instead of this
> 
>     movslq    40(%r14), %r8
>     salq    $6, %r8
>     addq    24(%rbx), %r8
>     movq    %r8, %r14
>     jmp    *(%r8)
> 
> we get this
> 
>     movslq    40(%r14), %rax
>     salq    $6, %rax
>     addq    %rax, %r14
>     jmp    *(%r14)

That seems like a good idea.  I've not done this in the committed
version (and I don't think we necessarily need to do this before the
release), but for the future it seems like a good plan.  It makes sense
that it's faster - there's no need to reference state->steps.


> which certainly looks like it ought to be faster.  Also, the real reason
> I got interested in this at all is that with relative jumps, groups of
> steps would be position-independent within the steps array, which would
> enable some compile-time tricks that seem impractical with the current
> definition.

Indeed.


> BTW, now that I've spent a bit of time looking at the generated assembly
> code, I'm kind of disinclined to believe any arguments about how we have
> better control over branch prediction with the jump-threading
> implementation.

I measured the performance difference between using it and not using it,
and it came out a pretty clear plus. On gcc 6.3, gcc master snapshot,
and clang-3.9.  It's not just that more jumps are duplicated, it's also
that the switch() always adds a boundary check.
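
For contrast with the computed-goto sketch earlier in the thread, the same
toy program dispatched via switch (again toy opcodes, not the executor's);
the boundary check is what the compiler must emit before indexing whatever
jump table it builds for the switch:

#include <stdio.h>

typedef enum { OP_ADD, OP_SKIP_IF_POS, OP_DONE } OpCode;

typedef struct { OpCode opcode; int arg; } Op;

static long
run_switch(const Op *program)
{
    long acc = 0;
    int  ip = 0;

    for (;;)
    {
        const Op *op = &program[ip];

        /* the compiler range-checks op->opcode before the table jump */
        switch (op->opcode)
        {
            case OP_ADD:
                acc += op->arg;
                ip++;
                break;
            case OP_SKIP_IF_POS:
                ip += (acc > 0) ? op->arg : 1;
                break;
            case OP_DONE:
                return acc;
        }
    }
}

int
main(void)
{
    const Op prog[] = {
        {OP_ADD, 5}, {OP_SKIP_IF_POS, 2}, {OP_ADD, 100}, {OP_DONE, 0}
    };

    printf("%ld\n", run_switch(prog));  /* prints 5 */
    return 0;
}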


> At least with current gcc (6.3.1 on Fedora 25) at -O2,
> what I see is multiple places jumping to the same indirect jump
> instruction :-(.  It's not a total disaster: as best I can tell, all the
> uses of EEO_JUMP remain distinct.  But gcc has chosen to implement about
> 40 of the 71 uses of EEO_NEXT by jumping to the same couple of
> instructions that increment the "op" register and then do an indirect
> jump :-(.

Yea, I see some of that too - "usually" when there's more than just the
jump in common.  I think there's some gcc variables that influence this
(min-crossjump-insns (5), max-goto-duplication-insns (8)).  Might be
worthwhile experimenting with setting them locally via a pragma or such.
I think Ants wanted to experiment with that, too.

Then there's also https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71785
which causes some forms of computed goto (not ours I think) to be
deoptimized in gcc.


Greetings,

Andres Freund



Re: WIP: Faster Expression Processing v4

From
Ants Aasma
Date:
On Sun, Mar 26, 2017 at 12:22 AM, Andres Freund <andres@anarazel.de> wrote:
>> At least with current gcc (6.3.1 on Fedora 25) at -O2,
>> what I see is multiple places jumping to the same indirect jump
>> instruction :-(.  It's not a total disaster: as best I can tell, all the
>> uses of EEO_JUMP remain distinct.  But gcc has chosen to implement about
>> 40 of the 71 uses of EEO_NEXT by jumping to the same couple of
>> instructions that increment the "op" register and then do an indirect
>> jump :-(.
>
> Yea, I see some of that too - "usually" when there's more than just the
> jump in common.  I think there's some gcc variables that influence this
> (min-crossjump-insns (5), max-goto-duplication-insns (8)).  Might be
> worthwhile experimenting with setting them locally via a pragma or such.
> I think Ants wanted to experiment with that, too.

I haven't had the time to research this properly, but initial tests
show that with GCC 6.2 adding

#pragma GCC optimize ("no-crossjumping")

fixes merging of the op tail jumps.
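
For reference, the file-level form is just that pragma ahead of the function
definitions it should affect - a trivial compilable sketch (GCC-specific):

/* affects all functions defined after this point in the file */
#pragma GCC optimize ("no-crossjumping")

int
dispatch_stub(int x)
{
    return x + 1;
}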

Some quick and dirty benchmarking suggests that the benefit for the
interpreter is about 15% (5% speedup on a workload that spends 1/3 in
ExecInterpExpr). My idea of prefetching op->resnull/resvalue to local
vars before the indirect jump is somewhere between a tiny benefit and
no effect, certainly not worth introducing extra complexity. Clang 3.8
does the correct thing out of the box and is a couple of percent
faster than GCC with the pragma.

Regards,
Ants Aasma



Re: WIP: Faster Expression Processing v4

From
Andres Freund
Date:

On March 25, 2017 4:56:11 PM PDT, Ants Aasma <ants.aasma@eesti.ee> wrote:
>On Sun, Mar 26, 2017 at 12:22 AM, Andres Freund <andres@anarazel.de>
>wrote:
>>> At least with current gcc (6.3.1 on Fedora 25) at -O2,
>>> what I see is multiple places jumping to the same indirect jump
>>> instruction :-(.  It's not a total disaster: as best I can tell, all
>the
>>> uses of EEO_JUMP remain distinct.  But gcc has chosen to implement
>about
>>> 40 of the 71 uses of EEO_NEXT by jumping to the same couple of
>>> instructions that increment the "op" register and then do an
>indirect
>>> jump :-(.
>>
>> Yea, I see some of that too - "usually" when there's more than just
>the
>> jump in common.  I think there's some gcc variables that influence
>this
>> (min-crossjump-insns (5), max-goto-duplication-insns (8)).  Might be
>> worthwhile experimenting with setting them locally via a pragma or
>such.
>> I think Ants wanted to experiment with that, too.
>
>I haven't had the time to research this properly, but initial tests
>show that with GCC 6.2 adding
>
>#pragma GCC optimize ("no-crossjumping")
>
>fixes merging of the op tail jumps.
>
>Some quick and dirty benchmarking suggests that the benefit for the
>interpreter is about 15% (5% speedup on a workload that spends 1/3 in
>ExecInterpExpr). My idea of prefetching op->resnull/resvalue to local
>vars before the indirect jump is somewhere between a tiny benefit and
>no effect, certainly not worth introducing extra complexity. Clang 3.8
>does the correct thing out of the box and is a couple of percent
>faster than GCC with the pragma.

That's large enough to be worth doing (although I recall you seeing all
jumps commonalized).  We should probably do this on a per function basis
however (either using pragma push option, or function attributes).

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: WIP: Faster Expression Processing v4

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On March 25, 2017 4:56:11 PM PDT, Ants Aasma <ants.aasma@eesti.ee> wrote:
>> I haven't had the time to research this properly, but initial tests
>> show that with GCC 6.2 adding
>>
>> #pragma GCC optimize ("no-crossjumping")
>>
>> fixes merging of the op tail jumps.
>>
>> Some quick and dirty benchmarking suggests that the benefit for the
>> interpreter is about 15% (5% speedup on a workload that spends 1/3 in
>> ExecInterpExpr). My idea of prefetching op->resnull/resvalue to local
>> vars before the indirect jump is somewhere between a tiny benefit and
>> no effect, certainly not worth introducing extra complexity. Clang 3.8
>> does the correct thing out of the box and is a couple of percent
>> faster than GCC with the pragma.

> That's large enough to be worth doing (although I recall you seeing all
> jumps commonalized).  We should probably do this on a per function basis
> however (either using pragma push option, or function attributes).

Seems like it would be fine to do it on a per-file basis.  If you're
worried about pessimizing the out-of-line subroutines, we could move
those to a different file --- it's pretty questionable that they're
in execExprInterp.c in the first place, considering they're meant to be
used by more than just that execution method.
        regards, tom lane



Re: WIP: Faster Expression Processing v4

From
Andres Freund
Date:
On 2017-03-25 23:51:45 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On March 25, 2017 4:56:11 PM PDT, Ants Aasma <ants.aasma@eesti.ee> wrote:
> >> I haven't had the time to research this properly, but initial tests
> >> show that with GCC 6.2 adding
> >> 
> >> #pragma GCC optimize ("no-crossjumping")
> >> 
> >> fixes merging of the op tail jumps.
> >> 
> >> Some quick and dirty benchmarking suggests that the benefit for the
> >> interpreter is about 15% (5% speedup on a workload that spends 1/3 in
> >> ExecInterpExpr). My idea of prefetching op->resnull/resvalue to local
> >> vars before the indirect jump is somewhere between a tiny benefit and
> >> no effect, certainly not worth introducing extra complexity. Clang 3.8
> >> does the correct thing out of the box and is a couple of percent
> >> faster than GCC with the pragma.
> 
> > That's large enough to be worth doing (although I recall you seeing all
> > jumps commonalized).  We should probably do this on a per function basis
> > however (either using pragma push option, or function attributes).
> 
> Seems like it would be fine to do it on a per-file basis.

I personally find per-function annotation ala
__attribute__((optimize("no-crossjumping")))
cleaner anyway.  I tested that, and it seems to work.

Obviously we'd have to hide that behind a configure test.  Could also do
tests based on __GNUC__ / __GNUC_MINOR__, but that seems uglier.


> If you're
> worried about pessimizing the out-of-line subroutines, we could move
> those to a different file --- it's pretty questionable that they're
> in execExprInterp.c in the first place, considering they're meant to be
> used by more than just that execution method.

I indeed am, but having the code in the same file has a minor advantage:
It allows the compiler to partially inline them, if it feels like it
(e.g. moving null checks inline).

Greetings,

Andres Freund



Re: WIP: Faster Expression Processing v4

From
Andres Freund
Date:
On 2017-03-25 20:59:27 -0700, Andres Freund wrote:
> On 2017-03-25 23:51:45 -0400, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > On March 25, 2017 4:56:11 PM PDT, Ants Aasma <ants.aasma@eesti.ee> wrote:
> > >> I haven't had the time to research this properly, but initial tests
> > >> show that with GCC 6.2 adding
> > >> 
> > >> #pragma GCC optimize ("no-crossjumping")
> > >> 
> > >> fixes merging of the op tail jumps.
> > >> 
> > >> Some quick and dirty benchmarking suggests that the benefit for the
> > >> interpreter is about 15% (5% speedup on a workload that spends 1/3 in
> > >> ExecInterpExpr). My idea of prefetching op->resnull/resvalue to local
> > >> vars before the indirect jump is somewhere between a tiny benefit and
> > >> no effect, certainly not worth introducing extra complexity. Clang 3.8
> > >> does the correct thing out of the box and is a couple of percent
> > >> faster than GCC with the pragma.
> > 
> > > That's large enough to be worth doing (although I recall you seeing
> > > all jumps commonalized).  We should probably do this on a per function
> > > basis however (either using pragma push option, or function attributes).
> > 
> > Seems like it would be fine to do it on a per-file basis.
> 
> I personally find per-function annotation ala
> __attribute__((optimize("no-crossjumping")))
> cleaner anyway.  I tested that, and it seems to work.
> 
> Obviously we'd have to hide that behind a configure test.  Could also do
> tests based on __GNUC__ / __GNUC_MINOR__, but that seems uglier.

Checking for this isn't entirely pretty - see my attached attempt at
doing so.  I considered hiding
__attribute__((optimize("no-crossjumping"))) in execInterpExpr.c behind
a macro (like PG_DISABLE_CROSSJUMPING), but I don't really think that
makes things better.
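
For concreteness, a compilable sketch of roughly what that macro arrangement
would look like (stand-in types; HAVE_FUNCATTR_NO_CROSSJUMPING being the
configure-detected symbol):

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t Datum;
typedef struct ExprState ExprState;         /* stand-in, not the real one */
typedef struct ExprContext ExprContext;     /* stand-in, not the real one */

#ifdef HAVE_FUNCATTR_NO_CROSSJUMPING
#define PG_DISABLE_CROSSJUMPING __attribute__((optimize("no-crossjumping")))
#else
#define PG_DISABLE_CROSSJUMPING
#endif

/* the attribute has to precede the declarator in a definition (see below) */
static PG_DISABLE_CROSSJUMPING Datum
ExecInterpExprStub(ExprState *state, ExprContext *econtext, bool *isnull)
{
    (void) state;
    (void) econtext;
    *isnull = true;
    return (Datum) 0;
}

int
main(void)
{
    bool isnull;

    return (int) ExecInterpExprStub(NULL, NULL, &isnull);
}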

Comments?

Greetings,

Andres Freund

Re: WIP: Faster Expression Processing v4

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
>> I personally find per-function annotation ala
>> __attribute__((optimize("no-crossjumping")))
>> cleaner anyway.  I tested that, and it seems to work.
>> 
>> Obviously we'd have to hide that behind a configure test.  Could also do
>> tests based on __GNUC__ / __GNUC_MINOR__, but that seems uglier.

Agreed.

> Checking for this isn't entirely pretty - see my attached attempt at
> doing so.  I considered hiding
> __attribute__((optimize("no-crossjumping"))) in execInterpExpr.c behind
> a macro (like PG_DISABLE_CROSSJUMPING), but I don't really think that
> makes things better.

I think it would, primarily because if we find out that some other compiler
spells this differently, we could handle it totally within configure.

Isn't our practice to put __attribute__ at the end of a function
declaration or definition, not in the middle someplace?
        regards, tom lane



Re: WIP: Faster Expression Processing v4

From
Andres Freund
Date:
On 2017-03-27 09:33:43 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Checking for this isn't entirely pretty - see my attached attempt at
> > doing so.  I considered hiding
> > __attribute__((optimize("no-crossjumping"))) in execInterpExpr.c behind
> > a macro (like PG_DISABLE_CROSSJUMPING), but I don't really think that
> > makes things better.
> 
> I think it would, primarily because if we find out that some other compiler
> spells this differently, we could handle it totally within configure.

I'm unconvinced that we could sensibly map different compiler's options
on the same option name - I'd be surprised if they would have a similar
enough effect.


> Isn't our practice to put __attribute__ at the end of a function
> declaration or definition, not in the middle someplace?

We could move it to the declaration - which doesn't seem like an
improvement here, given that it's about optimization not API changes -
but in definitions you can't move it after the function name:
static Datum
ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
#ifdef HAVE_FUNCATTR_NO_CROSSJUMPING
__attribute__((optimize("no-crossjumping")))
#endif
/home/andres/src/postgresql/src/backend/executor/execExprInterp.c:279:1:
error: attributes should be specified before the declarator in a function
definition
 


Greetings,

Andres Freund



Re: WIP: Faster Expression Processing v4

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2017-03-27 09:33:43 -0400, Tom Lane wrote:
>> I think it would, primarily because if we find out that some other compiler
>> spells this differently, we could handle it totally within configure.

> I'm unconvinced that we could sensibly map different compiler's options
> on the same option name - I'd be surprised if they would have a similar
> enough effect.

[ shrug... ]  OK, next question is whether pgindent treats this
layout sanely.
        regards, tom lane



Re: WIP: Faster Expression Processing v4

From
Andres Freund
Date:
On 2017-03-27 11:22:40 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-03-27 09:33:43 -0400, Tom Lane wrote:
> >> I think it would, primarily because if we find out that some other compiler
> >> spells this differently, we could handle it totally within configure.
> 
> > I'm unconvinced that we could sensibly map different compiler's options
> > on the same option name - I'd be surprised if they would have a similar
> > enough effect.
> 
> [ shrug... ]  OK, next question is whether pgindent treats this
> layout sanely.

The patch was run through pgindent, i.e. it seems to be ok with it.

- Andres



Re: WIP: Faster Expression Processing v4

From
Tom Lane
Date:
As to the point of whether it actually helps or not ...

on gcc 6.3.1 (Fedora 25), yes it does.  Seems to be one "jmp *something"
per EEO_NEXT or EEO_JUMP.

on gcc 4.4.7 (RHEL 6), it makes things *WORSE*.  We go from about half of
the dispatches getting routed through a common location, to *all* of them
(except one; for some odd reason the first EEO_NEXT in EEOP_NULLIF
survives as a separate jump).  This seems like a bug, but there it is.

So this means we'd need some serious research to decide whether to apply
it.  And I'm suspecting we'd end up with a compiler version test.
        regards, tom lane



Re: WIP: Faster Expression Processing v4

From
Tom Lane
Date:
I wrote:
> As to the point of whether it actually helps or not ...
> on gcc 4.4.7 (RHEL 6), it makes things *WORSE*.  We go from about half of
> the dispatches getting routed through a common location, to *all* of them
> (except one; for some odd reason the first EEO_NEXT in EEOP_NULLIF
> survives as a separate jump).  This seems like a bug, but there it is.

So after a bit of googling, this is a very longstanding complaint:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39284
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71785

(hm, think I know the second submitter)

I'm not sure we should be relying on a gcc bug fix that's barely four
months old.  Even assuming it fixes the issue without new regressions,
most people are not going to have it anytime soon.

My feeling at this point is that we might be better off disabling
the computed-goto case by default.  At the very least, we're going
to need a version check that restricts it to latest gcc.
        regards, tom lane



Re: WIP: Faster Expression Processing v4

From
Andres Freund
Date:
On 2017-03-27 11:52:05 -0400, Tom Lane wrote:
> As to the point of whether it actually helps or not ...
> 
> on gcc 6.3.1 (Fedora 25), yes it does.  Seems to be one "jmp *something"
> per EEO_NEXT or EEO_JUMP.
> 
> on gcc 4.4.7 (RHEL 6), it makes things *WORSE*.  We go from about half of
> the dispatches getting routed through a common location, to *all* of them
> (except one; for some odd reason the first EEO_NEXT in EEOP_NULLIF
> survives as a separate jump).  This seems like a bug, but there it is.

Gah :(.  I have gcc 5, 6, and 7 here, but nothing earlier.  I'm now inclined
to go the version-check route, for the individual versions we can (and
want to) confirm this on...  I'm not too concerned about not doing so on
gcc 4.4 or older...


> So this means we'd need some serious research to decide whether to apply
> it.  And I'm suspecting we'd end up with a compiler version test.

Indeed.

- Andres



Re: WIP: Faster Expression Processing v4

From
Andres Freund
Date:
On 2017-03-27 12:18:37 -0400, Tom Lane wrote:
> I wrote:
> > As to the point of whether it actually helps or not ...
> > on gcc 4.4.7 (RHEL 6), it makes things *WORSE*.  We go from about half of
> > the dispatches getting routed through a common location, to *all* of them
> > (except one; for some odd reason the first EEO_NEXT in EEOP_NULLIF
> > survives as a separate jump).  This seems like a bug, but there it is.
> 
> So after a bit of googling, this is a very longstanding complaint:
> 
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39284
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71785
> 
> (hm, think I know the second submitter)

I don't think that's precisely the same issue - as long as some of the
goto branches survive, we're not hitting the full brunt of the compgoto
thing.  I think we're essentially avoiding some of that because we're
"precomputing" the dispatch_table lookup.

To count the number of jumps I used:
gdb  -batch -ex 'disassemble/s ExecInterpExpr' execExprInterp.o|grep jmpq|grep -v ExecInterpExpr|wc -l
which'll only include indirect jumps (since otherwise jumps will look
like "jmpq   0x35f2 <ExecInterpExpr+2066>")
                    standard flags    -fno-crossjumping
gcc-5 (5.4.1-8):    34                82
gcc-6 (6.3.0-8):    34                82
gcc-7 (7.0.1):      71                108
gcc-snapshot:       72                108

So that doesn't look too bad.


> My feeling at this point is that we might be better off disabling
> the computed-goto case by default.  At the very least, we're going
> to need a version check that restricts it to latest gcc.

In my measurements it's still faster in at least gcc-5/6, even without
the option (largely because it avoids array bounds checks on the jump
table built for the switch).

Greetings,

Andres Freund



Re: WIP: Faster Expression Processing v4

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2017-03-27 12:18:37 -0400, Tom Lane wrote:
>> My feeling at this point is that we might be better off disabling
>> the computed-goto case by default.  At the very least, we're going
>> to need a version check that restricts it to latest gcc.

> In my measurements it's still faster in at least gcc-5/6, even without
> the option (largely because it avoids array bounds checks on the jump
> table built for the switch).

Hm.  What test cases are you using?
        regards, tom lane



Re: WIP: Faster Expression Processing v4

From
Andres Freund
Date:
On 2017-03-27 13:30:11 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2017-03-27 12:18:37 -0400, Tom Lane wrote:
> >> My feeling at this point is that we might be better off disabling
> >> the computed-goto case by default.  At the very least, we're going
> >> to need a version check that restricts it to latest gcc.
> 
> > In my measurements it's still faster in at least gcc-5/6, even without
> > the option (largely because it avoids array bounds checks on the jump
> > table built for the switch).
> 
> Hm.  What test cases are you using?

I used tpc-h - seems like a realistic enough testcase.

Andres