Re: asynchronous and vectorized execution - Mailing list pgsql-hackers

From Andres Freund
Subject Re: asynchronous and vectorized execution
Date
Msg-id 20160511161928.qyaqao4hu3t6ztiu@alap3.anarazel.de
In response to Re: asynchronous and vectorized execution  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On 2016-05-11 10:32:20 -0400, Robert Haas wrote:
> On Tue, May 10, 2016 at 8:50 PM, Andres Freund <andres@anarazel.de> wrote:
> > That seems to suggest that we need to restructure how we get to calling
> > fmgr functions, before worrying about the actual fmgr call.
> 
> Any ideas on how to do that?  ExecMakeFunctionResultNoSets() isn't
> really doing a heck of a lot.  Changing FuncExprState to use an array
> rather than a linked list to store its arguments might help some.   We
> could also consider having an optimized path that skips the fn_strict
> stuff if we can somehow deduce that no NULLs can occur in this
> context, but that's a lot of work and new infrastructure.  I feel like
> maybe there's something higher-level we could do that would help more,
> but I don't know what it is.

I think it's not just ExecMakeFunctionResultNoSets, it's the whole
call-stack which needs to be optimized together.

E.g. look at a few performance metrics for a simple seqscan query with a
bunch of ORed equality constraints:
SELECT count(*) FROM pgbench_accounts
WHERE abalance = -1 OR abalance = -2 OR abalance = -3 OR abalance = -4
   OR abalance = -5 OR abalance = -6 OR abalance = -7 OR abalance = -8
   OR abalance = -9 OR abalance = -10;
 

perf record -g -p 27286 -F 5000 \
    -e cycles:ppp,branch-misses,L1-icache-load-misses,iTLB-load-misses,L1-dcache-load-misses,dTLB-load-misses,LLC-load-misses \
    sleep 3
 
6K cycles:ppp
6K branch-misses
1K L1-icache-load-misses
472 iTLB-load-misses
5K L1-dcache-load-misses
6K dTLB-load-misses
6K LLC-load-misses

You can see that a number of events sample at a high rate, especially
when you take the cycle samples into account.

cycles:
+   32.35%  postgres  postgres           [.] ExecMakeFunctionResultNoSets
+   14.51%  postgres  postgres           [.] slot_getattr
+    5.50%  postgres  postgres           [.] ExecEvalOr
+    5.22%  postgres  postgres           [.] check_stack_depth

branch-misses:
+   73.77%  postgres  postgres           [.] ExecQual
+   17.83%  postgres  postgres           [.] ExecEvalOr
+    1.49%  postgres  postgres           [.] heap_getnext

L1-icache-load-misses:
+    4.71%  postgres  [kernel.kallsyms]  [k] update_curr
+    4.37%  postgres  postgres           [.] hash_search_with_hash_value
+    3.91%  postgres  postgres           [.] heap_getnext
+    3.81%  postgres  [kernel.kallsyms]  [k] task_tick_fair

iTLB-load-misses:
+   27.57%  postgres  postgres           [.] LWLockAcquire
+   18.32%  postgres  postgres           [.] hash_search_with_hash_value
+    7.09%  postgres  postgres           [.] ExecMakeFunctionResultNoSets
+    3.06%  postgres  postgres           [.] ExecEvalConst

L1-dcache-load-misses:
+   20.35%  postgres  postgres           [.] ExecMakeFunctionResultNoSets
+   12.31%  postgres  postgres           [.] check_stack_depth
+    8.84%  postgres  postgres           [.] heap_getnext
+    8.00%  postgres  postgres           [.] slot_deform_tuple
+    7.15%  postgres  postgres           [.] HeapTupleSatisfiesMVCC

dTLB-load-misses:
+   50.13%  postgres  postgres           [.] ExecQual
+   41.36%  postgres  postgres           [.] ExecEvalOr
+    2.96%  postgres  postgres           [.] hash_search_with_hash_value
+    1.30%  postgres  postgres           [.] PinBuffer.isra.3
+    1.19%  postgres  postgres           [.] heap_page_prune_op

LLC-load-misses:
+   24.25%  postgres  postgres           [.] slot_deform_tuple
+   17.45%  postgres  postgres           [.] CheckForSerializableConflictOut
+   10.52%  postgres  postgres           [.] heapgetpage
+    9.55%  postgres  postgres           [.] HeapTupleSatisfiesMVCC
+    7.52%  postgres  postgres           [.] ExecMakeFunctionResultNoSets


For this workload, we expect a lot of LLC-load-misses as the workload is
a lot bigger than memory, and it makes sense that they're in
slot_deform_tuple(), heapgetpage(), HeapTupleSatisfiesMVCC() (but uh,
CheckForSerializableConflictOut?).  One avenue to optimize is to make
those accesses easier to predict/prefetch, which atm they likely aren't.
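
As a rough illustration of the prefetching idea above (not PostgreSQL
code): when the addresses that will be touched next are known a few
iterations ahead, a software prefetch hint can be issued early. The
sketch below assumes GCC or Clang, whose __builtin_prefetch() it uses;
sum_with_prefetch() and PREFETCH_DISTANCE are hypothetical names made up
for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuning knob: how many items ahead to prefetch. */
#define PREFETCH_DISTANCE 8

/*
 * Sum values reached through an array of pointers.  The pointed-to
 * locations may be scattered in memory, so we hint the CPU about the
 * item we will need PREFETCH_DISTANCE iterations from now.
 */
static int64_t
sum_with_prefetch(int32_t *const *items, size_t nitems)
{
    int64_t sum = 0;

    for (size_t i = 0; i < nitems; i++)
    {
        if (i + PREFETCH_DISTANCE < nitems)
            __builtin_prefetch(items[i + PREFETCH_DISTANCE], 0 /* read */, 1);

        sum += *items[i];
    }
    return sum;
}
```

Whether such a hint helps depends on the access pattern and the
hardware; it only has a chance of working if the future addresses are
known early enough.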

But leaving that aside, we can see that a lot of the cost is distributed
over ExecQual, ExecEvalOr, ExecMakeFunctionResultNoSets - all of which
make liberal use of linked lists.  I suspect that by simplifying these
functions / datastructures *AND* by calling them over a batch of tuples,
instead of one-by-one, we'd reduce the time spent in them considerably.
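
To make the contrast concrete, here is a minimal standalone sketch. It
is not executor code: QualNode, qual_matches_list() and
count_matches_batch() are hypothetical stand-ins. The first function
walks a linked list of equality clauses once per tuple; the second keeps
the constants in a flat array and handles a whole batch of values per
call.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a linked list of "col = const" clauses. */
typedef struct QualNode
{
    int32_t          constval;
    struct QualNode *next;
} QualNode;

/* Tuple-at-a-time: chase the clause list once for every single value. */
static bool
qual_matches_list(const QualNode *quals, int32_t value)
{
    for (const QualNode *q = quals; q != NULL; q = q->next)
    {
        if (q->constval == value)
            return true;
    }
    return false;
}

/*
 * Batch-at-a-time: the clauses are flattened into an array and one call
 * evaluates the OR over a whole batch of values, so call overhead is
 * paid per batch and the inner loop has no pointer chasing.
 */
static size_t
count_matches_batch(const int32_t *consts, size_t nconsts,
                    const int32_t *values, size_t nvalues)
{
    size_t matches = 0;

    for (size_t i = 0; i < nvalues; i++)
    {
        bool hit = false;

        for (size_t c = 0; c < nconsts; c++)
            hit |= (values[i] == consts[c]);

        matches += hit ? 1 : 0;
    }
    return matches;
}
```

The batch variant deliberately does not short-circuit, which keeps the
inner loop branch-free; whether that trade-off pays off in the real
executor would need measuring.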

Greetings,

Andres Freund


