Re: asynchronous and vectorized execution - Mailing list pgsql-hackers
From: Andres Freund
Subject: Re: asynchronous and vectorized execution
Msg-id: 20160511161928.qyaqao4hu3t6ztiu@alap3.anarazel.de
In response to: Re: asynchronous and vectorized execution (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On 2016-05-11 10:32:20 -0400, Robert Haas wrote:
> On Tue, May 10, 2016 at 8:50 PM, Andres Freund <andres@anarazel.de> wrote:
> > That seems to suggest that we need to restructure how we get to calling
> > fmgr functions, before worrying about the actual fmgr call.
>
> Any ideas on how to do that?  ExecMakeFunctionResultNoSets() isn't
> really doing a heck of a lot.  Changing FuncExprState to use an array
> rather than a linked list to store its arguments might help some.  We
> could also consider having an optimized path that skips the fn_strict
> stuff if we can somehow deduce that no NULLs can occur in this
> context, but that's a lot of work and new infrastructure.  I feel like
> maybe there's something higher-level we could do that would help more,
> but I don't know what it is.

I think it's not just ExecMakeFunctionResultNoSets, it's the whole
call-stack which needs to be optimized together. E.g. look at a few
performance metrics for a simple seqscan query with a bunch of ORed
equality constraints:

SELECT count(*)
FROM pgbench_accounts
WHERE abalance = -1 OR abalance = -2 OR abalance = -3 OR abalance = -4
   OR abalance = -5 OR abalance = -6 OR abalance = -7 OR abalance = -8
   OR abalance = -9 OR abalance = -10;

perf record -g -p 27286 -F 5000 \
  -e cycles:ppp,branch-misses,L1-icache-load-misses,iTLB-load-misses,L1-dcache-load-misses,dTLB-load-misses,LLC-load-misses \
  sleep 3

6K cycles:ppp
6K branch-misses
1K L1-icache-load-misses
472 iTLB-load-misses
5K L1-dcache-load-misses
6K dTLB-load-misses
6K LLC-load-misses

You can see that a number of events sample at a high rate, especially
when you take the cycle samples into account.

cycles:
+   32.35%  postgres  postgres           [.] ExecMakeFunctionResultNoSets
+   14.51%  postgres  postgres           [.] slot_getattr
+    5.50%  postgres  postgres           [.] ExecEvalOr
+    5.22%  postgres  postgres           [.] check_stack_depth

branch-misses:
+   73.77%  postgres  postgres           [.] ExecQual
+   17.83%  postgres  postgres           [.] ExecEvalOr
+    1.49%  postgres  postgres           [.] heap_getnext

L1-icache-load-misses:
+    4.71%  postgres  [kernel.kallsyms]  [k] update_curr
+    4.37%  postgres  postgres           [.] hash_search_with_hash_value
+    3.91%  postgres  postgres           [.] heap_getnext
+    3.81%  postgres  [kernel.kallsyms]  [k] task_tick_fair

iTLB-load-misses:
+   27.57%  postgres  postgres           [.] LWLockAcquire
+   18.32%  postgres  postgres           [.] hash_search_with_hash_value
+    7.09%  postgres  postgres           [.] ExecMakeFunctionResultNoSets
+    3.06%  postgres  postgres           [.] ExecEvalConst

L1-dcache-load-misses:
+   20.35%  postgres  postgres           [.] ExecMakeFunctionResultNoSets
+   12.31%  postgres  postgres           [.] check_stack_depth
+    8.84%  postgres  postgres           [.] heap_getnext
+    8.00%  postgres  postgres           [.] slot_deform_tuple
+    7.15%  postgres  postgres           [.] HeapTupleSatisfiesMVCC

dTLB-load-misses:
+   50.13%  postgres  postgres           [.] ExecQual
+   41.36%  postgres  postgres           [.] ExecEvalOr
+    2.96%  postgres  postgres           [.] hash_search_with_hash_value
+    1.30%  postgres  postgres           [.] PinBuffer.isra.3
+    1.19%  postgres  postgres           [.] heap_page_prune_opt

LLC-load-misses:
+   24.25%  postgres  postgres           [.] slot_deform_tuple
+   17.45%  postgres  postgres           [.] CheckForSerializableConflictOut
+   10.52%  postgres  postgres           [.] heapgetpage
+    9.55%  postgres  postgres           [.] HeapTupleSatisfiesMVCC
+    7.52%  postgres  postgres           [.] ExecMakeFunctionResultNoSets

For this workload we expect a lot of LLC-load-misses, as the working set
is a lot bigger than memory, and it makes sense that they're in
slot_deform_tuple(), heapgetpage(), HeapTupleSatisfiesMVCC() (but, uh,
CheckForSerializableConflictOut?). One avenue to optimize is to make
those accesses easier to predict/prefetch, which they're atm likely not.

But leaving that aside, we can see that a lot of the cost is distributed
over ExecQual, ExecEvalOr and ExecMakeFunctionResultNoSets - all of
which judiciously use linked lists. I suspect that by simplifying these
functions / data structures *AND* by calling them over a batch of
tuples, instead of one-by-one, we'd limit the time spent in them
considerably.

Greetings,

Andres Freund
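[Editor's note: to illustrate the array-vs-linked-list point raised above, here is a minimal C sketch. The types and function names below are hypothetical stand-ins, not actual PostgreSQL structures; the point is only that walking a cons-cell list costs a pointer chase (and a likely cache miss) per argument, while an array walk reads contiguous memory.]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cons-cell argument list, loosely analogous to walking
 * FuncExprState's argument List; not actual PostgreSQL code. */
typedef struct ArgCell
{
    int             value;      /* stands in for an evaluated Datum */
    struct ArgCell *next;
} ArgCell;

/* List walk: one pointer chase per argument; cells may be scattered
 * across the heap, so each step risks a cache miss. */
static int
sum_args_list(const ArgCell *head)
{
    int sum = 0;

    for (const ArgCell *c = head; c != NULL; c = c->next)
        sum += c->value;
    return sum;
}

/* Array walk: contiguous reads with a known trip count, which the
 * hardware prefetcher and compiler handle well. */
static int
sum_args_array(const int *args, int nargs)
{
    int sum = 0;

    for (int i = 0; i < nargs; i++)
        sum += args[i];
    return sum;
}
```

Both walks compute the same result; the difference is purely in memory-access behavior, which is what the dcache/dTLB profiles above are pointing at.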
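[Editor's note: the batching idea in the closing paragraph can be sketched as follows. All names and types here are illustrative assumptions, not the eventual PostgreSQL implementation; the sketch only shows how one call per batch amortizes per-tuple call overhead (dispatch, stack-depth checks) that a per-tuple evaluator pays on every row.]

```c
#include <assert.h>

/* Hypothetical minimal tuple: just the column the qual touches. */
typedef struct
{
    int abalance;
} Row;

/* Stands in for one trip through the expression evaluator per tuple
 * (ExecQual-style): the call overhead is paid for every row. */
static int
qual_one(const Row *row)
{
    return row->abalance >= -10 && row->abalance <= -1;
}

static int
count_per_tuple(const Row *rows, int nrows)
{
    int matched = 0;

    for (int i = 0; i < nrows; i++)
        matched += qual_one(&rows[i]);
    return matched;
}

/* Batched variant: one call per batch, qual body becomes a tight loop
 * the compiler can keep in registers (and potentially vectorize). */
static int
count_batched(const Row *rows, int nrows)
{
    int matched = 0;

    for (int i = 0; i < nrows; i++)
        matched += (rows[i].abalance >= -10 && rows[i].abalance <= -1);
    return matched;
}
```

Both variants return the same count; the batched one simply moves the function-call boundary from per-tuple to per-batch, which is the restructuring the mail argues for.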