Hi,
While performing some benchmarks on REL_11_STABLE (at 444455c2d9), I've
repeatedly hit an apparent infinite loop on TPC-H query 4. I don't know
what exactly are the triggering conditions, but the symptoms are these:
1) A parallel worker" process is consuming 100% CPU, with per for
reporting profile like this:
34.66% postgres [.] get_segment_by_index
29.44% postgres [.] get_best_segment
29.22% postgres [.] unlink_segment.isra.2
6.66% postgres [.] fls
0.02% [unknown] [k] 0xffffffffb10014b0
So all the time seems to be spent within get_best_segment.
2) The backtrace looks like this (full backtrace attached):
#0 0x0000561a748c4f89 in get_segment_by_index
#1 0x0000561a748c5653 in get_best_segment
#2 0x0000561a748c67a9 in dsa_allocate_extended
#3 0x0000561a7466ddb4 in ExecParallelHashTupleAlloc
#4 0x0000561a7466e00a in ExecParallelHashTableInsertCurrentBatch
#5 0x0000561a7466fe00 in ExecParallelHashJoinNewBatch
#6 ExecHashJoinImpl
#7 ExecParallelHashJoin
#8 ExecProcNode
...
3) The infinite loop seems to be pretty obvious - after setting
breakpoint on get_segment_by_index we get this:
Breakpoint 1, get_segment_by_index (area=0x560c03626e58, index=3) ...
(gdb) c
Continuing.
Breakpoint 1, get_segment_by_index (area=0x560c03626e58, index=3) ...
(gdb) c
Continuing.
Breakpoint 1, get_segment_by_index (area=0x560c03626e58, index=3) ...
(gdb) c
Continuing.
That is, we call the function with the same index over and over.
Why is that? Well:
(gdb) print *area->segment_maps[3].header
$1 = {magic = 216163851, usable_pages = 512, size = 2105344, prev = 3,
next = 3, bin = 0, freed = false}
So, we loop forever.
I don't know what exactly are the triggering conditions here. I've only
ever observed the issue on TPC-H with scale 16GB, partitioned lineitem
table and work_mem set to 8MB and query #4. And it seems I can reproduce
it pretty reliably.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services