infinite loop in parallel hash joins / DSA / get_best_segment - Mailing list pgsql-hackers

From Tomas Vondra
Subject infinite loop in parallel hash joins / DSA / get_best_segment
Date
Msg-id 194c0706-c65b-7d81-ab32-2c248c3e2344@2ndquadrant.com
List pgsql-hackers
Hi,

While performing some benchmarks on REL_11_STABLE (at 444455c2d9), I've
repeatedly hit an apparent infinite loop on TPC-H query 4. I don't know
exactly what the triggering conditions are, but the symptoms are these:

1) A parallel worker process is consuming 100% CPU, with perf
reporting a profile like this:

    34.66%  postgres          [.] get_segment_by_index
    29.44%  postgres          [.] get_best_segment
    29.22%  postgres          [.] unlink_segment.isra.2
     6.66%  postgres          [.] fls
     0.02%  [unknown]         [k] 0xffffffffb10014b0

So all the time seems to be spent within get_best_segment.

2) The backtrace looks like this (full backtrace attached):

    #0  0x0000561a748c4f89 in get_segment_by_index
    #1  0x0000561a748c5653 in get_best_segment
    #2  0x0000561a748c67a9 in dsa_allocate_extended
    #3  0x0000561a7466ddb4 in ExecParallelHashTupleAlloc
    #4  0x0000561a7466e00a in ExecParallelHashTableInsertCurrentBatch
    #5  0x0000561a7466fe00 in ExecParallelHashJoinNewBatch
    #6  ExecHashJoinImpl
    #7  ExecParallelHashJoin
    #8  ExecProcNode
    ...

3) The infinite loop seems to be pretty obvious - after setting a
breakpoint on get_segment_by_index we get this:

Breakpoint 1, get_segment_by_index (area=0x560c03626e58, index=3) ...
(gdb) c
Continuing.

Breakpoint 1, get_segment_by_index (area=0x560c03626e58, index=3) ...
(gdb) c
Continuing.

Breakpoint 1, get_segment_by_index (area=0x560c03626e58, index=3) ...
(gdb) c
Continuing.

That is, we call the function with the same index over and over.

Why is that? Well:

(gdb) print *area->segment_maps[3].header
$1 = {magic = 216163851, usable_pages = 512, size = 2105344, prev = 3,
next = 3, bin = 0, freed = false}

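So segment 3's prev and next both point back at segment 3 itself, and
following the next link never leaves that segment. We loop forever.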
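
To make the failure mode concrete, here is a minimal sketch of why a
header with next = 3 stored in segment 3 can never be walked to the end.
This is a toy model, not the code in dsa.c: toy_segment, SEG_NONE and
toy_bin_walk_terminates are hypothetical names, and the only assumption
carried over from the real code is that the segments in a bin are
chained through prev/next indexes and that a walk stops at an "end of
list" marker. The walk is bounded by the number of segments, so it
reports a cycle instead of spinning.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical "end of list" marker for this toy model. */
#define SEG_NONE ((size_t) -1)

/* Hypothetical, stripped-down stand-in for a segment header. */
typedef struct
{
    size_t prev;    /* index of previous segment in the bin, or SEG_NONE */
    size_t next;    /* index of next segment in the bin, or SEG_NONE */
} toy_segment;

/*
 * Follow 'next' links starting at 'start'.
 *
 * An acyclic list over nsegs segments visits at most nsegs of them, so
 * if we are about to visit one more than that, the list must contain a
 * cycle. Returns true if the walk reaches SEG_NONE, false on a cycle.
 */
static bool
toy_bin_walk_terminates(const toy_segment *segs, size_t nsegs, size_t start)
{
    size_t idx = start;
    size_t visited = 0;

    while (idx != SEG_NONE)
    {
        assert(idx < nsegs);        /* links must stay inside the array */
        if (visited++ >= nsegs)
            return false;           /* more visits than segments: cycle */
        idx = segs[idx].next;
    }
    return true;
}
```

With four segments where 0 -> 1 -> end is a healthy list and segment 3
has prev = next = 3 (as in the gdb output above), a walk from segment 0
terminates while a walk from segment 3 is reported as a cycle. An
unbounded version of the same loop would simply keep asking for index 3.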

I don't know exactly what the triggering conditions are here. I've only
ever observed the issue on TPC-H at scale 16GB, with a partitioned
lineitem table, work_mem set to 8MB, and query #4. It seems I can
reproduce it pretty reliably.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

