Re: BUG #18334: Segfault when running a query with parallel workers - Mailing list pgsql-bugs

From Thomas Munro
Subject Re: BUG #18334: Segfault when running a query with parallel workers
Msg-id CA+hUKG+7KA6wQGx4yFBNj5KaTooErV2Ov1+m_ers4DVZWJ_mKg@mail.gmail.com
In response to Re: BUG #18334: Segfault when running a query with parallel workers  (Marcin Barczyński <mba.ogolny@gmail.com>)
Responses Re: BUG #18334: Segfault when running a query with parallel workers
List pgsql-bugs
On Thu, May 23, 2024 at 11:59 PM Marcin Barczyński <mba.ogolny@gmail.com> wrote:
> (gdb) print *segment_map
> $4 = {segment = 0x56134dfa2dd8, mapped_address = 0x7f309faf4000 "",
> header = 0x7f309faf4000, fpm = 0x7f309faf4038, pagemap =
> 0x7f309faf4480}
>
> (gdb) print pageno
> $5 = 196979

Hmm.  Page 196979 is an offset of around 769MB within the segment
(pages here are 4k).  What does segment_map->segment->mapped_size
show?  It's OK for the pagemap to contain zeroes, but it should
contain non-zero values for pages that contain the start of an
allocated object.  The actual dsa_pointer has been optimised out but
should be visible from frame #1 as batch->chunks.  I think its upper
24 bits should contain 13 (the element of area->segment_maps that
seems to correspond to the above), and its lower 40 bits should
contain that offset of ~769MB.
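
To make the arithmetic above concrete, here is a minimal sketch that
converts the page number to a byte offset and packs/unpacks a 64-bit
pointer with a 24-bit segment index over a 40-bit offset.  The helper
names are hypothetical and only mirror the layout described in this
message; this is not PostgreSQL's own code.

```python
# Assumptions (taken from the message, not from the PostgreSQL source):
# 4 kB pages, and a 64-bit pointer whose upper 24 bits hold the segment
# index and whose lower 40 bits hold the byte offset in that segment.
PAGE_SIZE = 4096
OFFSET_WIDTH = 40
OFFSET_MASK = (1 << OFFSET_WIDTH) - 1


def page_to_offset(pageno):
    """Byte offset of the first byte of page `pageno`."""
    return pageno * PAGE_SIZE


def make_pointer(segment_index, offset):
    """Pack a segment index and offset into one 64-bit value."""
    return (segment_index << OFFSET_WIDTH) | offset


def split_pointer(ptr):
    """Return (segment_index, offset) for a packed pointer."""
    return ptr >> OFFSET_WIDTH, ptr & OFFSET_MASK


offset = page_to_offset(196979)
print(offset, round(offset / (1024 * 1024), 1))   # bytes, and MB

ptr = make_pointer(13, offset)
segment_index, decoded_offset = split_pointer(ptr)
print(hex(ptr), segment_index, decoded_offset // PAGE_SIZE)
```

Applying split_pointer() to the value of batch->chunks from frame #1
would show whether it really names segment 13 and page 196979.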

The things that are unusually high so far in your emails are worker
count and work_mem, which allow it to build quite large hash tables,
in your case up to 13GB.  Perhaps there is a silly arithmetic/type
problem around large numbers somewhere (perhaps somewhere near 4GB+
segments, though IIRC I don't expect segment #13 to be very large).
But then I think that would fail more often...  It seems to be
rare/intermittent, and yet you don't have any batching or re-bucketing
in your problem (nbatch and nbuckets have their original values), so a
lot of the more complex parts of the PHJ code are not in play here.
Hmm.
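
For illustration only, this is the general class of arithmetic/type
problem meant above: a size that passes through a 32-bit unsigned
variable silently wraps once it reaches 4GB.  The sketch simulates
that with a mask; it makes no claim that such a bug exists anywhere in
PostgreSQL, and the function name is hypothetical.

```python
UINT32_MASK = 0xFFFFFFFF
GIB = 1 << 30


def truncate_to_uint32(value):
    """Simulate storing a size in a 32-bit unsigned integer."""
    return value & UINT32_MASK


# A 13GB total wraps around and no longer matches the real size.
hash_table_bytes = 13 * GIB
print(truncate_to_uint32(hash_table_bytes) // GIB)

# The ~769MB offset of page 196979 is below 4GB, so it would survive
# such a truncation unchanged.
page_offset = 196979 * 4096
print(truncate_to_uint32(page_offset) == page_offset)
```

Note that the offset under discussion fits in 32 bits, so a
truncation of this kind could not by itself have altered it.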

I wondered if the tricky edge case where a segment gets unmapped and
then remapped in the same slot could be leading to segment
confusion.  That does involve a bit of memory order footwork.  What
CPU architecture is this?  But alas I can't come up with any case
where that could go wrong even if there is an unknown bug in that
area, because the no-rebatching, no-rebucketing case doesn't free
anything until the end when it frees everything (i.e. it never frees
something and then allocates, a requirement for slot re-use).


