Re: pgsql: Add parallel-aware hash joins. - Mailing list pgsql-committers

From Thomas Munro
Subject Re: pgsql: Add parallel-aware hash joins.
Date
Msg-id CAEepm=0WxwzpHVHt3PcWHBV=L3k3FDb6dvMq1A2Li49LGBa7TA@mail.gmail.com
In response to Re: pgsql: Add parallel-aware hash joins.  (Thomas Munro <thomas.munro@enterprisedb.com>)
Responses Re: pgsql: Add parallel-aware hash joins.  (Andres Freund <andres@anarazel.de>)
List pgsql-committers
On Fri, Dec 22, 2017 at 1:48 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> I don't think that's quite it, because it should never have set
> 'writing' for any batch number >= nbatch.
>
> It's late here, but I'll take this up tomorrow and either find a fix
> or figure out how to avoid antisocial noise levels on the build farm
> in the meantime.

Not there yet but I learned some things and am still working on it.  I
spent a lot of time trying to reproduce the assertion failure, and
succeeded exactly once.  Unfortunately the one time I managed to do
that I'd built with clang -O2 and got a core file that I couldn't get
much useful info out of, and I've been trying to do it again with -O0
ever since without luck.  The time I succeeded, I reproduced it by
creating the tables "simple" and "bigger_than_it_looks" from join.sql
and then doing this in a loop:

  set min_parallel_table_scan_size = 0;
  set parallel_setup_cost = 0;
  set work_mem = '192kB';

  explain analyze select count(*) from simple r join
    bigger_than_it_looks s using (id);
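
For anyone wanting to try the same thing, here is a minimal sketch of one
way to drive such a loop from Python through psql.  The message does not
say how the loop was actually run, so the function names, the iteration
count and the database name are assumptions made for illustration; only
the three settings and the query come from the text above.

```python
import subprocess

# Settings and query copied from the message above.
SETTINGS = [
    "set min_parallel_table_scan_size = 0;",
    "set parallel_setup_cost = 0;",
    "set work_mem = '192kB';",
]
QUERY = (
    "explain analyze select count(*) from simple r join "
    "bigger_than_it_looks s using (id);"
)


def build_script(iterations):
    """Return one SQL script: the settings once, then the query repeated."""
    return "\n".join(SETTINGS + [QUERY] * iterations) + "\n"


def run_loop(iterations, dbname="postgres"):
    """Feed the script to a single psql session and return its exit code.

    dbname is a hypothetical default.  ON_ERROR_STOP makes psql give up
    at the first failure, for example when the backend goes away.
    """
    result = subprocess.run(
        ["psql", "-X", "-v", "ON_ERROR_STOP=1", "-d", dbname],
        input=build_script(iterations),
        text=True,
        capture_output=True,
    )
    return result.returncode
```

Calling run_loop(1000) would run the query a thousand times in one
session; a nonzero return value means psql stopped early, which is the
point at which to go looking for a core file.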

The machine that it happened on is resource-constrained and exhibits
another problem: though the above query normally runs in ~20ms,
sometimes it takes several seconds and occasionally much longer.  That
never happens on fast development systems or test servers, which run it
quickly every time, and it doesn't happen on my slow 2-core system if
I have only two workers (or one worker + leader).  I dug into that and
figured out what was going wrong and wrote that up separately[1],
because I think it's an independent bug needing to be fixed, not the
root cause here.  However, I think it could easily be contributing to
the timing required to trigger the bug we're looking for.

Andres, your machine francolin crashed -- got a core file?

[1] https://www.postgresql.org/message-id/CAEepm%3D0NWKehYw7NDoUSf8juuKOPRnCyY3vuaSvhrEWsOTAa3w%40mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com

