Re: pgsql: Add parallel-aware hash joins. - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: pgsql: Add parallel-aware hash joins.
Msg-id: CA+TgmoYn8avuxg=dS8mbppjLn0X7AXMduU+dopeN73eZmP2u6w@mail.gmail.com
In response to: Re: pgsql: Add parallel-aware hash joins. (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: pgsql: Add parallel-aware hash joins.
List: pgsql-hackers
On Wed, Jan 24, 2018 at 2:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I find that to be a completely bogus straw-man argument.  The point of
> looking at the prairiedog time series is just to see a data series in
> which the noise level is small enough to discern the signal.  If anyone's
> got years worth of data off a more modern machine, and they can extract
> a signal from that, by all means let's consider that data instead.  But
> there's no clear argument (or at least you have not made one) that says
> that prairiedog's relative timings don't match what we'd get on more
> modern machines.

There is no need to collect years of data in order to tell whether or
not the time to run the tests has increased by as much on developer
machines as it has on prairiedog.  You showed the time going from 3:36
to 8:09 between 2014 and the present.  That is a 2.26x increase.  It
is obvious from the numbers I posted before that no such increase has
taken place in the time it takes to run 'make check' on my relatively
modern laptop.  Whatever difference exists is measured in
milliseconds.
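(To spell the arithmetic out: 3:36 is 216 seconds, 8:09 is 489
seconds, and 489 / 216 ≈ 2.26.)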

> so join has gotten about 1 second slower since v10, and that time is
> coming entirely out of developers' hides despite parallelism because
> it was already the slowest in its group.
>
> So I continue to maintain that an unreasonable fraction of the total
> resources devoted to the regular regression tests is going into these
> new hashjoin tests.

I think there is an affirmative desire on the part of many
contributors to have newer features tested more thoroughly than old
ones were.  That will tend to mean that recently added features have
test suites that are longer-running, relative to the value of the
feature they test, than was the norm in the past.  When this has been
discussed at developer meetings, everyone except you (and to a lesser
extent me) has been in favor of this.  Even if that meant that you had
to wait 1 extra second every time you ran 'make check', I would judge
that worthwhile.  But it probably doesn't, because there are a lot of
things that can be done to improve this situation, such as...

> Based on these numbers, it seems like one easy thing we could do to
> reduce parallel check time is to split the plpgsql test into several
> scripts that could run in parallel.  But independently of that,
> I think we need to make an effort to push hashjoin's time back down.

...this.  Also, the same technique could probably be applied to the
join test itself.  I think Thomas just added the tests to that file
because it already existed, but there's nothing to say that the file
couldn't be split into several chunks.  On a quick look, that file
appears to be testing a lot of pretty different things,
and it's one of the largest test case files, accounting for ~3% of the
total test suite by itself.
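For instance, here is a minimal sketch of what that could look like in
src/test/regress/parallel_schedule (the join_hash file name and the
group contents are hypothetical, not a worked-out proposal):

    # today: join runs as one long script within its parallel group
    test: ... join ...

    # split: move the parallel hash join cases into their own script,
    # so the two files run concurrently within the group
    test: ... join join_hash ...

The new cases would then live in sql/join_hash.sql with a matching
expected/join_hash.out, and serial_schedule would need the same entry
added.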

Another thing you could do is consider applying the patch Thomas
already posted to reduce the size of the tables involved.  The problem
is that, for you and the buildfarm to be happy, the tests have to (1)
run near-instantaneously even on thoroughly obsolete hardware, (2)
give exactly the same answers on 32-bit systems, 64-bit systems,
Linux, Windows, AIX, HP-UX, etc., and (3) give those same exact
answers 100% deterministically on all of those platforms.  Parallel
query is inherently non-deterministic about things like how much work
goes to each worker, and I think that really small tests will tend to
show more edge cases like one worker not doing anything.  So it might
be that if we cut down the sizes of the test cases we'll spend more
time troubleshooting the resulting instability than any developer time
we would've saved by reducing the runtime.  But we can try it.
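For concreteness, here is a minimal sketch of the pattern that keeps
such tests deterministic (the table name and row count here are made
up for illustration, not the actual test schema): force the parallel
plan with the costing GUCs, and emit only aggregates, so the answer
doesn't depend on how rows are divided among workers:

    begin;
    create table hjtest as select generate_series(1, 10000) as id;
    analyze hjtest;
    -- make a parallel hash join attractive even on a tiny table
    set local min_parallel_table_scan_size = 0;
    set local parallel_setup_cost = 0;
    set local parallel_tuple_cost = 0;
    set local enable_parallel_hash = on;
    -- the plan shape is stable across platforms; the costs are not
    explain (costs off)
      select count(*) from hjtest r join hjtest s using (id);
    -- an aggregate's value doesn't depend on which worker saw which rows
    select count(*) from hjtest r join hjtest s using (id);
    rollback;

Shrinking the row count further increases the odds that some worker
processes zero tuples; the aggregate above wouldn't change, but any
check that looks at per-worker behavior could, and that's the kind of
instability I'm worried about.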

>> One caveat is that old machines also
>> somewhat approximate testing with more instrumentation / debugging
>> enabled (say valgrind, CLOBBER_CACHE_ALWAYS, etc). So removing excessive
>> test overhead has still quite some benefits. But I definitely do not
>> want to lower coverage to achieve it.
>
> I don't want to lower coverage either.  I do want some effort to be
> spent on achieving test coverage intelligently, rather than just throwing
> large test cases at the code without consideration of the costs.

I don't believe that any such thing is occurring, and I think it's
wrong of you to imply that these test cases were added
unintelligently.  To me, that seems like an ad hominem attack on both
Thomas (who spent a year or more developing the feature those test
cases exercise) and Andres (who committed them).

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

