Re: a raft of parallelism-related bug fixes - Mailing list pgsql-hackers

From Robert Haas
Subject Re: a raft of parallelism-related bug fixes
Date
Msg-id CA+Tgmobt+N03WguRO-2UOJRB4HccALPDC8xW8h5gSY8fVM3tJA@mail.gmail.com
Whole thread Raw
In response to Re: a raft of parallelism-related bug fixes  (Simon Riggs <simon@2ndQuadrant.com>)
Responses Re: a raft of parallelism-related bug fixes  (Noah Misch <noah@leadboat.com>)
Re: a raft of parallelism-related bug fixes  (Andrew Dunstan <andrew@dunslane.net>)
Re: a raft of parallelism-related bug fixes  (Robert Haas <robertmhaas@gmail.com>)
Re: a raft of parallelism-related bug fixes  (Simon Riggs <simon@2ndQuadrant.com>)
Re: a raft of parallelism-related bug fixes  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On Sat, Oct 17, 2015 at 9:16 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
> From reading this my understanding is that there isn't a test suite included
> with this commit?

Right.  The patches on the thread contain code that can be used for
testing, but the committed code does not itself include test coverage.
I welcome thoughts on how we could perform automated testing of this
code.  I think at least part of the answer is that I need to press on
toward getting the rest of Amit's parallel sequential scan patch
committed, because then this becomes a user-visible feature and I
expect that to make it much easier to find whatever bugs remain.  A
big part of the difficulty in testing this up until now is that I've
been building towards, hey, we have parallel query.  Until we actually
do, you need to write C code to test this, which raises the bar
considerably.

Now, that does not mean we shouldn't test this in other ways, and of
course I have, and so have Amit and other people from the community -
of late, Noah Misch and Haribabu Kommi have found several bugs through
code inspection and testing, which included some of the same ones that
I was busy finding and fixing using the test code attached to this
thread.  That's one of the reasons why I wanted to press forward with
getting the fixes for those bugs committed. It's just a waste of
everybody's time if we re-finding known bugs for which fixes already
exist.

But the question of how to test this in the buildfarm is a good one,
and I don't have a complete answer.  Once the rest of this goes in,
which I hope will be soon, we can EXPLAIN or EXPLAIN ANALYZE or just
straight up run parallel queries in the regression test suite and see
that they behave as expected.  But I don't expect that to provide
terribly good test coverage.  One idea that I think would provide
*excellent* test coverage is to take the test code included on this
thread and run it on the buildfarm.  The idea of the code is to
basically run the regression test suite with every parallel-eligible
query forced to unnecessarily use parallelism.  Turning that and
running 'make check' found, directly or indirectly, all of these bugs.
Doing that on the whole buildfarm would probably find more.

However, I'm pretty sure that we don't want to switch the *entire*
buildfarm to using lots of unnecessary parallelism.  What we might be
able to do is have some critters that people spin up for this precise
purpose.  Just like we currently have CLOBBER_CACHE_ALWAYS buildfarm
members, we could have GRATUITOUSLY_PARALLEL buildfarm members.  If
Andrew is willing to add buildfarm support for that option and a few
people are willing to run critters in that mode, I will be happy -
more than happy, really - to put the test code into committable form,
guarded by a #define, and away we go.

Of course, other ideas for testing are also welcome.

> I've tried to review the Gather node commit and I note that the commit
> message contains a longer description of the functionality in that patch
> than any comments in the patch as a whole. No design comments, no README, no
> file header comments. For such a major feature that isn't acceptable - I
> would reject a patch from others on that basis alone (and have done so). We
> must keep the level of comments high if we are to encourage wider
> participation in the project.

It's good to have your perspective on how this can be improved, and
I'm definitely willing to write more documentation.  Any lack in that
area is probably due to being too close to the subject area, having
spent several years on parallelism in general, and 200+ emails on
parallel sequential scan in particular.  Your point about the lack of
a good header file comment for execParallel.c is a good one, and I'll
rectify that next week.

It's worth noting, though, that the executor files in general don't
contain great gobs of comments, and the executor README even has this
vintage 2001 comment:

XXX a great deal more documentation needs to be written here...

Well, yeah.  It's taken me a long time to understand how the executor
actually works, and there are parts of it - particularly related to
EvalPlanQual - that I still don't fully understand.  So some of the
lack of comments in, for example, nodeGather.c is because it copies
the style of other executor nodes, like nodeSeqscan.c.  It's not
exactly clear to me what more to document there.  You either
understand what a rescan node is, in which case the code for each
node's rescan method tends to be fairly self-evident, or you don't -
but that clearly shouldn't be re-explained in every file.  So I guess
what I'm saying is I could use some advice on what kinds things would
be most useful to document, and where to put that documentation.

Right now, the best explanation of how parallelism works is in
src/backend/access/transam/README.parallel -- but, as you rightly
point out, that doesn't cover the executor bits.  Should we have SGML
documentation under "VII. Internals" that explains what's under the
hood in the same way that we have sections for "Database Physical
Storage" and "PostgreSQL Coding Conventions"?  Should the stuff in the
existing README.parallel be moved there?  Or I could just add some
words to src/backend/executor/README to cover the parallel executor
stuff, if that is preferred.  Advice?

Also, regardless of how we document what's going on at the code level,
I think we probably should have a section *somewhere* in the main SGML
documentation that kind of explains the general concepts behind
parallel query from a user/DBA perspective.  But I don't know where to
put it.  Under "Server Administration"?  Exactly what to explain there
needs some thought, too.  I'm sort of wondering if we need two
chapters in the documentation on this, one that covers it from a
user/DBA perspective and the other of which covers it from a hacker
perspective.  But then maybe the hacker stuff should just go in README
files.  I'm not sure.  I may have to try writing some of this and see
how it goes, but advice is definitely appreciated.

I am happy to definitively commit to writing whatever documentation
the community feels is necessary here, and I will do that certainly
before end of development for 9.6 and hopefully much sooner than that.
I will do that even if I don't get any specific feedback on what to
write and where to put it, but the more feedback I get, the better the
result will probably be.  Some of the reason this hasn't been done
already is because we're still getting the infrastructure into place,
and we're fixing and adjusting things as we go along, so while the
overall picture isn't changing much, there are bits of the design that
are still in flux as we realize, oh, crap, that was a dumb idea.  As
we get a clearer idea what will be in 9.6, it will get easier to
present the overall picture in a coherent way.

> So reviewing patch 13 isn't possible without prior knowledge.

The basic question for patch 13 is whether ephemeral record types can
occur in executor tuples in any contexts that I haven't identified.  I
know that a tuple table slot can contain have a column that is of type
record or record[], and those records can themselves contain
attributes of type record or record[], and so on as far down as you
like.  I *think* that's the only case.  For example, I don't believe
that a TupleTableSlot can contain a *named* record type that has an
anonymous record buried down inside of it somehow.  But I'm not
positive I'm right about that.

> Hoping we'll be able to find some time on this at PGConf.eu; thanks for
> coming over.

Sure thing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Petr Jelinek
Date:
Subject: Re: [DOCS] max_worker_processes on the standby
Next
From: Robert Haas
Date:
Subject: Re: buildfarm failures on crake and sittella