using custom scan nodes to prototype parallel sequential scan - Mailing list pgsql-hackers

From Robert Haas
Subject using custom scan nodes to prototype parallel sequential scan
Date
Msg-id CA+TgmoYp6C=LY8Cf26csp=0E5WsYcr4_kKHvbk2cG1BvFwF0og@mail.gmail.com
Whole thread Raw
Responses Re: using custom scan nodes to prototype parallel sequential scan  (Andres Freund <andres@2ndquadrant.com>)
Re: using custom scan nodes to prototype parallel sequential scan  (Simon Riggs <simon@2ndQuadrant.com>)
List pgsql-hackers
On Wed, Oct 15, 2014 at 2:55 PM, Simon Riggs <simon@2ndquadrant.com> wrote:
> Something usable, with severe restrictions, is actually better than we
> have now. I understand the journey this work represents, so don't be
> embarrassed by submitting things with heuristics and good-enoughs in
> it. Our mentor, Mr.Lane, achieved much by spreading work over many
> releases, leaving others to join in the task.

It occurs to me that, now that the custom-scan stuff is committed, it
wouldn't be that hard to use that, plus the other infrastructure we
already have, to write a prototype of parallel sequential scan.  Given
where we are with the infrastructure, there would be a number of
unhandled problems, such as deadlock detection (needs group locking or
similar), assessment of quals as to parallel-safety (needs
proisparallel or similar), general waterproofing to make sure that
pushing down a qual we shouldn't doesn't do anything really dastardly
like crash the server (another written but yet-to-be-published patch
adds a bunch of relevant guards), and snapshot sharing (likewise).
But if you don't do anything weird, it should basically work.

I think this would be useful for a couple of reasons.  First, it would
be a demonstrable show of progress, illustrating how close we are to
actually having something you can really deploy.  Second, we could use
it to demonstrate how the remaining infrastructure patches close up
gaps in the initial prototype.  Third, it would let us start doing
real performance testing.  It seems pretty clear that a parallel
sequential scan of data that's in memory (whether in shared_buffers
or the OS page cache) can be accelerated by having multiple processes
scan it in parallel.  But it's much less clear what will happen when
the data is being read in from disk.  Does parallelism help at all?
What
is being read in from disk.  Does parallelism help at all?  What
degree of parallelism helps?  Do we break OS readahead so badly that
performance actually regresses?  These are things that are likely to
need a fair amount of tuning before this is ready for prime time, so
being able to start experimenting with them in advance of all of the
infrastructure being completely ready seems like it might help.

Thoughts?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


