XPRS - Mailing list pgsql-hackers

From Thomas Munro
Subject XPRS
Msg-id CA+hUKGL-Fo9mZyFK1tdmzFng2puRBrgROsCiB1=n7wP79mTZ+g@mail.gmail.com
List pgsql-hackers
Hello,

After rereading some old papers recently, I wanted to share some
thoughts about XPRS and modern PostgreSQL.  XPRS stood for "eXtended
Postgres on RAID and Sprite", and was a research project done nearly
three decades ago at Berkeley by the POSTGRES group working with
operating system researchers, on a shared memory machine with a lot of
CPUs for the time (12).

As far as I can tell (and if anyone knows how to find out, I'd love to
know), the parallel query parts of the XPRS system described in
various writings by Wei Hong and Michael Stonebraker are actually
present in the POSTGRES 4.2 tarball and were removed in Postgres95.
Evidence: 4.2's parallel code is all wrapped in #ifdef sequent, and we
know that XPRS ran on a Sequent Symmetry; the parallel hash join
algorithm matches the description given in Hong's doctoral thesis; the
page and range based parallel scans seen in various places also seem
to match.

Hong's thesis covers a lot of material and I certainly haven't
understood all of it, but basically it's about how to share out CPU,
I/O bandwidth and memory fairly and dynamically at execution time, so
that you're using the whole system efficiently.  Facets of this
problem obviously keep coming up on this mailing list (see practically
any discussion of parallel degree, admission control, prefetching or
work_mem, all of which we punt on by deferring to user-supplied GUCs
and scan-size-based heuristics).

Here are three things I wanted to highlight from Hong's 1991 paper[1]
(later writings elaborate greatly but this one is short and much
easier to read than the thesis and sets the scene):

1.  "The overall performance goal of a multiprocessor database system
is to obtain increased throughput as well as reduced response time in
a multiuser environment.  The objective function that XPRS uses for
query optimization is a combination of resource consumption and
response time as follows:  cost = resource_consumption + w *
response_time, where w is a system-specific weighting factor."
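To make the objective concrete, here's a toy sketch (all numbers
invented, not XPRS's actual cost formulas): two hypothetical plans for
the same query, where the parallel one answers faster but consumes
more total resources, and the weight w decides which wins.

```python
def combined_cost(resource_consumption, response_time, w):
    """Hong's objective: small w favours throughput (total resource
    consumption), large w favours latency (response time)."""
    return resource_consumption + w * response_time

# Hypothetical plans: a serial plan, and a 4-worker parallel plan that
# is faster but burns extra CPU on coordination.
serial = {"resource_consumption": 100.0, "response_time": 100.0}
parallel = {"resource_consumption": 120.0, "response_time": 30.0}

for w in (0.0, 0.5, 2.0):
    costs = {name: combined_cost(p["resource_consumption"],
                                 p["response_time"], w)
             for name, p in (("serial", serial), ("parallel", parallel))}
    best = min(costs, key=costs.get)
    print(f"w={w}: serial={costs['serial']:.0f} "
          f"parallel={costs['parallel']:.0f} -> {best}")
```

With w = 0 the optimizer is a pure throughput optimizer and picks the
serial plan; as w grows it increasingly pays for parallelism.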

2.  The "Buffer-Size-Independent Hypothesis" (here meaning work_mem):
"The choice of the best sequential plan is insensitive to the amount
of buffer space available as long as the buffer size is above the hash
join threshold" (with a caveat about localised special cases that can
be handled by choosing alternative subplans at runtime).
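A toy illustration of what the hypothesis claims (costs entirely
invented; the real formulas are in the paper): above the threshold,
varying the buffer size changes plan costs but not which plan wins.

```python
import math

INNER, OUTER = 50.0, 200.0  # hypothetical relation sizes

def hash_join_cost(buf):
    # When the inner side doesn't fit in buf, partition into batches
    # and pay some extra spill I/O (a rough Grace-style approximation).
    batches = math.ceil(INNER / buf)
    spill_io = 0.3 * (INNER + OUTER) * (1 - 1 / batches)
    return INNER + OUTER + spill_io

def merge_join_cost(buf):
    # Sorting both sides; roughly insensitive to buffer size here.
    return (INNER + OUTER) * 1.8

for buf in (5, 10, 25, 50, 100):
    h, m = hash_join_cost(buf), merge_join_cost(buf)
    print(f"buf={buf}: hash={h:.1f} merge={m:.1f} -> "
          f"{'hash' if h < m else 'merge'}")
```

The costs move around as buf shrinks, but the winner is stable, which
is the property that lets the optimizer ignore buffer size when
choosing the plan shape.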

3.  The "Two-Phase Hypothesis": "The best parallel plan is a
parallelization of the best sequential plan."
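A back-of-envelope sketch of why that hypothesis matters (join counts
and degree choices invented for illustration): searching parallel
degrees alongside the serial choices multiplies the plan space, while
two-phase planning only adds a scheduling step for the one winning
serial plan.

```python
from itertools import permutations

tables = ["a", "b", "c", "d"]
join_algos = ["hash", "merge", "nestloop"]
degrees = [1, 2, 4, 8]
njoins = len(tables) - 1

# Serial plan space: join orders times algorithm choices per join.
n_serial = len(list(permutations(tables))) * len(join_algos) ** njoins

# One-phase: cross every serial choice with every degree per join.
one_phase = n_serial * len(degrees) ** njoins

# Two-phase: search the serial space once, then pick degrees for the
# single best plan.
two_phase = n_serial + len(degrees) * njoins

print(f"serial plans: {n_serial}")
print(f"one-phase (multiplicative): {one_phase}")
print(f"two-phase (additive): {two_phase}")
```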

I read all of that a while back while working on bits of parallel
query machinery, though I only realised after the fact that I'd
implemented parallel hash the same way Hong did 27 years earlier:
shared no-partition, an approach now apparently back in vogue thanks
to the recent ubiquity of high-core-count shared memory systems, so
that every server looks a bit like a Sequent Symmetry.  (For example,
Oracle is rumoured to have a "parallel shared hash join" like ours in
the pipeline.)  I didn't understand the context or importance of XPRS,
though, until I read this bit of Hellerstein's "Looking Back at
Postgres"[2]:

"In principle, parallelism “blows up” the plan space for a query
optimizer by making it multiply the traditional choices made during
query optimization (data access, join algorithms, join orders) against
all possible ways of parallelizing each choice. The basic idea of what
Stonebraker called “The Wei Hong Optimizer” was to cut the problem in
two: run a traditional single-node query optimizer in the style of
System R, and then “parallelize” the resulting single-node query plan
by scheduling the degree of parallelism and placement of each operator
based on data layouts and system configuration. This approach is
heuristic, but it makes parallelism an additive cost to traditional
query optimization, rather than a multiplicative cost.

Although “The Wei Hong Optimizer” was designed in the context of
Postgres, it became the standard approach for many of the parallel
query optimizers in industry."
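For anyone unfamiliar with the shared no-partition design mentioned
above, here's a minimal sketch (in Python with threads, purely for
illustration; the real thing uses shared memory, per-bucket atomics
and barriers): every worker helps build one shared hash table from its
slice of the inner side, everyone waits at a barrier, then every
worker probes with its slice of the outer side.

```python
import threading

def parallel_hash_join(build_rows, probe_rows, nworkers=4):
    table = {}                       # the single shared hash table
    lock = threading.Lock()          # stand-in for finer-grained sync
    barrier = threading.Barrier(nworkers)
    results = []

    def worker(i):
        # Build phase: insert this worker's slice of the inner side.
        for key, payload in build_rows[i::nworkers]:
            with lock:
                table.setdefault(key, []).append(payload)
        barrier.wait()               # nobody probes until build is done
        # Probe phase: probe with this worker's slice of the outer side.
        local = [(key, payload, match)
                 for key, payload in probe_rows[i::nworkers]
                 for match in table.get(key, ())]
        with lock:
            results.extend(local)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(nworkers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

build = [(k, f"inner{k}") for k in range(10)]
probe = [(k, f"outer{k}") for k in range(5, 15)]
print(sorted(parallel_hash_join(build, probe)))  # keys 5..9 match
```

The point of the design is that there's only one hash table, so memory
isn't multiplied by worker count and no partitioning pass is needed.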

I don't know what to think about the buffer-size-independent
hypothesis, but the two-phase hypothesis and the claim that it is the
standard approach caught my attention.  Firstly, I don't think the
hypothesis holds on our system currently, because (for example) we
lack parallel merge joins and sorts, so you couldn't parallelise such
serial plans, and yet we'd already have thrown away a hash-join-based
plan that would be vastly better in parallel.  That might be just an
implementation completeness problem.  I wonder what fundamental
problems lurk here.  (Perhaps the non-availability of globally unique
partial paths?)  Anyway, AFAICS we do the exact thing Hong wanted to
avoid: we plan parallel queries as extra paths at planning time.  We
don't really suffer too much of a planning explosion though, because
we don't consider different parallel degrees.  If we did, because our
cost model doesn't include any penalty for resource usage, I suspect
we'd always go for the maximum number of workers because they're
'free', which creates a perverse incentive to burn resources (CPU +
copies of work_mem).  Those are all problems Hong solved with
execution time resource allocation, as part of a bigger picture.
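To see the perverse incentive in miniature, here's a hedged sketch
(cost model entirely invented, not PostgreSQL's): if response time is
all that counts, the chosen degree is just "as many workers as still
help at all", but charging for total resource consumption, including a
work_mem copy per worker, picks a much smaller degree.

```python
WORK_MEM = 10.0      # hypothetical per-worker memory cost
SERIAL_WORK = 100.0  # hypothetical total work for the query

def response_time(workers):
    # Imperfect scaling: coordination overhead grows with workers.
    return SERIAL_WORK / workers + 2.0 * workers

def resource_consumption(workers):
    # Total CPU plus one work_mem copy per worker.
    return SERIAL_WORK + workers * WORK_MEM

def pick_degree(w):
    """Degree minimising Hong's objective for weighting factor w."""
    return min(range(1, 33),
               key=lambda n: resource_consumption(n) + w * response_time(n))

print("latency only (resources 'free'):",
      min(range(1, 33), key=response_time))
print("combined objective, w=1:", pick_degree(1.0))
```

Under this toy model the latency-only optimizer picks 7 workers while
the combined objective settles for 3, trading a little response time
for much less memory and CPU.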

I have no idea what to do about any of this but thought that was an
interesting bit of our project's history worth sharing.  It's really
humbling to read these old papers.  I wonder if we're missing a chance
to stand on the shoulders of giants.

[1] http://db.cs.berkeley.edu/jmh/tmp/pdis91-xprs.pdf
[2] https://arxiv.org/pdf/1901.01973.pdf

--
Thomas Munro
https://enterprisedb.com


