XPRS - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | XPRS |
Date | |
Msg-id | CA+hUKGL-Fo9mZyFK1tdmzFng2puRBrgROsCiB1=n7wP79mTZ+g@mail.gmail.com Whole thread Raw |
Responses |
Re: XPRS
|
List | pgsql-hackers |
Hello, After rereading some old papers recently, I wanted to share some thoughts about XPRS and modern PostgreSQL. XPRS stood for "eXtended Postgres on RAID and Sprite", and was a research project done nearly three decades ago at Berkeley by the POSTGRES group working with operating system researchers, on a shared memory machine with a lot of CPUs for the time (12). As far as I can tell (and if anyone knows how to find out, I'd love to know), the parallel query parts of the XPRS system described in various writings by Wei Hong and Michael Stonebraker are actually present in the POSTGRES 4.2 tarball and were removed in Postgres95. Evidence: 4.2's parallel code is all wrapped in #ifdef sequent, and we know that XPRS ran on a Sequent Symmetry; the parallel hash join algorithm matches the description given in Hong's doctoral thesis; the page and range based parallel scans seen in various places also seem to match. Hong's thesis covers a lot of material and I certainly haven't understood all of it, but basically it's about how to share CPU, IO bandwidth and memory out fairly and dynamically at execution time so you're using the whole system efficiently. Facets of this problem obviously keep coming up on this mailing list (see practically any discussion of parallel degree, admission control, prefetch or work_mem, all of which we punt on by deferring to user supplied GUCs and scan size-based heuristics). Here are three things I wanted to highlight from Hong's 1991 paper[1] (later writings elaborate greatly but this one is short and much easier to read than the thesis and sets the scene): 1. "The overall performance goal of a multiprocessor database system is to obtain increased throughput as well as reduced response time in a multiuser environment. The objective function that XPRS uses for query optimization is a combination of resource consumption and response time as follows: cost = resource_consumption + w * response_time, where w is a system-specific weighting factor." 2. The "Buffer-Size-Independent Hypothesis" (here meaning work_mem): "The choice of the best sequential plan is insensitive to the amount of buffer space available as long as the buffer size is above the hash join threshold" (with a caveat about localised special cases that can be handled by choosing alternative subplans at runtime). 3. The "Two-Phase Hypothesis": "The best parallel plan is a parallelization of the best sequential plan." I read all of that a while back while working on bits of parallel query machinery (though I only realised after the fact that I'd implemented parallel hash the same way as Hong did 27 years earlier, that is, shared no-partition, which is now apparently back in vogue due to the recent ubiquity of high core count shared memory systems, so that every server looks a bit like a Sequent Symmetry; for example Oracle is rumoured to have a "parallel shared hash join" like ours in the pipeline). I didn't understand the context or importance of XPRS, though, until I read this bit of Hellerstein's "Looking Back at Postgres"[2]: "In principle, parallelism “blows up” the plan space for a query optimizer by making it multiply the traditional choices made during query optimization (data access, join algorithms, join orders) against all possible ways of parallelizing each choice. The basic idea of what Stonebraker called “The Wei Hong Optimizer” was to cut the problem in two: run a traditional single-node query optimizer in the style of System R, and then “parallelize” the resulting single-node query plan by scheduling the degree of parallelism and placement of each operator based on data layouts and system configuration. This approach is heuristic, but it makes parallelism an additive cost to traditional query optimization, rather than a multiplicative cost. Although “The Wei Hong Optimizer” was designed in the context of Postgres, it became the standard approach for many of the parallel query optimizers in industry." I don't know what to think about the buffer-size-independent hypothesis, but the two-phase hypothesis and the claim that is is the standard approach caught my attention. Firstly, I don't think the hypothesis holds on our system currently, because (for example) we lack parallel merge joins and sorts, so you couldn't parallelise such serial plans, and yet we'd already have thrown away a hash join based plan that would be vastly better in parallel. That might be just an implementation completeness problem. I wonder what fundamental problems lurk here. (Perhaps the non-availability of globally unique partial paths?) Anyway, AFAICS we do the exact thing Hong wanted to avoid: we plan parallel queries as extra paths at planning time. We don't really suffer too much of a planning explosion though, because we don't consider different parallel degrees. If we did, because our cost model doesn't include any penalty for resource usage, I suspect we'd always go for the maximum number of workers because they're 'free', which creates a perverse incentive to burn resource (CPU + copies of work_mem). Those are all problems Hong solved with execution time resource allocation, as part of a bigger picture. I have no idea what to do about any of this but thought that was an interesting bit of our project's history worth sharing. It's really humbling to read these old papers. I wonder if we're missing a chance to stand on the shoulders of giants. [1] http://db.cs.berkeley.edu/jmh/tmp/pdis91-xprs.pdf [2] https://arxiv.org/pdf/1901.01973.pdf -- Thomas Munro https://enterprisedb.com
pgsql-hackers by date: