Re: XPRS - Mailing list pgsql-hackers
From | Thomas Munro |
---|---|
Subject | Re: XPRS |
Date | |
Msg-id | CA+hUKGJEMT7SSZRqt-knu_3iLkdscBCe9M2nrhC259FdE5bX7g@mail.gmail.com |
In response to | Re: XPRS (Tomas Vondra <tomas.vondra@2ndquadrant.com>) |
Responses | Re: XPRS |
List | pgsql-hackers |
On Sat, Aug 24, 2019 at 3:19 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> > Although “The Wei Hong Optimizer” was designed in the context of
> > Postgres, it became the standard approach for many of the parallel
> > query optimizers in industry."
>
> I assume this quote is from 30 years ago. I wonder if the claim is still
> true, on current hardware (including e.g. distributed databases).

The quote is from 2018, and appears in the article I linked (it's a chapter from the book Making Databases Work: The Wisdom of Michael Stonebraker), but I'm not sure which systems it's referring to.

Speculation: Many other systems have what we call parallel-oblivious operators only, and then insert various exchange operators to make a parallel plan. That is, they don't necessarily have a direct equivalent of our "Parallel Sequential Scan", "Parallel Index Scan" or "Parallel Foreign Scan": they just use their regular scans, possibly with the addition of some kind of "Parallel Scatter" node (that's my made-up name; it's the opposite of Gather, and is called various things like "page supplier" or "block iterator") or "Parallel Repartition" inserted in the right places (see the made-up plan shapes sketched below). Perhaps they create a serial plan first, and then try to parallelise it by inserting various parallel operators and recomputing the costs, rather than considering the separate paths in the first phase of the optimiser, as we do?

The cases where Hong's best-parallel-plan hypothesis isn't true for us now might go away if we had Parallel Repartition, so that each 'stream' would be the complete set of tuples for some known partition. To be clear, I'm not suggesting we do that necessarily, just pointing out some interesting claims about ancient POSTGRES wisdom, in a highly speculative water cooler thread.

Actually, this isn't the first time it's occurred to me that elements of our design have fallen out of the order in which we chose to implement things. Another example is the "Shared Hash" that I had in an early version of my work on Parallel Hash Join, where just one process would run a parallel-safe but non-parallel-oblivious plan to build a shared hash table while other workers twiddled their thumbs. I dropped it because our cost model has no penalty for running N copies of the same plan rather than just one, so there was no way to pick that plan; that's because we don't have a cost model like Hong's that considers resource usage too.

Another, more speculative observation: maybe no-partition/shared Parallel Hash Join is only obvious if you already have the general concept of parallel-aware executor nodes. AFAIK Robert and Amit invented those to be able to coordinate parallel scans between processes, where thread-based systems might be able to share a single scan state somehow under a Scatter-like operator. If you had threads, you might not need that concept, which allows arbitrary executor nodes to communicate with each other between workers, and then it might be more obvious and natural to use repartitioning for parallelising hash joins.
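Concretely, the contrast I have in mind looks something like this, with entirely made-up plan shapes (Parallel Scatter doesn't exist and I'm not proposing it; Gather, Seq Scan and Parallel Seq Scan are our usual nodes). An exchange-operator system might parallelise an ordinary serial scan roughly like this:

  Gather
    -> Parallel Scatter            (my made-up "block iterator"/"page supplier")
         -> Seq Scan on t          (a plain, parallel-oblivious scan)

whereas we plan the parallel-aware node directly:

  Gather
    -> Parallel Seq Scan on t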
> FWIW I think we'll have to do something about resource acquisition, sooner
> or later. It was always quite annoying that we don't really consider
> memory consumption of the query as a whole during planning, and parallel
> query made it a bit more painful.

Agreed. Here's an approach I have been wondering about to cap total executor memory usage, which is a much more down-to-Earth idea than any of the above space cadet planner problems.

Let's start at the other end of the problem, by introducing admission control and memory quotas. That is, keep using work_mem with its current per-node-copy meaning at planning time, for now, and then (a toy sketch of the resulting bookkeeping follows at the end of this mail):

1. Compute the peak amount of memory each plan thinks it will need. Initially that could be done by summing estimates from all nodes and considering workers. A later refinement could deal with nodes that give back memory early, if we get around to doing that. The estimate could be shown by EXPLAIN. (Some details to work out: worst case vs expected, etc.)

2. Introduce a new GUC global_work_mem, which limits the total estimated memory of all the plans that are allowed to run concurrently. Introduce a shared memory counter of currently allocated quota.

3. Introduce a new GUC session_work_mem, which is the amount of quota that every session tries to acquire when it connects, or perhaps when it first runs a query, and that it won't give back until the end of the session. Or perhaps sessions acquire less than that if they need less, but that's the amount they never give back once they've got that much. The idea is to allow queries with estimates under that limit, for example high-frequency OLTP queries, to avoid any extra communication overhead from this scheme.

4. To run queries that have estimates higher than the session's currently allocated quota, the session must acquire more quota for the duration of the query. If it can't be acquired right now without exceeding global_work_mem, it has to join a queue and wait. A refinement could be that you are allowed to run with fewer workers than planned, to reduce the requirement.

5. While executing, executor nodes could opportunistically ask for more quota than was planned for, up to some limit, to avoid having to spill to disk. If the request is unsuccessful, that's OK, they can deal with that.

6. So long as we have nodes that have no escape mechanism in certain edge cases (hash aggregates and joins with bad stats and extreme skew), you could perhaps have the option of raising an error, or of forcing the total to exceed global_work_mem temporarily with a warning (which would at least prevent other large queries from running and making it worse).

7. Regular heap memory and DSM memory should be counted together, since it makes no difference to the operating system: it's all memory, and we should count it against the same quota. You'd probably want to consider hidden allocator fragmentation too, as well as other hidden overheads, to get decent accuracy.

This is sort of a fudging together of ideas from conversations with Kevin Grittner (who talked about admission control a few years back), Peter Geoghegan (who mentioned opportunistically asking for more), and things I've heard of on SQL Server ("memory grants"). I think it would provide some relief from the problems we see today: it's hard to set work_mem so that you never get OOM but you can still use a decent amount of your precious memory, especially with mixed parallel and non-parallel query workloads, thanks to our current work_mem-multiplying design.
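To make the bookkeeping in steps 2-5 concrete, here's a standalone toy sketch of the kind of shared counter I have in mind. None of these names exist anywhere (MemQuota, quota_acquire and friends are invented for this mail), and a real version would keep the counter in shared memory and use our own locking and wait-queue machinery rather than pthreads, but it shows the acquire / try-acquire / release life cycle:

/*
 * Toy model of the admission control idea above -- not PostgreSQL code.
 * All names here are made up for illustration.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

typedef struct MemQuota
{
    pthread_mutex_t lock;
    pthread_cond_t  released;       /* signalled when quota is given back */
    size_t          global_limit;   /* global_work_mem */
    size_t          granted;        /* total currently granted to sessions */
} MemQuota;

static MemQuota quota = {
    .lock = PTHREAD_MUTEX_INITIALIZER,
    .released = PTHREAD_COND_INITIALIZER,
    .global_limit = (size_t) 1024 * 1024 * 1024,    /* pretend 1GB */
    .granted = 0
};

/*
 * Step 4: block until "bytes" fits under global_work_mem, i.e. join the
 * queue.  (A request larger than the limit would wait forever in this toy
 * version.)
 */
static void
quota_acquire(MemQuota *q, size_t bytes)
{
    pthread_mutex_lock(&q->lock);
    while (q->granted + bytes > q->global_limit)
        pthread_cond_wait(&q->released, &q->lock);
    q->granted += bytes;
    pthread_mutex_unlock(&q->lock);
}

/* Step 5: opportunistic request; on failure the node just spills as usual. */
static bool
quota_try_acquire(MemQuota *q, size_t bytes)
{
    bool    ok;

    pthread_mutex_lock(&q->lock);
    ok = (q->granted + bytes <= q->global_limit);
    if (ok)
        q->granted += bytes;
    pthread_mutex_unlock(&q->lock);
    return ok;
}

/* Give quota back at end of query (or end of session for its reservation). */
static void
quota_release(MemQuota *q, size_t bytes)
{
    pthread_mutex_lock(&q->lock);
    q->granted -= bytes;
    pthread_cond_broadcast(&q->released);   /* wake queued queries */
    pthread_mutex_unlock(&q->lock);
}

int
main(void)
{
    quota_acquire(&quota, 64 * 1024 * 1024);     /* session_work_mem-style reservation */
    quota_acquire(&quota, 256 * 1024 * 1024);    /* a big query's planned estimate */
    if (quota_try_acquire(&quota, 128 * 1024 * 1024))   /* step 5 top-up */
        quota_release(&quota, 128 * 1024 * 1024);
    quota_release(&quota, 256 * 1024 * 1024);
    printf("granted now: %zu bytes\n", quota.granted);  /* back to the reservation */
    return 0;
}

The interesting policy questions (FIFO vs. by-size ordering of the queue, how long opportunistic grants may be held, and so on) all live outside this little kernel.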
--
Thomas Munro
https://enterprisedb.com