Re: XPRS - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: XPRS |
Date | |
Msg-id | 20190902172030.pui2ql22xs4qtdxz@development |
In response to | Re: XPRS (Thomas Munro <thomas.munro@gmail.com>) |
Responses | Re: XPRS |
List | pgsql-hackers |
On Mon, Sep 02, 2019 at 02:19:15PM +1200, Thomas Munro wrote:
>On Sat, Aug 24, 2019 at 3:19 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> > Although “The Wei Hong Optimizer” was designed in the context of
>> > Postgres, it became the standard approach for many of the parallel
>> > query optimizers in industry."
>>
>> I assume this quote is from 30 years ago. I wonder if the claim is still
>> true, on current hardware (including e.g. distributed databases).
>
>The quote is from 2018, and appears in the article I linked (it's a
>chapter from the book Making Databases Work: The Wisdom of Michael
>Stonebraker), but I'm not sure which systems it's referring to.
>

Hmm, that's unfortunate - it'd be quite interesting to know which
databases it's referring to. I suspect no optimizer is ideal in this
regard, i.e. each database has some "gaps" where some nodes don't have a
straightforward parallel version.

>Speculation:  Many other systems have what we call parallel-oblivious
>operators only, and then insert various exchange operators to make a
>parallel plan.  That is, they don't necessarily have a direct
>equivalent of our "Parallel Sequential Scan", "Parallel Index Scan",
>"Parallel Foreign Scan": they just use their regular scans, possibly
>with the addition of some kind of "Parallel Scatter" node (that's my
>made up name, it's the opposite of Gather, called various things like
>"page supplier" or "block iterator") or "Parallel Repartition"
>inserted in the right places.  Perhaps they create a serial plan
>first, and then try to parallelise it by inserting various parallel
>operators and then recomputing the costs?  Rather than considering the
>separate paths in the first phase of the optimiser, as we do.  The
>cases where Hong's best-parallel-plan hypothesis isn't true for us now
>might go away if we had Parallel Repartition, so that each 'stream'
>would be the complete set of tuples for some known partition.
>

I don't know. It kinda reminds me of planning with distributed databases,
which also need to exchange data between nodes in various cases - say,
when joining two relations distributed in different ways. The
redistribution is however pretty costly (network I/O, bandwidth, etc.), to
the extent that it's often much better to pick a very different join
strategy to reduce the amount of data to exchange, or to eliminate the
redistribution altogether. For parallelism the costs are much lower, of
course, but I don't think we can just ignore those.

FWIW it's not clear to me why the cost would need to be recomputed after
constructing the parallel version of the plan? My understanding is that
the idea is to do cost-based planning for the serial plan, and then just
"mechanically" construct a parallel plan. Although, maybe there could be
multiple parallel alternatives ...
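For concreteness, here's a toy sketch of the "parallelise the serial plan
and recompute the costs" step as I understand it - plain C, made-up cost
constants loosely inspired by parallel_setup_cost / parallel_tuple_cost,
not actual planner code - just to illustrate that the exchange operator's
own overhead means the parallel cost isn't simply the serial cost divided
by the number of workers:

/*
 * Toy illustration (not PostgreSQL code): take the cost of the best
 * serial plan, "insert" a Gather/Scatter-style exchange on top of it,
 * and recompute the total cost.  The exchange has its own startup and
 * per-tuple overhead, which is why the costs have to be recomputed (or
 * modelled up front, as our partial paths do).
 */
#include <stdio.h>

typedef struct PlanCost
{
    double startup_cost;    /* cost before the first tuple */
    double run_cost;        /* cost to produce all tuples */
    double ntuples;         /* estimated row count */
} PlanCost;

/* made-up constants, playing the role of exchange setup/per-tuple cost */
static const double exchange_setup_cost = 1000.0;
static const double exchange_tuple_cost = 0.1;

/* Recompute the cost after putting an exchange on top of a serial plan. */
static PlanCost
parallelize(PlanCost serial, int nworkers)
{
    PlanCost parallel;

    /* the work below the exchange is split across the workers ... */
    parallel.run_cost = serial.run_cost / nworkers;
    /* ... but every tuple still has to pass through the exchange */
    parallel.run_cost += exchange_tuple_cost * serial.ntuples;
    parallel.startup_cost = serial.startup_cost + exchange_setup_cost;
    parallel.ntuples = serial.ntuples;

    return parallel;
}

int
main(void)
{
    PlanCost seqscan = {0.0, 100000.0, 1000000.0};  /* made-up numbers */

    for (int nworkers = 1; nworkers <= 8; nworkers *= 2)
    {
        PlanCost p = parallelize(seqscan, nworkers);

        printf("workers=%d total cost=%.0f\n",
               nworkers, p.startup_cost + p.run_cost);
    }
    return 0;
}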
>To be clear, I'm not suggesting we do that necessarily, just pointing
>out some interesting claims about ancient POSTGRES wisdom, in a highly
>speculative water cooler thread.  Actually, this isn't the first time
>it's occurred to me that elements of our design were falling out of
>the order that we chose to implement things in.  Another example is
>the "Shared Hash" that I had in an early version of my work on
>Parallel Hash Join, where just one process would run a parallel-safe
>but non-parallel-oblivious plan to build a shared hash table while
>other workers twiddled their thumbs; I dropped it because our cost
>model has no penalty for running N copies of the same plan rather than
>just one, so there was no way to pick that plan, and that's because we
>don't have a cost model like Hong's that considers resource usage too.
>Another more speculative observation: maybe no-partition/shared
>Parallel Hash Join is only obvious if you already have the general
>concept of parallel-aware executor nodes.  AFAIK Robert and Amit
>invented those to be able to coordinate parallel scans between
>processes, where thread-based systems might be able to share a single
>scan state somehow under a Scatter-like operator.  If you had threads,
>you might not need that concept that allows arbitrary executor nodes
>to communicate with each other between workers, and then it might be
>more obvious and natural to use repartitioning for parallelising hash
>joins.
>
>> FWIW I think we'll have to do something about resource acquisition, sooner
>> or later. It was always quite annoying that we don't really consider
>> memory consumption of the query as a whole during planning, and parallel
>> query made it a bit more painful.
>
>Agreed.
>
>Here's an approach I have been wondering about to cap total executor
>memory usage, which is a much more down-to-Earth idea than any of the
>above space cadet planner problems.  Let's start at the other end of
>the problem, by introducing admission control and memory quotas.  That
>is, keep using work_mem with its current per-node-copy meaning at
>planning time, for now, and then:
>
>1.  Compute the peak amount of memory each plan thinks it will need.
>Initially that could be done by summing estimates from all nodes
>and considering workers.  A later refinement could deal with nodes
>that give back memory early, if we get around to doing that.  The
>estimate could be shown by EXPLAIN.  (Some details to work out: worst
>case vs expected etc.)
>
>2.  Introduce a new GUC global_work_mem, which limits the total for
>plans that are allowed to run concurrently, according to their memory
>estimates.  Introduce a shared memory counter of currently allocated
>quota.
>
>3.  Introduce a new GUC session_work_mem, which is the amount of quota
>that every session tries to acquire when it connects or perhaps first
>runs a query, and that it won't give back until the end of the
>session.  Or perhaps they acquire less than that if they need less,
>but that's the amount they never give back once they've got that much.
>The idea is to allow queries with estimates under that limit, for
>example high frequency OLTP queries, to avoid any extra communication
>overhead from this scheme.
>
>4.  To run queries that have estimates higher than the session's
>current allocated quota, the session must acquire more quota for the
>duration of the query.  If it can't be acquired right now without
>exceeding global_work_mem, it has to join a queue and wait.  A
>refinement could be that you are allowed to run with fewer workers
>than planned to reduce the requirement.
>
>5.  While executing, executor nodes could opportunistically ask for
>more quota than was planned for, up to some limit, to avoid having to
>spill to disk.  If the request is unsuccessful, that's OK, they can
>deal with that.
>
>6.  So long as we have nodes that have no escape mechanism in certain
>edge cases (hash aggregates and joins with bad stats and extreme
>skew), you could perhaps have the option of raising an error or
>forcing the total to exceed global_work_mem temporarily with a warning
>(which would at least prevent other large queries from running and
>making it worse).
>
>7.  Regular heap memory and DSM memory should be counted together,
>since it makes no difference to the operating system, it's all memory
>and we should count it against the same quota.  You'd probably want to
>consider hidden allocator fragmentation too, as well as other hidden
>overheads, to get decent accuracy.
>
>This is sort of a fudging together of ideas from conversations with
>Kevin Grittner (who talked about admission control a few years back),
>Peter Geoghegan (who mentioned opportunistically asking for more), and
>things I've heard of on SQL Server ("memory grants").  I think it
>would provide some relief from the problems we see today: it's hard to
>set work_mem so that you never get OOM but you can still use a decent
>amount of your precious memory, especially with mixed parallel and
>non-parallel query workloads thanks to our current
>work_mem-multiplying design.
>

I think this is probably the simplest and most realistic first step.
Whenever I was thinking about memory acquisition, I've assumed we'd
monitor how much memory the plan is expected to use while we're
constructing it. My main problem was what to do when we reach the
per-query limit - whether to (a) simply reject the plan, (b) go back and
see if we can replan with lower work_mem (but how much, and for which
nodes?), or (c) just continue.

The proposed plan deals with this by not limiting the per-query (or
rather per-session) budget directly, and instead requesting additional
budget. Which is nice.
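To make sure I'm reading steps 2-5 correctly, here's a rough standalone
sketch of the quota accounting - plain C, with a pthread mutex and
condition variable standing in for the shared memory counter, an LWLock
and a wait queue, and with all names and numbers made up - not a proposal
for actual code:

/*
 * Rough sketch of the quota accounting from steps 2-5 above.  Not
 * PostgreSQL code: a process-local mutex/condvar stand in for a
 * shared-memory counter plus a wait queue.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct MemQuota
{
    pthread_mutex_t lock;
    pthread_cond_t  released;       /* signalled whenever quota is returned */
    int64_t         global_limit;   /* "global_work_mem" */
    int64_t         allocated;      /* currently handed-out quota */
} MemQuota;

static MemQuota quota = {
    .lock = PTHREAD_MUTEX_INITIALIZER,
    .released = PTHREAD_COND_INITIALIZER,
    .global_limit = INT64_C(8) * 1024 * 1024 * 1024,    /* 8GB, made up */
    .allocated = 0
};

/*
 * Acquire "amount" bytes of quota, waiting (admission control) if the
 * request would push the total above global_work_mem.
 */
static void
quota_acquire(int64_t amount)
{
    pthread_mutex_lock(&quota.lock);
    while (quota.allocated + amount > quota.global_limit)
        pthread_cond_wait(&quota.released, &quota.lock);
    quota.allocated += amount;
    pthread_mutex_unlock(&quota.lock);
}

/* Opportunistic variant for step 5: never waits, may simply fail. */
static bool
quota_try_acquire(int64_t amount)
{
    bool ok;

    pthread_mutex_lock(&quota.lock);
    ok = (quota.allocated + amount <= quota.global_limit);
    if (ok)
        quota.allocated += amount;
    pthread_mutex_unlock(&quota.lock);
    return ok;
}

static void
quota_release(int64_t amount)
{
    pthread_mutex_lock(&quota.lock);
    quota.allocated -= amount;
    pthread_cond_broadcast(&quota.released);
    pthread_mutex_unlock(&quota.lock);
}

int
main(void)
{
    int64_t session_work_mem = INT64_C(64) * 1024 * 1024;   /* kept for the session */
    int64_t big_query_extra = INT64_C(512) * 1024 * 1024;   /* extra for one query */
    int64_t opportunistic = INT64_C(16) * 1024 * 1024;

    quota_acquire(session_work_mem);    /* step 3: grabbed once, kept */

    quota_acquire(big_query_extra);     /* step 4: may have to wait */
    /* ... run the expensive query ... */
    quota_release(big_query_extra);

    if (quota_try_acquire(opportunistic))   /* step 5 */
    {
        printf("got some extra quota, no need to spill\n");
        quota_release(opportunistic);
    }

    quota_release(session_work_mem);    /* end of session */
    return 0;
}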
I suspect we should also keep an additional plan that is expected to meet
the session_work_mem limit, aside from the regular cheapest plan, and use
it if it's not much worse. Imagine you have a plan with cost 1000 that
needs (global_work_mem/2 + 1kB) of memory, essentially serializing
executions of this query. And then there's an alternative plan with cost
1100 that can run within session_work_mem. It seems better to just accept
the second plan, because it won't need to wait.

Another challenge with work_mem is that anyone can modify it arbitrarily,
i.e. a user can do SET work_mem = '1TB'; and use as much memory as they
wish, or even crash the system. I wonder if we could define the new GUCs
(session_work_mem and global_work_mem) in a way that prevents this. We
probably don't want to make them PGC_POSTMASTER (it seems useful to allow
overriding them in ALTER USER/DATABASE), but I don't think we have a good
way to do that at the moment. Any ideas in this direction?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services