Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) - Mailing list pgsql-hackers
| From | Robert Haas |
|---|---|
| Subject | Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) |
| Date | |
| Msg-id | CA+TgmobOwH=h8ZQ7Kn72hnteJbReZ=e3CbKsRiStZXiZYakTjA@mail.gmail.com |
| In response to | Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) (Thomas Munro <thomas.munro@enterprisedb.com>) |
| Responses | Re: [HACKERS] Parallel tuplesort (for parallel B-Tree index creation) |
| List | pgsql-hackers |
On Tue, Jan 9, 2018 at 10:36 PM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
> This looks good to me.

The addition to README.parallel is basically wrong, because workers have been allowed to write WAL since the ParallelContext machinery. See the XactLastRecEnd handling in parallel.c. Workers can, for example, do HOT cleanups during SELECT scans, just as the leader can. The language here is obsolete anyway in light of commit e9baa5e9fa147e00a2466ab2c40eb99c8a700824, but this isn't the right way to update it. I'll propose a separate patch for that.

The change to the ParallelContext signature in parallel.h makes an already-overlength line even longer. A line break seems warranted just after the first argument, plus pgindent afterward.

I am not a fan of the leader-as-worker terminology. The leader is not a worker, full stop. I think we should instead talk about whether the leader participates (so, ii_LeaderAsWorker -> ii_LeaderParticipates, for example, plus many comment updates). Similarly, it seems SortCoordinateData's nLaunched should be nParticipants, and BTLeader's nworkertuplesorts should be nparticipanttuplesorts.
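Roughly what I have in mind, as a sketch only -- the surrounding fields and the Sharedsort forward declaration here are my guesses for illustration, not the patch's actual definitions:

```c
/* Sketch only: members other than the renamed field are assumptions. */
typedef struct Sharedsort Sharedsort;	/* opaque state in shared memory */

typedef struct SortCoordinateData
{
	bool		isWorker;		/* assumed flag: true in a worker process */
	int			nParticipants;	/* was nLaunched: processes producing runs */
	Sharedsort *sharedsort;		/* assumed pointer to shared sort state */
} SortCoordinateData;

/*
 * Analogous renames elsewhere:
 *   IndexInfo.ii_LeaderAsWorker -> ii_LeaderParticipates
 *   BTLeader.nworkertuplesorts  -> nparticipanttuplesorts
 */
```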
There is also the question of whether we want to respect parallel_leader_participation in this context. The issues which might motivate the desire for such behavior in the context of a query do not exist when creating a btree index, so maybe we're just making this complicated. On the other hand, if some other type of parallel index build does end up doing a Gather-like operation then we might regret deciding that parallel_leader_participation doesn't apply to index builds, so maybe it's OK the way we have it. On the third hand, the complexity of having the leader maybe-participate seems like it extends to a fair number of places in the code, and getting rid of all that complexity seems appealing.

One place where this actually causes a problem is the message changes to index_build(). The revised ereport() violates translatability guidelines, which require that messages not be assembled from pieces. See https://www.postgresql.org/docs/devel/static/nls-programmer.html#NLS-GUIDELINES
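To illustrate the guideline with a made-up example -- this is not the patch's actual message, and the variable names are invented:

```c
/* Hypothetical example only; names are invented for illustration. */

/* Problematic: the sentence is assembled from fragments, so translators
 * only ever see the pieces, never the whole message. */
ereport(DEBUG1,
		(errmsg("%s index \"%s\" on table \"%s\"",
				in_parallel ? "building (in parallel)" : "building",
				index_name, table_name)));

/* Translatable: each complete sentence is a single format string. */
if (in_parallel)
	ereport(DEBUG1,
			(errmsg("building index \"%s\" on table \"%s\" in parallel",
					index_name, table_name)));
else
	ereport(DEBUG1,
			(errmsg("building index \"%s\" on table \"%s\"",
					index_name, table_name)));
```

With one complete sentence per errmsg(), each variant lands in the message catalog as a unit that translators can reword and reorder as their language requires.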
A comment added to tuplesort.h says that work_mem should be at least 64KB, but does not give any reason. I think one should be given, at least briefly, so that someone looking at these comments in the future can, for example, figure out whether the comment is still correct after future code changes. Or else, remove the comment.

+ * Parallel sort callers are required to coordinate multiple tuplesort states
+ * in a leader process, and one or more worker processes. The leader process

I think the comma should be removed. As written, it looks like we are coordinating multiple tuplesort states in a leader process, and, separately, we are coordinating one or more worker processes. But in fact we are coordinating multiple tuplesort states which are in a group of processes that includes the leader and one or more worker processes.

Generally, I think the comments in tuplesort.h are excellent. I really like the overview of how the new interfaces should be used, although I find it slightly wonky that the leader needs two separate Tuplesortstates if it wants to participate.

I don't understand why this patch needs to tinker with the tests in vacuum.sql. The comments say that "If we did not do this, errors raised would concern running ANALYZE in parallel mode." However, why should parallel CREATE INDEX have any impact on ANALYZE at all? Also, as a practical matter, if I revert those changes, 'make check' still passes with or without force_parallel_mode=on.

I really dislike the fact that this patch invents another thing for force_parallel_mode to do. I invented force_parallel_mode mostly as a way of testing that functions were correctly labeled for parallel-safety, and I think it would be just fine if it never does anything else. As it is, it already does two quite separate things to accomplish that goal: (1) forcibly runs the whole plan with parallel mode restrictions enabled, provided that the plan is not parallel-unsafe, and (2) runs the plan in a worker, provided that the plan is parallel-safe. There's a subtle difference between those two conditions, which is that not parallel-unsafe does not equal parallel-safe; there is also parallel-restricted. The fact that force_parallel_mode controls two different behaviors has, I think, already caused some confusion for prominent PostgreSQL developers and, likely, users as well. Making it do a third thing seems to me to be adding to the confusion, and not only because there are no documentation changes to match. If we go down this road, there will probably be more additions -- what happens when parallel VACUUM arrives, or parallel CLUSTER, or whatever? I don't think it will be a good thing for PostgreSQL if we end up with force_parallel_mode=on as a general "use parallelism even though it's stupid" flag, requiring supporting code in many different places throughout the code base and a laundry list of not-actually-useful behavior changes in the documentation.

What I think would be a lot more useful, and what I sort of expected the patch to have, is a way for a user to explicitly control the number of workers requested for a CREATE INDEX operation. We all know that the cost model is crude and that may be OK -- though it would be interesting to see some research on what the run times actually look like for various numbers of workers at various table sizes and work_mem settings -- but it will be inconvenient for DBAs who actually know what number of workers they want to use to instead get whatever value plan_create_index_workers() decides to emit. They can force it by setting the parallel_workers reloption, but that affects queries. They can probably also do it by setting min_parallel_table_scan_size = 0 and max_parallel_workers_maintenance to whatever value they want, but I think it would be convenient for there to be a more straightforward way to do it, or at least some documentation in the CREATE INDEX page about how to get the number of workers you really want.

To be clear, I don't think that this is a must-fix issue for this patch to get committed, but I do think that all references to force_parallel_mode=on should go away.

I do not like the way that this patch wants to turn the section of the documentation on when parallel query can be used into a discussion of when parallelism can be used. I think it would be better to leave that section alone and instead document under CREATE INDEX the concerns specific to parallel index build. I think this will be easier for users to understand and far easier to maintain as the number of parallel DDL operations increases, which I expect it to do somewhat explosively. The patch as written says things like "If a utility statement that is expected to do so does not produce a parallel plan, ..." but, one, utility statements *do not produce plans of any type* and, two, the concerns here are really specific to parallel CREATE INDEX and there is every reason to think that they might be different in other cases. I feel strongly that it's enough for this section to try to explain the concerns that pertain to optimizable queries and leave utility commands to be treated elsewhere. If we find that we're accumulating a lot of documentation for various parallel utility commands that seems to be duplicative, we can write a general treatment of that topic that is separate from this one.

The documentation for max_parallel_workers_maintenance cribs from the documentation for max_parallel_workers_per_gather in saying that we'll use fewer workers than expected "which may be inefficient". However, for parallel CREATE INDEX, that trailing clause is, at least as far as I can see, not applicable. For a query, we might choose a Gather over a Parallel Seq Scan because we think we've got a lot of workers; with only one participant, we might prefer a GIN index scan. If it turns out we don't get the workers, we've got a clearly suboptimal plan. For CREATE INDEX, though, it seems to me that we don't make any decisions based on the number of workers we think we'll have. If we get fewer workers, it may be slower, but it should still be as fast as it can be with that number of workers, which for queries is not the case.

+ * These fields are not modified throughout the sort. They primarily
+ * exist for the benefit of worker processes, that need to create BTSpool
+ * state corresponding to that used by the leader.

throughout -> during
remove comma

+ * builds, that must work just the same when an index is built in

remove comma

+ * State that is aggregated by workers, to report back to leader.

State that is maintained by workers and reported back to leader.

+ * indtuples is the total number of tuples that made it into index.

into the index

+ * btleader is only present when a parallel index build is performed, and
+ * only in leader process (actually, only the leader has a BTBuildState.
+ * Workers have their own spool and spool2, though.)

the leader process
period after "process"
capitalize actually

+ * Done. Leave a way for leader to determine we're finished. Record how
+ * many tuples were in this worker's share of the relation.

I don't understand what the "Leave a way" comment means.

+ * To support parallel sort operations involving coordinated callers to
+ * tuplesort.c routines across multiple workers, it is necessary to
+ * concatenate each worker BufFile/tapeset into one single leader-wise
+ * logical tapeset. Workers should have produced one final materialized
+ * tape (their entire output) when this happens in leader; there will always
+ * be the same number of runs as input tapes, and the same number of input
+ * tapes as workers.

I can't interpret the word "leader-wise". A partition-wise join is a join done one partition at a time, but a leader-wise logical tape set is not done one leader at a time. If there's another meaning to the affix -wise, I'm not familiar with it. Don't we just mean "a single logical tapeset managed by the leader"?

There's a lot here I haven't grokked yet, but I'm running out of mental energy so I think I'll send this for now and work on this some more when time permits, hopefully tomorrow.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company