We should decide whether to actually sort in parallel based on the comparator cost and the data size. The system currently has no information on comparator cost: bt*cmp (and indeed almost all built-in functions) all have procost=1, but bttextcmp is at least 1000x slower than btint4cmp. Let's improve the procost estimates of all core B-tree and hash operators. This will have other benefits, but we will need to be cognizant of the risk of upsetting setups that have tuned cpu_operator_cost based on the present situation.
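To make the stakes concrete, here is a small back-of-the-envelope sketch (my own arithmetic, not the actual costsize.c formula) of how a sort's estimated CPU cost shifts once the comparator's procost is modelled rather than assumed to be 1:

```python
import math

CPU_OPERATOR_COST = 0.0025  # the postgresql.conf default

def sort_cpu_cost(ntuples, procost):
    """Rough comparison-driven CPU cost of sorting ntuples rows.

    Assumes ~n*log2(n) comparisons, each charged procost units of
    cpu_operator_cost; an illustration only, not PostgreSQL's formula.
    """
    comparisons = ntuples * math.log2(ntuples)
    return comparisons * procost * CPU_OPERATOR_COST

n = 1_000_000
print(sort_cpu_cost(n, 1))     # today's uniform procost=1 assumption
print(sort_cpu_cost(n, 1000))  # a text comparator modelled as ~1000x costlier
```

The three-orders-of-magnitude swing between those two estimates is also why setups that have tuned cpu_operator_cost around the current uniform procost would notice the change.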
The choice of whether to parallelize can probably be made in a manner similar to the choice to do an external sort: the planner guesses the outcome for costing purposes, but the actual decision is made at execution time. The planner would determine a tuple-count cutoff at which parallelism becomes favorable, and tuplesort would check the actual tuple count against that cutoff to make its decision.
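The planner/executor split described above might look like the following sketch (the function names and the cost model are illustrative assumptions, not PostgreSQL APIs): the planner derives a tuple-count cutoff where parallelism is expected to win, and the execution-time code simply compares the actual tuple count against it, much as tuplesort today switches to an external sort only when work_mem is actually exceeded.

```python
import math

def parallel_sort_cutoff(cmp_cost, startup_cost, nworkers):
    """Smallest power-of-two n where startup + serial/nworkers < serial,
    assuming serial cost ~ cmp_cost * n * log2(n). Planner-side guess."""
    n = 2
    while n < 10**9:
        serial = cmp_cost * n * math.log2(n)
        parallel = startup_cost + serial / nworkers
        if parallel < serial:
            return n
        n *= 2
    return None  # parallelism never pays off for this comparator

def choose_parallel_at_exec(actual_ntuples, cutoff):
    # Execution-time decision: the cheap check tuplesort would perform
    # once it knows how many tuples it actually received.
    return cutoff is not None and actual_ntuples >= cutoff

cutoff = parallel_sort_cutoff(cmp_cost=1.0, startup_cost=100000.0, nworkers=8)
print(cutoff, choose_parallel_at_exec(1_000_000, cutoff))
```

Note how a costlier comparator (a larger cmp_cost) drives the cutoff down, which is exactly why the procost work above feeds into this decision.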
This overlaps with a problem I have been thinking about: off-loading CPU-bound workloads, which I partially tried to implement as part of the writable foreign table feature.
Beyond sorting, I think it would be worthwhile to have the capability to push heavy workloads (such as sorts, aggregates, or complex target lists) out to external computing resources.
However, I doubt whether the decision to parallelize should be made at
execution time rather than at the planning stage. For example, when we
have enough records and a 10-core multiprocessor, the best plan may
load the data in parallel across 10 processors, partially sort it on each
processor individually, and then merge-sort the results. That requires a
fundamentally different tree structure from the traditional
single-processor plan tree. So it seems to me we should add an
enhancement that allows special-purpose parallel-processing plan nodes
to be injected into the plan tree.
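The plan shape described above can be sketched as follows (an illustration only, not a PostgreSQL plan node; threads stand in for the per-processor workers purely to keep the example self-contained, where a real implementation would use separate backends or processes):

```python
import heapq
import random
from concurrent.futures import ThreadPoolExecutor

def parallel_partial_sort(data, nworkers=10):
    """Partition the input across workers, sort each partition
    independently, then merge the sorted runs."""
    chunk = max(1, (len(data) + nworkers - 1) // nworkers)
    parts = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=nworkers) as pool:
        runs = list(pool.map(sorted, parts))  # partial sort per worker
    return list(heapq.merge(*runs))           # final merge-sort step

data = [random.randrange(1000) for _ in range(10_000)]
assert parallel_partial_sort(data) == sorted(data)
```

The point is structural: the "partition, sort per worker, merge" shape has a fan-out/fan-in topology that a traditional single-processor plan tree cannot express, which is the motivation for injectable parallel plan nodes.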