Re: Trouble with hashagg spill I/O pattern and costing - Mailing list pgsql-hackers
From:            Tomas Vondra
Subject:         Re: Trouble with hashagg spill I/O pattern and costing
Date:
Msg-id:          20200525121722.6wzcelweefyuwg54@development
In response to:  Re: Trouble with hashagg spill I/O pattern and costing  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses:       Re: Trouble with hashagg spill I/O pattern and costing
List:            pgsql-hackers
On Mon, May 25, 2020 at 04:10:45AM +0200, Tomas Vondra wrote:
>
> ...
>
>parallel queries
>================
>
>And now the fun begins ...
>
>
>1) small one (SSD, max_parallel_workers_per_gather = 2)
>
>  algorithm   master   tlist   prealloc   prealloc+tlist
>  --------------------------------------------------
>  hash           693     390        177              128
>  sort           103      99        101               99
>
>This looks pretty nice - the patches have the expected effect, it got
>faster than with just a single CPU etc.
>
>
>2) big one (SATA, max_parallel_workers_per_gather = 16)
>
>  algorithm   master   tlist   prealloc   prealloc+tlist
>  --------------------------------------------------
>  hash             ?   25000          ?             3132
>  sort           248     234        216              200
>
>Well, not that nice :-( The hash queries take so much time that I've
>decided not to wait for them and the two numbers are actually just
>estimates (after processing just a couple of logical tapes).
>
>Plus it actually gets slower than with serial execution, so what's the
>problem here? Especially considering it worked OK on the small machine?
>
>At first I thought it's something about SSD vs. SATA, but it seems to be
>more about how we construct the plans, because the plans between the two
>machines are very different. And it seems to depend on the number of
>workers per gather - for a low number of workers the plan looks like this
>(the plans are attached in plans.txt in case the formatting gets broken
>by your client):
>
>
>                                                   QUERY PLAN
>  ---------------------------------------------------------------------------------------------------------------
>   Limit
>     ->  Aggregate
>           ->  Hash Join
>                 Hash Cond: (part.p_partkey = lineitem_1.l_partkey)
>                 Join Filter: (lineitem.l_quantity < ((0.2 * avg(lineitem_1.l_quantity))))
>                 ->  Gather
>                       Workers Planned: 2
>                       ->  Nested Loop
>                             ->  Parallel Seq Scan on part
>                                   Filter: ((p_brand = 'Brand#22'::bpchar) AND (p_container = 'LG BOX'::bpchar))
>                             ->  Index Scan using idx_lineitem_part_supp on lineitem
>                                   Index Cond: (l_partkey = part.p_partkey)
>                 ->  Hash
>                       ->  Finalize HashAggregate
>                             Group Key: lineitem_1.l_partkey
>                             ->  Gather
>                                   Workers Planned: 2
>                                   ->  Partial HashAggregate
>                                         Group Key: lineitem_1.l_partkey
>                                         ->  Parallel Seq Scan on lineitem lineitem_1
>  (20 rows)
>
>but then if I crank the number of workers up, it switches to this:
>
>                                                      QUERY PLAN
>  ---------------------------------------------------------------------------------------------------------------------
>   Limit
>     ->  Finalize Aggregate
>           ->  Gather
>                 Workers Planned: 5
>                 ->  Partial Aggregate
>                       ->  Nested Loop
>                             Join Filter: (part.p_partkey = lineitem.l_partkey)
>                             ->  Hash Join
>                                   Hash Cond: (part.p_partkey = lineitem_1.l_partkey)
>                                   ->  Parallel Seq Scan on part
>                                         Filter: ((p_brand = 'Brand#22'::bpchar) AND (p_container = 'LG BOX'::bpchar))
>                                   ->  Hash
>                                         ->  HashAggregate
>                                               Group Key: lineitem_1.l_partkey
>                                               ->  Seq Scan on lineitem lineitem_1
>                             ->  Index Scan using idx_lineitem_part_supp on lineitem
>                                   Index Cond: (l_partkey = lineitem_1.l_partkey)
>                                   Filter: (l_quantity < ((0.2 * avg(lineitem_1.l_quantity))))
>  (18 rows)
>
>
>Notice that in the first plan, the hashagg is on top of a parallel-aware
>path - so each worker builds a hashagg only on a subset of data, and also
>spills only a fraction of the input rows (so that all workers combined
>spill roughly the "whole" table).
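(The plans above are the TPC-H Q17 shape. As a minimal reproduction sketch: the query text below is reconstructed from the plan nodes, not taken from the benchmark scripts, and it assumes a populated TPC-H schema plus the idx_lineitem_part_supp index visible in the plans; the GUC values are illustrative only.)

  -- Sketch: reconstructed TPC-H Q17; assumes TPC-H data is loaded and
  -- idx_lineitem_part_supp exists, as in the plans above.
  SET work_mem = '4MB';                      -- small enough to make hashagg spill
  SET max_parallel_workers_per_gather = 2;   -- try e.g. 2 vs. 16 to see the plan switch

  EXPLAIN
  SELECT sum(l_extendedprice) / 7.0 AS avg_yearly
    FROM lineitem, part
   WHERE p_partkey = l_partkey
     AND p_brand = 'Brand#22'
     AND p_container = 'LG BOX'
     AND l_quantity < (SELECT 0.2 * avg(l_quantity)
                         FROM lineitem
                        WHERE l_partkey = p_partkey);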
OK, I've done an experiment and re-ran the test with
max_parallel_workers_per_gather = 5, which is the highest value still
giving the "good" plan, and the results look like this:

            master   tlist   prealloc   prealloc+tlist
  ----------------------------------------------------
  hash      10535     1044       1723              407
  sort        198      196        192              219

which is obviously *way* better than the numbers with more workers:

>  algorithm   master   tlist   prealloc   prealloc+tlist
>  --------------------------------------------------
>  hash             ?   25000          ?             3132
>  sort           248     234        216              200

It's still ~2x slower than the sort, so presumably we'll need to tweak
the costing somehow.

I do believe this is still due to differences in I/O patterns, with
parallel hashagg probably being a bit more random (I'm deducing that
from the SSD not being affected by this). I'd imagine this is because,
given the same work_mem value, sort tends to create "sorted chunks"
that are then merged into larger runs, making the I/O more sequential.
OTOH hashagg likely becomes more random with smaller work_mem values -
more batches make it more interleaved / random.

This does not explain why we end up with the "bad" plans, though.

Attached are two files showing how the plan changes for different
numbers of workers per gather, both for groupagg and hashagg. For
groupagg the plan shape does not change at all; for hashagg it starts
as "good" and then between 5 and 6 workers switches to the "bad" one.

There's another interesting thing I just noticed - as we increase the
number of workers, the cost estimate actually starts to grow at some
point:

    workers | plan cost
   ---------|-----------
         0  |  23594267
         1  |  20155545
         2  |  19785306
         5  |  22718176   <-
         6  |  23063639
        10  |  22990594
        12  |  22972363

AFAIK this happens because we pick the number of workers simply based
on the size of the input relation, which ignores the cost of sending
data from workers to the leader (parallel_tuple_cost). Which in this
case is quite significant, because each worker produces a large number
of groups. I don't think this is causing the issue, though, because the
sort plans behave the same way. (I wonder if we could/should consider
different numbers of workers, somehow.)

We probably can't see these plans on PostgreSQL 12 simply because
hashagg would need more memory than work_mem (especially in parallel
mode), so we simply reject them.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
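A minimal sketch of the two existing knobs most relevant to the worker-count
point above (both are standard PostgreSQL settings; the values are
illustrative, not recommendations):

  -- Per-relation override of the size-based worker heuristic:
  ALTER TABLE lineitem SET (parallel_workers = 5);

  -- Charge more for every row a worker sends back to the leader, so plans
  -- that funnel many groups through a Gather are costed correspondingly higher:
  SET parallel_tuple_cost = 0.5;   -- default is 0.1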
Attachment