Re: Trouble with hashagg spill I/O pattern and costing - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Trouble with hashagg spill I/O pattern and costing
Date
Msg-id 20200525121722.6wzcelweefyuwg54@development
In response to Re: Trouble with hashagg spill I/O pattern and costing  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
List pgsql-hackers
On Mon, May 25, 2020 at 04:10:45AM +0200, Tomas Vondra wrote:
>
> ...
>
>parallel queries
>================
>
>And now the fun begins ...
>
>
>1) small one (SSD, max_parallel_workers_per_gather = 2)
>
>    algorithm  master  tlist  prealloc  prealloc+tlist
>    --------------------------------------------------
>         hash   693      390       177             128
>         sort   103       99       101              99
>
>This looks pretty nice - the patches have the expected effect, it got
>faster than with just a single CPU etc.
>
>
>2) big one (SATA, max_parallel_workers_per_gather = 16)
>
>    algorithm  master  tlist  prealloc  prealloc+tlist
>    --------------------------------------------------
>         hash       ?  25000         ?            3132
>         sort     248    234       216             200
>
>Well, not that nice :-( The hash queries take so much time that I've
>decided not to wait for them and the two numbers are actually just
>estimates (after processing just a couple of logical tapes).
>
>Plus it actually gets slower than with serial execution, so what's the
>problem here? Especially considering it worked OK on the small machine?
>
>At first I thought it's something about SSD vs. SATA, but it seems to be
>more about how we construct the plans, because the plans between the two
>machines are very different. And it seems to depend on the number of
>workers per gather - for low number of workers the plan looks like this
>(the plans are attached in plans.txt in case the formatting gets broken
>by your client):
>
>
>                                                      QUERY PLAN
>    ---------------------------------------------------------------------------------------------------------------
>     Limit
>       ->  Aggregate
>             ->  Hash Join
>                   Hash Cond: (part.p_partkey = lineitem_1.l_partkey)
>                   Join Filter: (lineitem.l_quantity < ((0.2 * avg(lineitem_1.l_quantity))))
>                   ->  Gather
>                         Workers Planned: 2
>                         ->  Nested Loop
>                               ->  Parallel Seq Scan on part
>                                     Filter: ((p_brand = 'Brand#22'::bpchar) AND (p_container = 'LG BOX'::bpchar))
>                               ->  Index Scan using idx_lineitem_part_supp on lineitem
>                                     Index Cond: (l_partkey = part.p_partkey)
>                   ->  Hash
>                         ->  Finalize HashAggregate
>                               Group Key: lineitem_1.l_partkey
>                               ->  Gather
>                                     Workers Planned: 2
>                                     ->  Partial HashAggregate
>                                           Group Key: lineitem_1.l_partkey
>                                           ->  Parallel Seq Scan on lineitem lineitem_1
>    (20 rows)
>
>but then if I crank the number of workers up, it switches to this:
>
>                                                         QUERY PLAN
>    ---------------------------------------------------------------------------------------------------------------------
>     Limit
>       ->  Finalize Aggregate
>             ->  Gather
>                   Workers Planned: 5
>                   ->  Partial Aggregate
>                         ->  Nested Loop
>                               Join Filter: (part.p_partkey = lineitem.l_partkey)
>                               ->  Hash Join
>                                     Hash Cond: (part.p_partkey = lineitem_1.l_partkey)
>                                     ->  Parallel Seq Scan on part
>                                           Filter: ((p_brand = 'Brand#22'::bpchar) AND (p_container = 'LG BOX'::bpchar))
>                                     ->  Hash
>                                           ->  HashAggregate
>                                                 Group Key: lineitem_1.l_partkey
>                                                 ->  Seq Scan on lineitem lineitem_1
>                               ->  Index Scan using idx_lineitem_part_supp on lineitem
>                                     Index Cond: (l_partkey = lineitem_1.l_partkey)
>                                     Filter: (l_quantity < ((0.2 * avg(lineitem_1.l_quantity))))
>    (18 rows)
>
>
>Notice that in the first plan, the hashagg is on top of a parallel-aware
>path - so each worker builds a hashagg only on a subset of data, and also
>spills only a fraction of the input rows (so that all workers combined
>spill roughly the "whole" table).
>
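For illustration, the partial/finalize split works roughly like this (a
toy sketch in Python with made-up data, not the actual executor code -
it just shows why each worker only holds/spills its own slice):

```python
from collections import Counter

def partial_hashagg(rows):
    # Each worker builds a hash table over only its slice of the input,
    # so per-worker memory (and hence spill volume) covers a fraction of
    # the table; combined, the workers spill roughly the whole input.
    counts = Counter()
    for key in rows:
        counts[key] += 1
    return counts

def finalize_hashagg(partials):
    # The leader merges the per-worker partial aggregates by group key.
    total = Counter()
    for p in partials:
        total.update(p)
    return total

rows = [1, 2, 1, 3, 2, 1, 3, 3, 2]
workers = [rows[0::3], rows[1::3], rows[2::3]]  # simulate 3 workers
result = finalize_hashagg(partial_hashagg(w) for w in workers)
# result now holds the same counts a serial hashagg would produce
```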

OK, I've done an experiment and re-ran the test with

     max_parallel_workers_per_gather = 5

which is the highest value still giving the "good" plan, and the results
look like this:

            master    tlist    prealloc    prealloc+tlist
     ----------------------------------------------------
     hash    10535     1044        1723               407
     sort      198      196         192               219

which is obviously *way* better than the numbers with more workers:

>    algorithm  master  tlist  prealloc  prealloc+tlist
>    --------------------------------------------------
>         hash       ?  25000         ?            3132
>         sort     248    234       216             200

It's still ~2x slower than the sort, so presumably we'll need to tweak
the costing somehow. I do believe this is still due to differences in I/O
patterns, with parallel hashagg probably being a bit more random (I'm
deducing that from the SSD machine not being affected by this).

I'd imagine this is because, given the same work_mem value, sort tends to
create sorted runs that are written (and later merged) sequentially,
while hashagg partitions the input into batches - and with smaller
work_mem values there are more batches, so the writes to the different
batch tapes get more interleaved / random.
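To illustrate the intuition (a toy model of write ordering only, not
what the logical tape code actually does): if we label each spilled
block with the run/partition it belongs to, sort produces long stretches
on the same tape while hashagg keeps cycling through its partitions:

```python
import random

def seq_fraction(tape_ids):
    # Fraction of consecutive writes landing on the same tape - a crude
    # proxy for how sequential the I/O pattern looks to the disk.
    same = sum(1 for a, b in zip(tape_ids, tape_ids[1:]) if a == b)
    return same / (len(tape_ids) - 1)

n_writes = 8000

# Sort: fills one work_mem-sized run at a time, so long stretches of
# writes hit the same run.
run_len = 1000
sort_tapes = [i // run_len for i in range(n_writes)]

# Hashagg: rows hash roughly uniformly across batches, so successive
# writes tend to land on different partition tapes.
random.seed(42)                 # fixed seed, just for reproducibility
n_partitions = 32
hash_tapes = [random.randrange(n_partitions) for _ in range(n_writes)]

print(seq_fraction(sort_tapes))   # close to 1.0
print(seq_fraction(hash_tapes))   # close to 1/32
```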


This does not explain why we end up with the "bad" plans, though.

Attached are two files showing how the plan changes for different number
of workers per gather, both for groupagg and hashagg. For groupagg the
plan shape does not change at all, for hashagg it starts as "good" and
then between 5 and 6 switches to "bad" one.

There's another interesting thing I just noticed - as we increase the
number of workers, the cost estimate actually starts to grow at some
point:

    workers | plan cost
    --------+----------
          0 |  23594267
          1 |  20155545
          2 |  19785306
          5 |  22718176  <-
          6 |  23063639
         10 |  22990594
         12 |  22972363

AFAIK this happens because we pick the number of workers simply based on
the size of the input relation, which ignores the cost of sending data
from the workers to the leader (parallel_tuple_cost).  Which in this case
is quite significant, because each worker produces a large number of
groups. I don't think this is causing the issue, though, because the sort
plans behave the same way. (I wonder if we could/should consider a
different number of workers, somehow.)
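The shape of that cost curve is easy to reproduce with a toy model (the
constants below are made up - they merely play the roles of the scan
cost and parallel_tuple_cost, this is not the planner's actual
arithmetic):

```python
def plan_cost(workers, table_rows=60_000_000, n_groups=200_000,
              scan_cost_per_row=0.01, parallel_tuple_cost=0.1):
    # Per-worker share of the scan shrinks as workers are added ...
    per_worker_scan = table_rows * scan_cost_per_row / workers
    # ... but with many groups nearly every worker sees nearly every
    # group, so the partial-aggregate tuples crossing the Gather (and
    # thus the parallel_tuple_cost term) scale with the worker count.
    gather_cost = workers * n_groups * parallel_tuple_cost
    return per_worker_scan + gather_cost

costs = {w: plan_cost(w) for w in range(1, 13)}
# The total first falls, bottoms out, and then grows again - the same
# non-monotonic behavior as the estimates above.
```

With these particular constants the minimum is around 5-6 workers, after
which the Gather term dominates.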

We probably can't see these plans on 12 simply because hashagg would
need more memory than work_mem (especially in parallel mode), so we
simply reject them.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

