Re: Trouble with hashagg spill I/O pattern and costing - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Trouble with hashagg spill I/O pattern and costing
Date
Msg-id 20200525021045.dilgcsmgiu4l5jpa@development
Whole thread Raw
In response to Re: Trouble with hashagg spill I/O pattern and costing  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses Re: Trouble with hashagg spill I/O pattern and costing
Re: Trouble with hashagg spill I/O pattern and costing
List pgsql-hackers
On Thu, May 21, 2020 at 11:41:22PM +0200, Tomas Vondra wrote:
>On Thu, May 21, 2020 at 02:16:37PM -0700, Jeff Davis wrote:
>>On Thu, 2020-05-21 at 21:13 +0200, Tomas Vondra wrote:
>>>2) We could make it self-tuning, by increasing the number of blocks
>>>we pre-allocate. So every time we exhaust the range, we double the
>>>number of blocks (with a reasonable maximum, like 1024 or so). Or we
>>>might just increment it by 32, or something.
>>
>>Attached a new version that uses the doubling behavior, and cleans it
>>up a bit. It also returns the unused prealloc blocks back to lts-
>>freeBlocks when the tape is rewound for reading.
>>
>
>Ah, the returning is a nice idea, that should limit the overhead quite a
>bit, I think.
>
>>>IIUC the danger of pre-allocating blocks is that we might not fill
>>>them,
>>>resulting in temp file much larger than necessary. It might be
>>>harmless
>>>on some (most?) current filesystems that don't actually allocate
>>>space
>>>for blocks that are never written, but it also confuses our
>>>accounting
>>>of temporary file sizes. So we should try to limit that, and growing
>>>the
>>>number of pre-allocated blocks over time seems reasonable.
>>
>>There's another danger here: it doesn't matter how well the filesystem
>>deals with sparse writes, because ltsWriteBlock fills in the holes with
>>zeros anyway. That's potentially a significant amount of wasted IO
>>effort if we aren't careful.
>>
>
>True. I'll give it a try on both machines and report some numbers. Might
>take a couple of days.
>

OK, so I do have some numbers to share. I think there's a clear
conclusion that the two patches are a huge improvement, but there's also
something fishy about planning of parallel queries.

Firstly, I have two machines that I used for testing:

1) small one: i5-2500k (4 cores), 8GB RAM, SSD RAID for data, SSD for
temporary tablespace, using TPC-H 32GB data set

2) big one: 2x xeon e5-2620v3 (8 cores), 64GB RAM, NVME SSD for data,
temporary tablespace on SATA RAID0 (3 x 7.2k), using TPC-H 75GB


serial queries (no parallelism)
===============================

Results with parallel query disabled on the two machines look like this:

1) small one (SSD)

     algorithm  master  prealloc  tlist  prealloc-tlist
     --------------------------------------------------
          hash    1365       437    368             213
          sort     226       214    224             215

The sort row simply means "enable_hashagg = off" and AFAIK the patches
should not have a lot of influence here - the prealloc does, but it's
fairly negligible.

It's not always exactly on part, I've seen cases where hash or sort were
a bit faster (probably depending on work_mem), but I think we can ignore
that for now.


2) big one (SATA)

     algorithm  master  tlist  prealloc  prealloc+tlist
     --------------------------------------------------
          hash   25534   5120      2402             540
          sort     460    460       465             485

The effect is even more pronounced, thanks to poor handling of random
I/O by the SATA RAID device. It's not exactly on par with sort, but it's
close enough ...


parallel queries
================

And now the fun begins ...


1) small one (SSD, max_parallel_workers_per_gather = 2)

     algorithm  master  tlist  prealloc  prealloc+tlist
     --------------------------------------------------
          hash   693      390       177             128
          sort   103       99       101              99

This looks pretty nice - the patches have the expected effect, it got
faster than with just a single CPU etc.


2) big one (SATA, max_parallel_workers_per_gather = 16)

     algorithm  master  tlist  prealloc  prealloc+tlist
     --------------------------------------------------
          hash       ?  25000         ?            3132
          sort     248    234       216             200

Well, not that nice :-( The hash queries take so much time that I've
decided not to wait for them and the two numbers are actually just
estimates (after processing just a couple of logical tapes).

Plus it actually gets slower than with serial execution, so what's the
problem here? Especially considering it worked OK on the small machine?

At first I thought it's something about SSD vs. SATA, but it seems to be
more about how we construct the plans, because the plans between the two
machines are very different. And it seems to be depend by the number of
workers per gather - for low number of workers the plan looks like this
(the plans are attached in plans.txt in case the formatting gets broken
by your client):


                                                       QUERY PLAN
     ---------------------------------------------------------------------------------------------------------------
      Limit
        ->  Aggregate
              ->  Hash Join
                    Hash Cond: (part.p_partkey = lineitem_1.l_partkey)
                    Join Filter: (lineitem.l_quantity < ((0.2 * avg(lineitem_1.l_quantity))))
                    ->  Gather
                          Workers Planned: 2
                          ->  Nested Loop
                                ->  Parallel Seq Scan on part
                                      Filter: ((p_brand = 'Brand#22'::bpchar) AND (p_container = 'LG BOX'::bpchar))
                                ->  Index Scan using idx_lineitem_part_supp on lineitem
                                      Index Cond: (l_partkey = part.p_partkey)
                    ->  Hash
                          ->  Finalize HashAggregate
                                Group Key: lineitem_1.l_partkey
                                ->  Gather
                                      Workers Planned: 2
                                      ->  Partial HashAggregate
                                            Group Key: lineitem_1.l_partkey
                                            ->  Parallel Seq Scan on lineitem lineitem_1
     (20 rows)

but then if I crank the number of workers up, it switches to this:

                                                          QUERY PLAN

---------------------------------------------------------------------------------------------------------------------
      Limit
        ->  Finalize Aggregate
              ->  Gather
                    Workers Planned: 5
                    ->  Partial Aggregate
                          ->  Nested Loop
                                Join Filter: (part.p_partkey = lineitem.l_partkey)
                                ->  Hash Join
                                      Hash Cond: (part.p_partkey = lineitem_1.l_partkey)
                                      ->  Parallel Seq Scan on part
                                            Filter: ((p_brand = 'Brand#22'::bpchar) AND (p_container = 'LG
BOX'::bpchar))
                                      ->  Hash
                                            ->  HashAggregate
                                                  Group Key: lineitem_1.l_partkey
                                                  ->  Seq Scan on lineitem lineitem_1
                                ->  Index Scan using idx_lineitem_part_supp on lineitem
                                      Index Cond: (l_partkey = lineitem_1.l_partkey)
                                      Filter: (l_quantity < ((0.2 * avg(lineitem_1.l_quantity))))
     (18 rows)


Notice that in the first plan, the hashagg is on top of parallel-aware
path - so each workers builds hashagg only on a subset of data, and also
spills only a fraction of the input rows (so that all workers combined
spill rouhly the "whole" table).

In the second plan, the hashagg is on the non-partitioned side of the
join, so each workers builds a hash aggregate on the *whole* set of
input rows. Which means that (a) we need much more disk space for temp
files, making it unlikely to fit into page cache and (b) there's a lot
of contention for I/O, making it much more random.

Now, I haven't seen the second plan with sort-based aggregation, no
matter how I set the number of workers it always looks like this:

                                                          QUERY PLAN

---------------------------------------------------------------------------------------------------------------------
      Limit
        ->  Aggregate
              ->  Merge Join
                    Merge Cond: (lineitem_1.l_partkey = part.p_partkey)
                    Join Filter: (lineitem.l_quantity < ((0.2 * avg(lineitem_1.l_quantity))))
                    ->  Finalize GroupAggregate
                          Group Key: lineitem_1.l_partkey
                          ->  Gather Merge
                                Workers Planned: 8
                                ->  Partial GroupAggregate
                                      Group Key: lineitem_1.l_partkey
                                      ->  Sort
                                            Sort Key: lineitem_1.l_partkey
                                            ->  Parallel Seq Scan on lineitem lineitem_1
                    ->  Materialize
                          ->  Gather Merge
                                Workers Planned: 6
                                ->  Nested Loop
                                      ->  Parallel Index Scan using part_pkey on part
                                            Filter: ((p_brand = 'Brand#22'::bpchar) AND (p_container = 'LG
BOX'::bpchar))
                                      ->  Index Scan using idx_lineitem_part_supp on lineitem
                                            Index Cond: (l_partkey = part.p_partkey)
     (22 rows)

How come we don't have the same issue here? Is there something in the
optimizer that prevents us from creating the "silly" plans with
groupagg, and we should do the same thing for hashagg?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

pgsql-hackers by date:

Previous
From: Ian Barwick
Date:
Subject: pg13 docs: minor fix for "System views" list
Next
From: Kyotaro Horiguchi
Date:
Subject: Re: WAL reader APIs and WAL segment open/close callbacks