Thread: Disk-based hash aggregate's cost model
We have a Postgres 13 open item for Disk-based hash aggregate, which is
the only non-trivial open item. There is a general concern about how
often we get disk-based hash aggregation when work_mem is particularly
low, and recursion seems unavoidable. This is generally thought to be a
problem in the costing.

Tomas Vondra did some testing of the patch which led to discussion of
this on the hash agg GUC megathread, here:

https://www.postgresql.org/message-id/20200724012248.y77rpqc73agrsvb3@development

I am starting this new thread to deal with the open item. Any progress
on this, Jeff? Please correct or expand on my summary of the open item
if I got something wrong.

Thanks
--
Peter Geoghegan
On Thu, 2020-08-27 at 17:28 -0700, Peter Geoghegan wrote:
> We have a Postgres 13 open item for Disk-based hash aggregate, which
> is the only non-trivial open item. There is a general concern about
> how often we get disk-based hash aggregation when work_mem is
> particularly low, and recursion seems unavoidable. This is generally
> thought to be a problem in the costing.

We discussed two approaches to tweaking the cost model:

1. Penalize HashAgg disk costs by a constant amount. It seems to be
chosen a little too often, and a constant penalty would reduce the
number of plan changes.

2. Treat recursive HashAgg spilling skeptically, and heavily penalize
recursive spilling.

The problem with approach #2 is that we have a default hash mem of 4MB,
and real systems have a lot more than that. In this scenario, recursive
spilling can beat Sort by a lot. For instance:

Data:

  create table text10m(t text collate "C.UTF-8", i int, n numeric);
  insert into text10m
    select s.g::text, s.g, s.g::numeric
    from (select (random()*1000000000)::int as g
          from generate_series(1,10000000)) s;
  vacuum (freeze,analyze) text10m;

Query: explain analyze select distinct t from text10m;

  HashAgg:       10.5s
  Sort+Distinct: 46s

I'm inclined toward option #1 for simplicity unless you feel strongly
about option #2. Specifically, I was thinking of a 1.5X penalty for
HashAgg disk costs.

Regards,
	Jeff Davis
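In cost_agg() terms, option #1 amounts to something like the following
sketch (the variable names are assumptions for illustration, not the
actual patch):

  /*
   * Option #1 sketch: scale only the disk component of a spilling
   * HashAgg's cost by a constant penalty factor. pages_read and
   * pages_written stand in for whatever page estimates cost_agg
   * already computes for the spilling case.
   */
  startup_cost += 1.5 * (pages_read + pages_written) * seq_page_cost;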
On Fri, Aug 28, 2020 at 06:32:38PM -0700, Jeff Davis wrote:
>On Thu, 2020-08-27 at 17:28 -0700, Peter Geoghegan wrote:
>> We have a Postgres 13 open item for Disk-based hash aggregate, which
>> is the only non-trivial open item. There is a general concern about
>> how often we get disk-based hash aggregation when work_mem is
>> particularly low, and recursion seems unavoidable. This is generally
>> thought to be a problem in the costing.
>
>We discussed two approaches to tweaking the cost model:
>
>1. Penalize HashAgg disk costs by a constant amount. It seems to be
>chosen a little too often, and a constant penalty would reduce the
>number of plan changes.
>
>2. Treat recursive HashAgg spilling skeptically, and heavily penalize
>recursive spilling.
>
>The problem with approach #2 is that we have a default hash mem of 4MB,
>and real systems have a lot more than that. In this scenario, recursive
>spilling can beat Sort by a lot.
>

I think the main issue is that we're mostly speculating about what's
wrong. I've shared some measurements and symptoms, and we've discussed
what might be causing it, but I don't think we really know for sure.

I really dislike (1) because it seems like a "we don't know what's
wrong, so we'll penalize hashagg" kind of solution. A much more
principled solution would be to tweak the costing accordingly, not just
add some constant. For (2), it really depends on whether recursive
spilling is actually the problem here. In the examples I shared, the
number of partitions/batches was very different, but the query duration
was mostly independent of it (almost constant).

FWIW I still haven't seen any explanation of why the current code
spills more data than the CP_SMALL_TLIST patch (which was reverted).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Aug 30, 2020 at 02:26:20AM +0200, Tomas Vondra wrote:
>On Fri, Aug 28, 2020 at 06:32:38PM -0700, Jeff Davis wrote:
>>On Thu, 2020-08-27 at 17:28 -0700, Peter Geoghegan wrote:
>>>We have a Postgres 13 open item for Disk-based hash aggregate, which
>>>is the only non-trivial open item. There is a general concern about
>>>how often we get disk-based hash aggregation when work_mem is
>>>particularly low, and recursion seems unavoidable. This is generally
>>>thought to be a problem in the costing.
>>
>>We discussed two approaches to tweaking the cost model:
>>
>>1. Penalize HashAgg disk costs by a constant amount. It seems to be
>>chosen a little too often, and a constant penalty would reduce the
>>number of plan changes.
>>
>>2. Treat recursive HashAgg spilling skeptically, and heavily penalize
>>recursive spilling.
>>
>>The problem with approach #2 is that we have a default hash mem of 4MB,
>>and real systems have a lot more than that. In this scenario, recursive
>>spilling can beat Sort by a lot.
>>
>
>I think the main issue is that we're mostly speculating about what's
>wrong. I've shared some measurements and symptoms, and we've discussed
>what might be causing it, but I don't think we really know for sure.
>
>I really dislike (1) because it seems like a "we don't know what's
>wrong, so we'll penalize hashagg" kind of solution. A much more
>principled solution would be to tweak the costing accordingly, not just
>add some constant. For (2), it really depends on whether recursive
>spilling is actually the problem here. In the examples I shared, the
>number of partitions/batches was very different, but the query duration
>was mostly independent of it (almost constant).
>

I've decided to look at the costing a bit more closely today, to see
why the costing is so much lower compared to sort/groupagg. I've used
the same 32GB dataset and query as in [1].

I've repeated the tests for all the work_mem values, and I see the
number of batches is much lower now, probably thanks to the HLL
improvement:

  2MB    Planned: 64     Batches (old): 4160    Batches: 2977
  4MB    Planned: 128    Batches (old): 16512   Batches: 1569
  8MB    Planned: 256    Batches (old): 21488   Batches: 1289
  64MB   Planned: 32     Batches (old): 2720    Batches: 165
  256MB  Planned: 8      Batches (old): 8       Batches: 41

The impact on duration of the queries seems pretty negligible, though.
The plans with work_mem=64MB look like this:

1) hashagg

                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=11267293.86..11267293.86 rows=1 width=36) (actual time=213127.515..213127.517 rows=0 loops=1)
   ->  HashAggregate  (cost=10229061.10..11267293.86 rows=6718533 width=36) (actual time=100862.623..212948.642 rows=6400000 loops=1)
         Group Key: lineitem.l_partkey
         Planned Partitions: 32  Batches: 165  Memory Usage: 67857kB  Disk Usage: 6802432kB
         ->  Seq Scan on lineitem  (cost=0.00..5519288.36 rows=191990736 width=9) (actual time=0.418..20344.631 rows=192000551 loops=1)
 Planning Time: 0.064 ms
 Execution Time: 213441.986 ms
(7 rows)

2) groupagg

                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=36755617.81..36755617.81 rows=1 width=36) (actual time=180029.594..180029.595 rows=0 loops=1)
   ->  GroupAggregate  (cost=35214909.30..36755617.81 rows=6718533 width=36) (actual time=94017.788..179857.683 rows=6400000 loops=1)
         Group Key: lineitem.l_partkey
         ->  Sort  (cost=35214909.30..35694886.14 rows=191990736 width=9) (actual time=94017.750..151138.727 rows=192000551 loops=1)
               Sort Key: lineitem.l_partkey
               Sort Method: external merge  Disk: 3742208kB
               ->  Seq Scan on lineitem  (cost=0.00..5519288.36 rows=191990736 width=9) (actual time=0.008..26831.264 rows=192000551 loops=1)
 Planning Time: 0.063 ms
 Execution Time: 180209.580 ms
(9 rows)

I still don't understand why the hashagg is costed so low compared to
the sort (only about 1/3 of it), because both cases use exactly the
same estimates internally. cost_tuplesort ends up with

  npages = 937455
  nruns = 114.435396
  input_bytes = 7679629440
  log_runs = 1.0

while cost_agg uses

  pages_read = 937455
  pages_written = 937455
  relation_size = 7679629440

yet we end up with a much lower estimate for hashagg. It does seem to
me, however, that this is mostly due to non-I/O costs, considered by
cost_tuplesort and perhaps ignored by cost_agg. In particular, most of
the sort cost comes from this:

  *startup_cost = comparison_cost * tuples * LOG2(tuples);

So I'm wondering if the hashagg is not ignoring similar non-I/O costs
for the spilling case. In particular, the initial section computing
startup_cost seems to ignore that we may need to do some of the stuff
repeatedly - for example we'll repeat hash lookups for spilled tuples,
and so on.

The other thing is that sort seems to be doing only about half the
physical I/O (as measured by iosnoop) compared to hashagg, even though
the estimates of pages / input_bytes are exactly the same. For hashagg
iosnoop shows 5921MB reads and 7185MB writes, while sort only does
2895MB reads and 3655MB writes. That kinda matches the observed sizes
of temp files in the two cases, so the input_bytes for sort seems to be
a bit overestimated.

regards

[1] https://www.postgresql.org/message-id/20200724012248.y77rpqc73agrsvb3@development

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
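As a rough cross-check of that sort term, with the default cost
parameters (cpu_operator_cost = 0.0025, so comparison_cost =
2.0 * cpu_operator_cost = 0.005) the comparison charge alone is:

  LOG2(191990736)           ~= 27.5 comparisons per input tuple
  0.005 * 191990736 * 27.5  ~= 26.4M cost units

That accounts for most of the ~29.7M gap between the Sort node's
startup cost (35.2M) and the seq scan feeding it (5.5M), and cost_agg
charges no comparable per-tuple CPU term for the spilled case - which
would explain much of the roughly 1/3 cost ratio above.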
On Sun, 2020-08-30 at 17:03 +0200, Tomas Vondra wrote:
> So I'm wondering if the hashagg is not ignoring similar non-I/O costs
> for the spilling case. In particular, the initial section computing
> startup_cost seems to ignore that we may need to do some of the stuff
> repeatedly - for example we'll repeat hash lookups for spilled
> tuples, and so on.

To fix that, we'd also need to change the cost of in-memory HashAgg,
right?

> The other thing is that sort seems to be doing only about half the
> physical I/O (as measured by iosnoop) compared to hashagg, even
> though the estimates of pages / input_bytes are exactly the same.
> For hashagg iosnoop shows 5921MB reads and 7185MB writes, while sort
> only does 2895MB reads and 3655MB writes. That kinda matches the
> observed sizes of temp files in the two cases, so the input_bytes
> for sort seems to be a bit overestimated.

Hmm, interesting.

How reasonable is it to be making these kinds of changes to the cost
model right now? I think your analysis is solid, but I'm worried about
making more intrusive changes very late in the cycle.

I had originally tried to limit the cost model changes to the new plans
I am introducing -- that is, HashAgg plans expected to require disk.
That's why I came up with a rather arbitrary penalty.

Regards,
	Jeff Davis
On Mon, Aug 31, 2020 at 11:34:34PM -0700, Jeff Davis wrote:
>On Sun, 2020-08-30 at 17:03 +0200, Tomas Vondra wrote:
>> So I'm wondering if the hashagg is not ignoring similar non-I/O costs
>> for the spilling case. In particular, the initial section computing
>> startup_cost seems to ignore that we may need to do some of the stuff
>> repeatedly - for example we'll repeat hash lookups for spilled
>> tuples, and so on.
>
>To fix that, we'd also need to change the cost of in-memory HashAgg,
>right?
>

Why? I don't think we need to change the costing of in-memory HashAgg.
My assumption was we'd only tweak startup_cost for cases with spilling,
by adding something like (cpu_operator_cost * npartitions * ntuples).

>> The other thing is that sort seems to be doing only about half the
>> physical I/O (as measured by iosnoop) compared to hashagg, even
>> though the estimates of pages / input_bytes are exactly the same.
>> For hashagg iosnoop shows 5921MB reads and 7185MB writes, while sort
>> only does 2895MB reads and 3655MB writes. That kinda matches the
>> observed sizes of temp files in the two cases, so the input_bytes
>> for sort seems to be a bit overestimated.
>
>Hmm, interesting.
>

FWIW I suspect some of this difference may be due to logical vs.
physical I/O. iosnoop only tracks physical I/O sent to the device, but
maybe we do much more logical I/O and it simply does not expire from
page cache for the sort. It might behave differently for larger data
sets, longer queries, ...

>How reasonable is it to be making these kinds of changes to the cost
>model right now? I think your analysis is solid, but I'm worried about
>making more intrusive changes very late in the cycle.
>
>I had originally tried to limit the cost model changes to the new plans
>I am introducing -- that is, HashAgg plans expected to require disk.
>That's why I came up with a rather arbitrary penalty.
>

I don't know. I certainly understand the desire not to change things
this late. OTOH I'm worried that we'll end up receiving a lot of poor
plans post release.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, 2020-09-01 at 11:19 +0200, Tomas Vondra wrote:
> Why? I don't think we need to change the costing of in-memory
> HashAgg. My assumption was we'd only tweak startup_cost for cases
> with spilling, by adding something like (cpu_operator_cost *
> npartitions * ntuples).

The code above (the in-memory case) has a clause:

  startup_cost += (cpu_operator_cost * numGroupCols) * input_tuples;

which seems to account only for the hash calculation, because it's
multiplying by the number of grouping columns.

Your calculation would also use cpu_operator_cost, but just for the
lookup. I'm OK with that, but it's a little inconsistent to only count
it for the tuples that spill to disk.

But why multiply by the number of partitions? Wouldn't it be the depth?
A wide fanout will not increase the number of lookups.

> FWIW I suspect some of this difference may be due to logical vs.
> physical I/O. iosnoop only tracks physical I/O sent to the device,
> but maybe we do much more logical I/O and it simply does not expire
> from page cache for the sort. It might behave differently for larger
> data sets, longer queries, ...

That would suggest something like a penalty for HashAgg for being a
worse IO pattern. Or do you have another suggestion?

> I don't know. I certainly understand the desire not to change things
> this late. OTOH I'm worried that we'll end up receiving a lot of poor
> plans post release.

I was reacting mostly to changing the cost of Sort. Do you think
changes to Sort are required, or did I misunderstand?

Regards,
	Jeff Davis
On Tue, Sep 01, 2020 at 12:58:59PM -0700, Jeff Davis wrote:
>On Tue, 2020-09-01 at 11:19 +0200, Tomas Vondra wrote:
>> Why? I don't think we need to change the costing of in-memory
>> HashAgg. My assumption was we'd only tweak startup_cost for cases
>> with spilling, by adding something like (cpu_operator_cost *
>> npartitions * ntuples).
>
>The code above (the in-memory case) has a clause:
>
>  startup_cost += (cpu_operator_cost * numGroupCols) * input_tuples;
>
>which seems to account only for the hash calculation, because it's
>multiplying by the number of grouping columns.
>
>Your calculation would also use cpu_operator_cost, but just for the
>lookup. I'm OK with that, but it's a little inconsistent to only count
>it for the tuples that spill to disk.
>
>But why multiply by the number of partitions? Wouldn't it be the depth?
>A wide fanout will not increase the number of lookups.
>

Yeah, I think you're right it should be depth, not the number of
partitions. FWIW I don't know if this is enough to "fix" the costing;
it's just something I noticed while looking at the code.

>> FWIW I suspect some of this difference may be due to logical vs.
>> physical I/O. iosnoop only tracks physical I/O sent to the device,
>> but maybe we do much more logical I/O and it simply does not expire
>> from page cache for the sort. It might behave differently for larger
>> data sets, longer queries, ...
>
>That would suggest something like a penalty for HashAgg for being a
>worse IO pattern. Or do you have another suggestion?
>

Possibly, yes. I think it'd be good to measure logical I/O (e.g. by
adding some instrumentation to LogicalTapeSet) to see if this
hypothesis is actually true.

FWIW any thoughts about the difference in temp size compared to
CP_SMALL_TLIST?

>> I don't know. I certainly understand the desire not to change things
>> this late. OTOH I'm worried that we'll end up receiving a lot of
>> poor plans post release.
>
>I was reacting mostly to changing the cost of Sort. Do you think
>changes to Sort are required, or did I misunderstand?
>

Not sure I'm following. I don't think anyone proposed changing the
costing for Sort. Or did I miss something?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
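In code form, the depth-based charge discussed above might look
something like this sketch (the names are assumptions; nbatches and
num_partitions stand in for the planner's existing estimates):

  /*
   * Sketch only: each spilled tuple is written out and re-read once
   * per recursion level, so charge per level ("depth"), not per
   * partition - a wider fanout adds partitions, not extra lookups.
   */
  depth = ceil(log(nbatches) / log(num_partitions));
  startup_cost += cpu_operator_cost * depth * input_tuples;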
On Tue, Sep 1, 2020 at 2:19 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> FWIW I suspect some of this difference may be due to logical vs.
> physical I/O. iosnoop only tracks physical I/O sent to the device,
> but maybe we do much more logical I/O and it simply does not expire
> from page cache for the sort. It might behave differently for larger
> data sets, longer queries, ...

There is also the fact that the LogicalTapeSetBlocks() instrumentation
is known to have problems that we still need to fix:

https://www.postgresql.org/message-id/CAH2-Wzn5PCBLUrrds=hD439LtWP+PD7ekRTd=8LdtqJ+KO5D1Q@mail.gmail.com

I'm not suggesting that this is a significant factor here. But I can't
rule it out just yet either.

> I don't know. I certainly understand the desire not to change things
> this late. OTOH I'm worried that we'll end up receiving a lot of poor
> plans post release.

I think that this needs to get fixed before release.

--
Peter Geoghegan
On Sun, 2020-08-30 at 17:03 +0200, Tomas Vondra wrote:
> So I'm wondering if the hashagg is not ignoring similar non-I/O costs
> for the spilling case. In particular, the initial section computing
> startup_cost seems to ignore that we may need to do some of the stuff
> repeatedly - for example we'll repeat hash lookups for spilled
> tuples, and so on.

I tried to analyze this using a slightly different approach: cost units
per second of runtime. Obviously this will vary based on the machine,
but it's interesting anyway. All of the following fit in system memory.
Schema, data, and queries are at the end of this email.

A low value of cost-units/second-runtime means "more likely to be
chosen incorrectly" and a high value means "more likely to be missed
incorrectly".

 Plan    | work_mem | 10M  | 100M INT4    | 100M | 10M
         |          | INT4 | (10M groups) | INT4 | TEXT
---------+----------+------+--------------+------+------
 HashAgg |   4MB    |   88 |           63 |   82 |   78
 HashAgg |   1TB    |   41 |           37 |   33 |   38
 Sort    |   4MB    |  182 |          188 |  174 |   37
 Sort    |   1TB    |  184 |          231 |  189 |   30

Sort is consistent for a given datatype, but it seems that the
comparison cost is far from proportionate between data types. HashAgg
is consistent between the data types, but the disk costs play a larger
role (in this case, there is no actual disk access to worry about,
because everything fits in system memory).

At least for these simple queries, Sort is punished unfairly for INT4,
but gets an unfair boost for TEXT. It seems that we should make a
change here, but to be conservative for 13, we should limit changes to
the new plans, i.e. the first line (HashAgg that spills).

The attached patch makes two adjustments: a 2X disk penalty for
HashAgg, and I also add:

  spill_cost = depth * input_tuples * 2.0 * cpu_tuple_cost

The new results:

 Plan    | work_mem | 10M  | 100M INT4    | 100M | 10M
         |          | INT4 | (10M groups) | INT4 | TEXT
---------+----------+------+--------------+------+------
 HashAgg |   4MB    |  192 |          131 |  178 |  166

That's much more comparable to Sort for INT4, but it makes the gap
wider for TEXT. Fortunately, at least for my simple queries, it just
barely avoids a plan change to Sort for the TEXT case (which is nearly
5X slower than HashAgg).

I think this approach is reasonable for 13: it only changes the costing
for HashAgg plans that are expected to spill, it's fairly conservative
so we will not get lots of new bad plans, and it still seems to use
HashAgg for cases where it clearly should.

Note: in-memory HashAgg is still unfairly favored over Sort, at least
for these cases.

Regards,
	Jeff Davis

create table t10m(i int);
insert into t10m
  select (random()*1000000000)::int from generate_series(1,10000000);
explain analyze select distinct i from t10m;

create table t100m(i int);
insert into t100m
  select (random()*2000000000)::int from generate_series(1,100000000);
explain analyze select distinct i from t100m;

-- 100m tuples, 10m groups, 10 tuples/group
create table t100m_10(i int);
insert into t100m_10
  select (random()*10000000)::int from generate_series(1,100000000);
explain analyze select distinct i from t100m_10;

create table text10m(t text collate "C.UTF-8", i int, n numeric);
insert into text10m
  select s.g::text, s.g, s.g::numeric
  from (select (random()*1000000000)::int as g
        from generate_series(1,10000000)) s;
explain analyze select distinct t from text10m;
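The two adjustments together would look roughly like this in cost_agg()
(a sketch with assumed variable names; the patch itself may differ in
detail):

  /*
   * Sketch: double the I/O charge for HashAgg spill traffic ...
   */
  startup_cost += 2.0 * (pages_written + pages_read) * seq_page_cost;

  /*
   * ... and charge CPU for writing each input tuple out and reading
   * it back, once per expected recursion level ("depth").
   */
  spill_cost = depth * input_tuples * 2.0 * cpu_tuple_cost;
  startup_cost += spill_cost;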
On Wed, Sep 2, 2020 at 5:18 PM Jeff Davis <pgsql@j-davis.com> wrote:
> create table text10m(t text collate "C.UTF-8", i int, n numeric);
> insert into text10m
>   select s.g::text, s.g, s.g::numeric
>   from (select (random()*1000000000)::int as g
>         from generate_series(1,10000000)) s;
> explain analyze select distinct t from text10m;

Note that you won't get what Postgres considers to be the C collation
unless you specify "collate C" -- "C.UTF-8" is the C collation exposed
by glibc. The difference matters a lot, because only the former can use
abbreviated keys (unless you manually #define TRUST_STRXFRM). And even
without abbreviated keys it's probably still significantly faster for
other reasons.

This doesn't undermine your point, because we don't take the difference
into account in cost_sort() -- even though abbreviated keys will
regularly make text sorts 2x-3x faster. My point is only that it would
be more accurate to say that the costing unfairly boosts sorts on
collated text specifically. Though maybe not when an ICU collation is
used (since abbreviated keys will be enabled generally).

--
Peter Geoghegan
On Wed, 2020-09-02 at 17:35 -0700, Peter Geoghegan wrote:
> On Wed, Sep 2, 2020 at 5:18 PM Jeff Davis <pgsql@j-davis.com> wrote:
> > create table text10m(t text collate "C.UTF-8", i int, n numeric);
> > insert into text10m
> >   select s.g::text, s.g, s.g::numeric
> >   from (select (random()*1000000000)::int as g
> >         from generate_series(1,10000000)) s;
> > explain analyze select distinct t from text10m;
>
> Note that you won't get what Postgres considers to be the C collation
> unless you specify "collate C" -- "C.UTF-8" is the C collation
> exposed by glibc. The difference matters a lot, because only the
> former can use abbreviated keys (unless you manually #define
> TRUST_STRXFRM). And even without abbreviated keys it's probably still
> significantly faster for other reasons.

Thank you. I reran with:

  create table text10m2(t text collate "C", i int, n numeric);
  -- same data, same queries

And the new table is:

 Plan     | work | 10M  | 100M INT4 | 100M | 10M  | 10M
          | mem  | INT4 | 10M grps  | INT4 | TEXT | TEXTC
----------+------+------+-----------+------+------+-------
 HashAgg  | 4MB  |   88 |        63 |   82 |   78 |    80
 HashAgg  | 1TB  |   41 |        37 |   33 |   38 |    43
 Sort     | 4MB  |  182 |       188 |  174 |   37 |   146
 Sort     | 1TB  |  184 |       231 |  189 |   30 |   149
 HashAgg* | 4MB  |  192 |       131 |  178 |  166 |   176

 *: patched

For the 'COLLATE "C"' case, the costs still come out almost the same
between HashAgg and Sort, but the runtimes are much closer. So even if
the patch did flip the plan from HashAgg to Sort, it would go from 9.5s
(HashAgg) to 12s (Sort), which is not so bad.

So the patched version looks good to me at this point. It accounts for
Tomas's observations about IO:

  "The other thing is that sort seems to be doing only about half the
  physical I/O (as measured by iosnoop) compared to hashagg, even
  though the estimates of pages / input_bytes are exactly the same."

by penalizing HashAgg disk costs by 2X. The patch also accounts for his
other observation about missing CPU costs by costing the spilling
itself. Tomas framed the CPU costs as the cost of the extra lookups,
but the extra lookups happen only in the cases where a tuple misses the
hash table and needs to spill. So I think it's reasonable to consider
the extra lookups as part of the spill cost.

The remaining problems are:

* comparison costs for Sort should be adjusted to make them relatively
  consistent between data types
* in-memory HashAgg is unfairly favored in a lot of cases

I don't think either of those problems needs to be (or should be) fixed
in 13, but we can revisit in 14 if there are reports of bad plans.

Regards,
	Jeff Davis
On Tue, 2020-09-01 at 23:19 +0200, Tomas Vondra wrote:
> FWIW any thoughts about the difference in temp size compared to
> CP_SMALL_TLIST?

Are you referring to results from a while ago? In this thread I don't
see what you're referring to.

I tried a simple case on REL_13_STABLE, with and without the
CP_SMALL_TLIST change, and I saw only a tiny difference. Do you have a
current case that shows a larger difference?

The only thing I can think of that might change is the size of the null
bitmap or how fields are aligned.

Regards,
	Jeff Davis
On Thu, Sep 03, 2020 at 05:53:43PM -0700, Jeff Davis wrote:
>On Tue, 2020-09-01 at 23:19 +0200, Tomas Vondra wrote:
>> FWIW any thoughts about the difference in temp size compared to
>> CP_SMALL_TLIST?
>
>Are you referring to results from a while ago? In this thread I don't
>see what you're referring to.
>
>I tried a simple case on REL_13_STABLE, with and without the
>CP_SMALL_TLIST change, and I saw only a tiny difference. Do you have a
>current case that shows a larger difference?
>

I'm referring to the last charts in the message from July 27, comparing
behavior with the CP_SMALL_TLIST fix vs. master (which
reverted/replaced the CP_SMALL_TLIST bit). Those charts show that
CP_SMALL_TLIST resulted in smaller temp files (per EXPLAIN ANALYZE the
difference is ~25%) and also lower query durations (also in the ~25%
range). I can repeat those tests, if needed.

[1] https://www.postgresql.org/message-id/20200724012248.y77rpqc73agrsvb3@development

>The only thing I can think of that might change is the size of the
>null bitmap or how fields are aligned.
>

Maybe. Not sure.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, 2020-09-04 at 14:56 +0200, Tomas Vondra wrote:
> Those charts show that the CP_SMALL_TLIST resulted in smaller temp
> files (per EXPLAIN ANALYZE the difference is ~25%) and also lower
> query durations (also in the ~25% range).

I was able to reproduce the problem, thank you.

Only two attributes are needed, so the CP_SMALL_TLIST projected schema
only needs a single-byte null bitmap. But if we just set the unneeded
attributes to NULL rather than projecting them away, the null bitmap
size is based on all 16 attributes, bumping the bitmap size to two
bytes:

  MAXALIGN(23 + 1) = 24
  MAXALIGN(23 + 2) = 32

I think that explains it. It's not ideal, but projection has a cost as
well, so I don't think we necessarily need to do something here.

If we are motivated to improve this in v14, we could potentially have a
different schema for spilled tuples, and perform real projection at
spill time. But I don't know if that's worth the extra complexity.

Regards,
	Jeff Davis
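In terms of the existing htup_details.h macros, where BITMAPLEN(natts)
is ((natts + 7) / 8) and the 23 bytes are the tuple header before the
bitmap, the per-tuple sizes work out as:

  BITMAPLEN(2)  = 1  =>  MAXALIGN(23 + 1) = 24 bytes/tuple
  BITMAPLEN(16) = 2  =>  MAXALIGN(23 + 2) = 32 bytes/tuple

so each spilled tuple is 8 bytes (25%) smaller with the projected
two-attribute schema - consistent with the ~25% temp file difference
observed above.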
On Fri, Sep 04, 2020 at 11:31:36AM -0700, Jeff Davis wrote:
>On Fri, 2020-09-04 at 14:56 +0200, Tomas Vondra wrote:
>> Those charts show that the CP_SMALL_TLIST resulted in smaller temp
>> files (per EXPLAIN ANALYZE the difference is ~25%) and also lower
>> query durations (also in the ~25% range).
>
>I was able to reproduce the problem, thank you.
>
>Only two attributes are needed, so the CP_SMALL_TLIST projected schema
>only needs a single-byte null bitmap. But if we just set the unneeded
>attributes to NULL rather than projecting them away, the null bitmap
>size is based on all 16 attributes, bumping the bitmap size to two
>bytes:
>
>  MAXALIGN(23 + 1) = 24
>  MAXALIGN(23 + 2) = 32
>
>I think that explains it. It's not ideal, but projection has a cost as
>well, so I don't think we necessarily need to do something here.
>
>If we are motivated to improve this in v14, we could potentially have
>a different schema for spilled tuples, and perform real projection at
>spill time. But I don't know if that's worth the extra complexity.
>

Thanks for the investigation and explanation.

Wouldn't it be enough to just use a slot with a smaller tuple
descriptor? All we'd need to do is create the descriptor in ExecInitAgg
after calling find_hash_columns, use it for rslot/wslot, and then "map"
the attributes in hashagg_spill_tuple (which already almost does that,
so the extra cost should be ~0) and when reading the spilled tuples. So
I'm not quite buying the argument that this would make a measurable
difference ...

That being said, I won't insist on fixing this in v13 - at least we
know what the issue is and we can fix it later. The costing seems like
a more serious open item.

OTOH I don't think this example is particularly extreme, and I wouldn't
be surprised if we saw even worse examples in practice - tables tend to
be quite wide, and aggregation of just a few columns seems likely.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
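A sketch of the attribute mapping described above (the slot calls are
the real executor APIs; the variable names are assumptions for
illustration):

  /*
   * Sketch: before writing a tuple to the spill file, copy only the
   * needed columns into a narrow slot built from the small descriptor
   * created in ExecInitAgg, then materialize that slot - giving the
   * 1-byte null bitmap without projecting every input tuple.
   */
  slot_getsomeattrs(inputslot, maxNeededAttr);
  ExecClearTuple(spillslot);
  for (int i = 0; i < numSpillCols; i++)
  {
      spillslot->tts_values[i] = inputslot->tts_values[spillColIdx[i] - 1];
      spillslot->tts_isnull[i] = inputslot->tts_isnull[spillColIdx[i] - 1];
  }
  ExecStoreVirtualTuple(spillslot);
  tuple = ExecCopySlotMinimalTuple(spillslot);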
On Fri, 2020-09-04 at 21:01 +0200, Tomas Vondra wrote:
> Wouldn't it be enough to just use a slot with a smaller tuple
> descriptor? All we'd need to do is create the descriptor in
> ExecInitAgg after calling find_hash_columns, use it for rslot/wslot,
> and then "map" the attributes in hashagg_spill_tuple (which already
> almost does that, so the extra cost should be ~0) and when reading
> the spilled tuples.

That's a good point, it's probably not much code to make it work.

> So I'm not quite buying the argument that this would make a
> measurable difference ...

I meant that "projection of all input tuples" (i.e. CP_SMALL_TLIST) has
a cost. If we project only at spill time, it should be fine.

Regards,
	Jeff Davis
Hi,

I've tested the costing changes on the simplified TPC-H query, on two
different machines, and it seems like a clear improvement.

This is using the same cost/duration measure, which I think is a pretty
neat way to look at this. Sure, it's imperfect (it depends on which
costs and durations you actually take, etc.), but it makes the
comparisons easier, and for simple queries it does not matter that
much.

The costing clearly depends on parameters like random_page_cost and how
well that matches the hardware, but for the machine with an SSD and the
default random_page_cost the effect looks like this:

  work_mem    sort    master    patched
  ---------------------------------------
  1MB          249        95        215
  2MB          256        89        187
  4MB          233        90        192
  8MB          227        70        124
  16MB         245        67        118
  32MB         261        63        111
  64MB         256        59        104
  256MB        266        55        102

and with random_page_cost reduced to 1.5 it looks like this:

  work_mem    sort    master    patched
  ---------------------------------------
  1MB          221        63        150
  2MB          227        64        133
  4MB          214        65        137
  8MB          214        57         95
  16MB         232        53         90
  32MB         247        50         85
  64MB         249        47         80
  256MB        258        46         77

And on a machine with SATA RAID storage it looks like this:

  work_mem    sort    master    patched
  ---------------------------------------
  1MB          102        41         94
  2MB          101        34         77
  4MB           99        35         78
  8MB           98        35         79
  16MB          98        25         50
  32MB         106        26         51
  64MB         106        26         51
  256MB        105        29         50

So yeah, the patched costing is much closer to sort (from the point of
view of this cost/duration metric), although for higher work_mem values
there's still a clear gap where the hashing seems to be under-costed by
a factor of ~2 or more.

I think this is simply showing that the effect of increasing work_mem
is much more pronounced for sort/groupagg than for hashagg. For example
on the SSD machine the duration changes like this:

  work_mem    hashagg    groupagg
  ---------------------------------
  1MB             217         201
  2MB             178         195
  4MB             176         186
  8MB             160         176
  16MB            168         163
  32MB            180         153
  64MB            189         143
  256MB           204         138

and the SATA RAID storage seems to behave in a similar way (although
the difference is smaller).

So in general I think this costing change is reasonable. It might not
go far enough, but it probably makes it easier to tweak the rest by
changing random_page_cost etc. I'm not sure if we need/should tweak the
costing to reduce the effect of work_mem (on hashagg).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, 2020-09-06 at 23:21 +0200, Tomas Vondra wrote:
> I've tested the costing changes on the simplified TPC-H query, on two
> different machines, and it seems like a clear improvement.

Thank you. Committed.

> So yeah, the patched costing is much closer to sort (from the point
> of view of this cost/duration metric), although for higher work_mem
> values there's still a clear gap where the hashing seems to be
> under-costed by a factor of ~2 or more.

There seems to be a cliff right after 4MB. Perhaps lookup costs on a
larger hash table?

Regards,
	Jeff Davis
On Mon, Sep 07, 2020 at 01:55:28PM -0700, Jeff Davis wrote:
>On Sun, 2020-09-06 at 23:21 +0200, Tomas Vondra wrote:
>> I've tested the costing changes on the simplified TPC-H query, on
>> two different machines, and it seems like a clear improvement.
>
>Thank you. Committed.
>
>> So yeah, the patched costing is much closer to sort (from the point
>> of view of this cost/duration metric), although for higher work_mem
>> values there's still a clear gap where the hashing seems to be
>> under-costed by a factor of ~2 or more.
>
>There seems to be a cliff right after 4MB. Perhaps lookup costs on a
>larger hash table?
>

I assume you mean higher costs due to the hash table outgrowing some
level of CPU cache (L2/L3), right? Good guess - the CPU has ~6MB of
cache - but no. This seems to be merely due to costing, because the raw
cost/duration numbers look like this:

  work_mem        cost    duration
  ---------------------------------
  1MB         20627403      216861
  2MB         15939722      178237
  4MB         15939722      176296
  8MB         11252041      160479
  16MB        11252041      168304
  32MB        11252041      179567
  64MB        11252041      189410
  256MB       11252041      204206

This is unpatched master; with the costing patch it looks similar
except that the cost is about 2x higher. On the SATA RAID machine, it
looks like this:

  work_mem         cost    duration
  -----------------------------------
  1MB         108358461     1147269
  2MB          77381688     1004895
  4MB          77381688      994853
  8MB          77381688      980071
  16MB         46404915      930511
  32MB         46404915      902167
  64MB         46404915      908757
  256MB        46404915      926862

So roughly the same - the cost drops to less than 50%, but the duration
really does not. This is what I referred to when I said "Not sure if we
need/should tweak the costing to reduce the effect of work_mem (on
hashagg)."

For sort this seems to behave a bit more nicely - the cost and duration
(with increasing work_mem) are correlated quite well, I think.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services