Thread: Trouble with hashagg spill I/O pattern and costing

Trouble with hashagg spill I/O pattern and costing

From
Tomas Vondra
Date:
Hi,

I've been running some TPC-H benchmarks on master, to check if there's
something unexpected, and I ran into some annoying issues with Q17 and
Q20. I'll use Q17 as it's a bit simpler.

I think there are two related problems - with costing and with excessive
I/O due to using logical tapes.

Let's talk about the costing first. On 75GB scale (with disabled parallel
query), the execution plan looks like this:

                                                       QUERY PLAN
     ---------------------------------------------------------------------------------------------------------------
      Limit  (cost=16997740.10..16997740.12 rows=1 width=32)
        ->  Aggregate  (cost=16997740.10..16997740.12 rows=1 width=32)
              ->  Nested Loop  (cost=14204895.82..16997574.11 rows=66397 width=8)
                    Join Filter: (part.p_partkey = lineitem.l_partkey)
                    ->  Hash Join  (cost=14204895.25..16251060.84 rows=6640 width=40)
                          Hash Cond: (lineitem_1.l_partkey = part.p_partkey)
                          ->  HashAggregate  (cost=13977751.34..15945557.39 rows=6206695 width=36)
                                Group Key: lineitem_1.l_partkey
                                Planned Partitions: 128
                                ->  Seq Scan on lineitem lineitem_1  (cost=0.00..5519079.56 rows=191969856 width=9)
                          ->  Hash  (cost=227058.33..227058.33 rows=6846 width=4)
                                ->  Seq Scan on part  (cost=0.00..227058.33 rows=6846 width=4)
                                      Filter: ((p_brand = 'Brand#22'::bpchar) AND (p_container = 'LG BOX'::bpchar))
                    ->  Index Scan using idx_lineitem_part_supp on lineitem  (cost=0.57..112.30 rows=10 width=17)
                          Index Cond: (l_partkey = lineitem_1.l_partkey)
                          Filter: (l_quantity < ((0.2 * avg(lineitem_1.l_quantity))))
     (16 rows)

and if I disable hash aggregate (or spill to disk), it changes to this:

                                                            QUERY PLAN
     -------------------------------------------------------------------------------------------------------------------------
      Limit  (cost=44577524.39..44577524.40 rows=1 width=32)
        ->  Aggregate  (cost=44577524.39..44577524.40 rows=1 width=32)
              ->  Merge Join  (cost=41772792.17..44577358.39 rows=66397 width=8)
                    Merge Cond: (lineitem_1.l_partkey = part.p_partkey)
                    Join Filter: (lineitem.l_quantity < ((0.2 * avg(lineitem_1.l_quantity))))
                    ->  GroupAggregate  (cost=41772791.17..43305665.51 rows=6206695 width=36)
                          Group Key: lineitem_1.l_partkey
                          ->  Sort  (cost=41772791.17..42252715.81 rows=191969856 width=9)
                                Sort Key: lineitem_1.l_partkey
                                ->  Seq Scan on lineitem lineitem_1  (cost=0.00..5519079.56 rows=191969856 width=9)
                    ->  Materialize  (cost=1.00..1191105.89 rows=205371 width=21)
                          ->  Nested Loop  (cost=1.00..1190592.46 rows=205371 width=21)
                                ->  Index Scan using part_pkey on part  (cost=0.43..329262.21 rows=6846 width=4)
                                      Filter: ((p_brand = 'Brand#22'::bpchar) AND (p_container = 'LG BOX'::bpchar))
                                ->  Index Scan using idx_lineitem_part_supp on lineitem  (cost=0.57..125.51 rows=31 width=17)
                                      Index Cond: (l_partkey = part.p_partkey)
     (16 rows)

The problem is that the hashagg plan runs in ~1400 seconds, while the
groupagg only takes ~360. And per explain analyze, the difference really
is in the aggregation - if we subtract the seqscan, the sort+groupagg
takes about 310s:

     ->  GroupAggregate  (cost=41772791.17..43305665.51 rows=6206695 width=36) (actual time=283378.004..335611.192 rows=6398981 loops=1)
           Group Key: lineitem_1.l_partkey
           ->  Sort  (cost=41772791.17..42252715.81 rows=191969856 width=9) (actual time=283377.977..306182.393 rows=191969841 loops=1)
                 Sort Key: lineitem_1.l_partkey
                 Sort Method: external merge  Disk: 3569544kB
                 ->  Seq Scan on lineitem lineitem_1  (cost=0.00..5519079.56 rows=191969856 width=9) (actual time=0.019..28253.076 rows=192000551 loops=1)

while the hashagg takes ~1330s:

     ->  HashAggregate  (cost=13977751.34..15945557.39 rows=6206695 width=36) (actual time=202952.170..1354546.897 rows=6400000 loops=1)
           Group Key: lineitem_1.l_partkey
           Planned Partitions: 128
           Peak Memory Usage: 4249 kB
           Disk Usage: 26321840 kB
           HashAgg Batches: 16512
           ->  Seq Scan on lineitem lineitem_1  (cost=0.00..5519079.56 rows=191969856 width=9) (actual time=0.007..22205.617 rows=192000551 loops=1)

And that's while only writing 26GB, compared to 35GB in the sorted plan,
and with cost being ~16M vs. ~43M (so roughly inverse).

OK, let's make the hashagg plan more expensive - that'll fix it, right?
But how do you do that? I might lower work_mem, say from 4MB to 1MB,
which gets us from ~16M

     ->  HashAggregate  (cost=13977751.34..15945557.39 rows=6206695 width=36)
           Group Key: lineitem_1.l_partkey
           Planned Partitions: 128
           ->  Seq Scan on lineitem lineitem_1  (cost=0.00..5519079.56 rows=191969856 width=9)

to ~20M (I'm a bit surprised that the planned partitions dropped 4x, but
I suspect there's an explanation for that).

      ->  HashAggregate  (cost=17727162.59..20632321.45 rows=6206695 width=36)
            Group Key: lineitem_1.l_partkey
            Planned Partitions: 32
            ->  Seq Scan on lineitem lineitem_1  (cost=0.00..5519079.56 rows=191969856 width=9)

Anyway, this did not really solve anything, apparently. The cost is
still much lower than for groupagg, and moreover I don't want to lower
work_mem - I want to increase cost for a given work_mem value. And it
also increases the sort cost estimate, of course.

As I'll show in a minute, I believe most of this is due to the I/O pattern
of the logical tapes, which is very different between sort and hashagg.
So it'd be natural to consider seq_page_cost/random_page_cost on the
temp tablespace, but that's not how it works - we just ignore that :-(


Why do I think this is due to a difference in I/O pattern on the logical
tape set? I've moved the temporary tablespace to a separate SSD device,
and used iosnoop [1] to collect all I/O requests associated with this
query. Attached are four charts showing blocks (sectors) accessed over
time, both for the groupagg and hashagg plans.
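
For reference, here's a trivial stand-alone helper along the lines of what
I used to summarize how "sequential" each request stream is. This is only
an illustration - it assumes the iosnoop output has already been reduced
to three whitespace-separated columns per request (type, starting sector,
size in bytes); the actual stats below were produced by a more detailed
aggregation of the same data.

#include <stdio.h>
#include <string.h>

#define MAXTYPES 16

typedef struct
{
    char    type[8];        /* e.g. "RA", "W" */
    long    prev_end;       /* sector where the previous request ended */
    long    count;
    long    sequential;     /* requests starting right at prev_end */
} TypeStats;

int
main(void)
{
    TypeStats   stats[MAXTYPES];
    int         ntypes = 0;
    char        type[8];
    long        sector,
                bytes;

    /* input: one request per line - type, starting sector, size in bytes */
    while (scanf("%7s %ld %ld", type, &sector, &bytes) == 3)
    {
        TypeStats  *s = NULL;

        for (int i = 0; i < ntypes; i++)
            if (strcmp(stats[i].type, type) == 0)
                s = &stats[i];

        if (s == NULL && ntypes < MAXTYPES)
        {
            s = &stats[ntypes++];
            strcpy(s->type, type);
            s->prev_end = -1;
            s->count = s->sequential = 0;
        }
        if (s == NULL)
            continue;

        s->count++;
        if (s->prev_end == sector)
            s->sequential++;
        s->prev_end = sector + bytes / 512;     /* sectors are 512B */
    }

    for (int i = 0; i < ntypes; i++)
        printf("%-4s  requests: %ld  sequential: %.1f%%\n",
               stats[i].type, stats[i].count,
               100.0 * stats[i].sequential / stats[i].count);

    return 0;
}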


1) sort + groupagg

For groupagg (tempio-sort.png) the chart looks a bit chaotic, but it's
reasonable - it shows the sort does merges, etc. Nothing particularly
surprising, IMHO.

It's also interesting to look at statistics of block sizes and deltas
between blocks for the different request types. The most common block
sizes look something like this (the last column is the percentage of all
requests with the same request type):

      type |  bytes  | count |   pct  
     ------+---------+-------+-------
      RA   |  131072 | 26034 | 59.92
      RA   |   16384 |  6160 | 14.18
      RA   |    8192 |  3636 |  8.37
      RA   |   32768 |  3406 |  7.84
      RA   |   65536 |  3270 |  7.53
      RA   |   24576 |   361 |  0.83
      ...
      W    | 1310720 |  8070 | 34.26
      W    |  262144 |  1213 |  5.15
      W    |  524288 |  1056 |  4.48
      W    | 1056768 |   689 |  2.93
      W    |  786432 |   292 |  1.24
      W    |  802816 |   199 |  0.84
      ...

The writes are buffered and so are done by kworkers, which seem to be
able to coalesce them into fairly large chunks (e.g. 34% are 1280kB).
The reads come from the postgres backend, and generally are 128kB reads.

The deltas (in 512B sectors) are mostly consistent with this:

      type | block_delta | count |   pct 
     ------+-------------+-------+-------
      RA   |         256 | 13432 | 30.91
      RA   |          16 |  3291 |  7.57
      RA   |          32 |  3272 |  7.53
      RA   |          64 |  3266 |  7.52
      RA   |         128 |  2877 |  6.62
      RA   |        1808 |  1278 |  2.94
      RA   |       -2320 |   483 |  1.11
      RA   |       28928 |   386 |  0.89
      ...
      W    |        2560 |  7856 | 33.35
      W    |        2064 |  4921 | 20.89
      W    |        2080 |   586 |  2.49
      W    |       30960 |   300 |  1.27
      W    |        2160 |   253 |  1.07
      W    |        1024 |   248 |  1.05
      ...

I believe this suggests most of the I/O is pretty sequential. E.g. 31%
of the reads are 256 sectors (128kB) apart, which matches the 128kB
read size.


2) hashagg

The I/O pattern is illustrated by the tempio-hash.png chart, and it's
clearly very different from the sort one. We're reading over and over
in a zig-zag way. I'm pretty sure there are ~128 cycles, matching the
number of partitions in the explain analyze output, which end up being
interleaved in the temporary file.

But even at the partition level this is not very sequential - there
are two "zoom" charts showing smaller parts in more detail, and there's
a very obvious nested zig-zag pattern.

Also, let's look at the block / delta stats:

      type |  bytes  |  count  |    pct
     ------+---------+---------+--------
      RA   |    8192 | 3087724 |  95.42
      RA   |   24576 |   69511 |   2.15
      RA   |   16384 |   49297 |   1.52
      RA   |   32768 |   15589 |   0.48
      ...
      W    |    8192 |  321089 |  65.72
      W    |   16384 |   74097 |  15.17
      W    |   24576 |   27785 |   5.69
      W    | 1310720 |   16860 |   3.45
      W    |   32768 |   13823 |   2.83
      W    |   40960 |    7771 |   1.59
      W    |   49152 |    4767 |   0.98
      ...

Well, that's not great - we're not really coalescing writes or reads,
everything is pretty much 8kB blocks. The writes are especially
surprising/concerning, because it shows the kernel is unable to combine
the requests etc.

The deltas look very different too:

      type | block_delta | count |   pct
     ------+-------------+-------+-------
      RA   |        2016 | 72399 |  2.24
      RA   |        2032 | 72351 |  2.24
      RA   |        1984 | 72183 |  2.23
      RA   |        2000 | 71964 |  2.22
      RA   |        2048 | 71718 |  2.22
      RA   |        2064 | 71387 |  2.21
      RA   |        1968 | 71363 |  2.21
      RA   |        1952 | 70412 |  2.18
      RA   |        2080 | 70189 |  2.17
      RA   |        2096 | 69568 |  2.15
      RA   |        1936 | 69109 |  2.14
      RA   |        1920 | 67660 |  2.09
      RA   |        2112 | 67248 |  2.08
      RA   |        1904 | 66026 |  2.04
     ...

There's no clear winner matching the block size, or anything. In fact,
it oscillates around 2000 sectors, i.e. 1MB. And 128 partitions
multiplied by one 8kB block per partition is ... 1MB (tadaaaa!).

This however makes any read-ahead attempts ineffective :-(
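
To illustrate the arithmetic, here's a tiny stand-alone model (just a toy,
not the actual logtape.c logic) of 128 partitions all growing at the same
rate and taking the next free 8kB block from a shared allocator. Reading
any one partition back then has to jump ~1MB between consecutive blocks,
which is roughly the stride the deltas oscillate around:

#include <stdio.h>

#define NPARTITIONS 128
#define BLKSZ       8192
#define ROUNDS      8           /* blocks written per partition */

int
main(void)
{
    long    next_block = 0;
    long    blocks[NPARTITIONS][ROUNDS];

    /* interleaved writes: every partition grabs one block per round */
    for (int r = 0; r < ROUNDS; r++)
        for (int p = 0; p < NPARTITIONS; p++)
            blocks[p][r] = next_block++;

    /* now read partition 0 back "sequentially" */
    for (int r = 1; r < ROUNDS; r++)
    {
        long    gap = blocks[0][r] - blocks[0][r - 1];

        printf("block %ld -> block %ld   gap = %ld blocks (%ld kB)\n",
               blocks[0][r - 1], blocks[0][r], gap, gap * BLKSZ / 1024);
    }

    return 0;
}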

And let me repeat - this is on a machine with temp tablespace moved to
an SSD, so the random I/O is not entirely terrible. On a different box
with temp tablespace on 3x SATA RAID, the impact is much worse.


This kinda makes me question whether logical tapes are the right tool
for hashagg. I've read the explanation in logtape.c of why it's about the
same amount of I/O as using separate files, but IMO that only really
works for I/O patterns similar to merge sort - the more I think about
this, the more I'm convinced we should just do what hashjoin is doing.

But maybe I'm wrong, and logical tapes are the best thing we can do
here. But in that case I think we need to improve the costing, so that
it reflects the very different I/O pattern.


[1] https://github.com/brendangregg/perf-tools/blob/master/iosnoop

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Trouble with hashagg spill I/O pattern and costing

From
Jeff Davis
Date:
On Tue, 2020-05-19 at 17:12 +0200, Tomas Vondra wrote:
> I think there are two related problems - with costing and with
> excessive
> I/O due to using logical tapes.

Thank you for the detailed analysis. I am still digesting this
information.

> This kinda makes me question whether logical tapes are the right tool
> for hashagg. I've read the explanation in logtape.c why it's about
> the
> same amount of I/O as using separate files, but IMO that only really
> works for I/O patterns similar to merge sort - the more I think about
> this, the more I'm convinced we should just do what hashjoin is
> doing.

Fundamentally, sort writes sequentially and reads randomly; while
HashAgg writes randomly and reads sequentially. 

If the random writes of HashAgg end up fragmented too much on disk,
then clearly the sequential reads are not so sequential anyway. The
only way to avoid fragmentation on disk is to preallocate for the
tape/file.

BufFile (relying more on the OS) would probably do a better job of
preallocating the disk space in a useful way; whereas logtape.c makes
it easier to manage buffers and the overall number of files created
(thereby allowing higher fanout of partitions).

We have a number of possibilities here:

1. Improve costing to reflect that HashAgg is creating more random IOs
than Sort.
2. Reduce the partition fanout in the hopes that the OS does a better
job with readahead.
3. Switch back to BufFile, in which case we probably need to reduce the
fanout for other reasons.
4. Change logtape.c to allow preallocation or to write in larger
blocks.
5. Change BufFile to allow more control over buffer usage, and switch
to that.

#1 or #2 are the least invasive, and I think we can get a satisfactory
solution by combining those.

I saw good results with the high fanout and low work_mem when there is
still a lot of system memory. That's a nice benefit, but perhaps it's
safer to use a lower fanout (which will lead to recursion) until we get
a better handle on the IO patterns.

Perhaps you can try recompiling with a lower max partitions and rerun
the query? How much would we have to lower it for either the cost to
approach reality or the OS readahead to become effective?

Regards,
    Jeff Davis





Re: Trouble with hashagg spill I/O pattern and costing

From
Tomas Vondra
Date:
On Tue, May 19, 2020 at 09:27:34AM -0700, Jeff Davis wrote:
>On Tue, 2020-05-19 at 17:12 +0200, Tomas Vondra wrote:
>> I think there are two related problems - with costing and with
>> excessive
>> I/O due to using logical tapes.
>
>Thank you for the detailed analysis. I am still digesting this
>information.
>
>> This kinda makes me question whether logical tapes are the right tool
>> for hashagg. I've read the explanation in logtape.c why it's about
>> the
>> same amount of I/O as using separate files, but IMO that only really
>> works for I/O patterns similar to merge sort - the more I think about
>> this, the more I'm convinced we should just do what hashjoin is
>> doing.
>
>Fundamentally, sort writes sequentially and reads randomly; while
>HashAgg writes randomly and reads sequentially.
>

Not sure. I think the charts and stats of the iosnoop data show that an
awful lot of the reads during sort are actually pretty sequential. Moreover,
sort manages to read the data in much larger blocks - 128kB instead of
just 8kB (which is what hashagg seems to be doing).

I wonder why that is, and whether we could achieve it for hashagg too ...

>If the random writes of HashAgg end up fragmented too much on disk,
>then clearly the sequential reads are not so sequential anyway. The
>only way to avoid fragmentation on disk is to preallocate for the
>tape/file.
>

And is there a way to pre-allocate larger chunks? Presumably we could
assign the blocks to a tape in larger chunks (e.g. 128kB, i.e. 16 x 8kB)
instead of just a single block. I haven't seen anything like that in
logtape.c, though ...

>BufFile (relying more on the OS) would probably do a better job of
>preallocating the disk space in a useful way; whereas logtape.c makes
>it easier to manage buffers and the overall number of files created
>(thereby allowing higher fanout of partitions).
>
>We have a number of possibilities here:
>
>1. Improve costing to reflect that HashAgg is creating more random IOs
>than Sort.

I think we'll need to do something about this, but I think we should try
improving the behavior first and then model the costing based on that.

>2. Reduce the partition fanout in the hopes that the OS does a better
>job with readahead.

I doubt this will make a significant difference. I think the problem is
the partitions end up interleaved way too much in the temp file, and I
don't see how a lower fanout would fix that.

BTW what do you mean when you say "fanout"? Do you mean how fast we
increase the number of partitions, or some parameter in particular?

>3. Switch back to BufFile, in which case we probably need to reduce the
>fanout for other reasons.

Maybe, although that seems pretty invasive post beta1.

>4. Change logtape.c to allow preallocation or to write in larger
>blocks.

I think this is what I suggested above (allocating 16 blocks at a time,
or something). I wonder how wasteful this would be, but I think not very
much. Essentially, with 1024 partitions and pre-allocating space in
128kB chunks, that means up to 128MB may end up unused, which seems ok-ish,
and I guess we could further restrict that by starting with a lower value
and gradually increasing the number. Or something like that ...

>5. Change BufFile to allow more control over buffer usage, and switch
>to that.
>

Maybe. I don't recall exactly what the issue with buffer usage is, but I
think it has the same invasiveness issue as (3). OTOH it's what hashjoin
does, and we've lived with it for ages ...

>#1 or #2 are the least invasive, and I think we can get a satisfactory
>solution by combining those.
>

OK. I think tweaking the costing (and essentially reverting to what 12
does for those queries) is perfectly reasonable. But if we can actually
get some speedup thanks to hashagg, even better.

>I saw good results with the high fanout and low work_mem when there is
>still a lot of system memory. That's a nice benefit, but perhaps it's
>safer to use a lower fanout (which will lead to recursion) until we get
>a better handle on the IO patterns.
>

I don't know how much we can rely on that - once we push some of the
data out of the page cache, it has the issues I described. The trouble is
people may not have enough memory to keep everything in cache, otherwise
they might just as well bump up work_mem and not spill at all.

>Perhaps you can try recompiling with a lower max partitions and rerun
>the query? How much would we have to lower it for either the cost to
>approach reality or the OS readahead to become effective?
>

I can try that, of course. Which parameters should I tweak / how?

I can also try running it with BufFile, in case you prepare a WIP patch.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Trouble with hashagg spill I/O pattern and costing

From
Jeff Davis
Date:
On Tue, 2020-05-19 at 19:53 +0200, Tomas Vondra wrote:
> 
> And is there a way to pre-allocate larger chunks? Presumably we could
> assign the blocks to a tape in larger chunks (e.g. 128kB, i.e. 16 x
> 8kB)
> instead of just a single block. I haven't seen anything like that in
> logtape.c, though ...

It turned out to be simple (at least a POC) so I threw together a
patch. I just added a 32-element array of block numbers to each tape.
When we need a new block, we retrieve a block number from that array;
or if it's empty, we fill it by calling ltsGetFreeBlock() 32 times.
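
To make that a bit more concrete, here's a stand-alone toy model of the
idea (simplified - the real change lives in logtape.c, keeps the array in
the tape struct, and refills it from ltsGetFreeBlock(), which draws from a
minheap rather than a plain counter):

#include <stdio.h>

#define PREALLOC 32             /* block numbers cached per tape */

typedef struct Tape
{
    long    prealloc[PREALLOC];
    int     nprealloc;          /* cached entries still unused */
} Tape;

static long next_free_block = 0;

/* stand-in for ltsGetFreeBlock(): hands out the lowest unused block */
static long
get_free_block(void)
{
    return next_free_block++;
}

/* next block for this tape; refill the per-tape cache when it's empty */
static long
tape_get_block(Tape *t)
{
    if (t->nprealloc == 0)
    {
        /* fill back-to-front so blocks are consumed in ascending order */
        for (int i = PREALLOC - 1; i >= 0; i--)
            t->prealloc[i] = get_free_block();
        t->nprealloc = PREALLOC;
    }
    return t->prealloc[--t->nprealloc];
}

int
main(void)
{
    Tape    a = {{0}, 0},
            b = {{0}, 0};

    /* two tapes writing in lockstep, like spill partitions do */
    for (int i = 0; i < 40; i++)
        printf("tape A writes block %ld, tape B writes block %ld\n",
               tape_get_block(&a), tape_get_block(&b));

    return 0;
}

With the cache, each tape's blocks come out in contiguous runs of 32
instead of strictly alternating, which is what makes the later sequential
read of a single tape actually sequential on disk.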

I reproduced the problem on a smaller scale (330M groups, ~30GB of
memory on a 16GB box), with work_mem=64MB. The query is a simple distinct.

Unpatched master:
   Sort: 250s
   HashAgg: 310s
Patched master:
   Sort: 245s
   HashAgg: 262s

That's a nice improvement for such a simple patch. We can tweak the
number of blocks to preallocate, or do other things like double from a
small number up to a maximum. Also, a proper patch would probably
release the blocks back to the free list when the tape is rewound.

As long as the number of block numbers to preallocate is not too large,
I don't think we need to change the API. It seems fine for sort to do
the same thing, even though there's not any benefit.

Regards,
    Jeff Davis



Attachment

Re: Trouble with hashagg spill I/O pattern and costing

From
Tomas Vondra
Date:
On Tue, May 19, 2020 at 09:15:40PM -0700, Jeff Davis wrote:
>On Tue, 2020-05-19 at 19:53 +0200, Tomas Vondra wrote:
>>
>> And is there a way to pre-allocate larger chunks? Presumably we could
>> assign the blocks to a tape in larger chunks (e.g. 128kB, i.e. 16 x
>> 8kB)
>> instead of just a single block. I haven't seen anything like that in
>> logtape.c, though ...
>
>It turned out to be simple (at least a POC) so I threw together a
>patch. I just added a 32-element array of block numbers to each tape.
>When we need a new block, we retrieve a block number from that array;
>or if it's empty, we fill it by calling ltsGetFreeBlock() 32 times.
>
>I reproduced the problem on a smaller scale (330M groups, ~30GB of
>memory on a 16GB box). Work_mem=64MB. The query is a simple distinct.
>
>Unpatched master:
>   Sort: 250s
>   HashAgg: 310s
>Patched master:
>   Sort: 245s
>   HashAgg: 262s
>
>That's a nice improvement for such a simple patch. We can tweak the
>number of blocks to preallocate, or do other things like double from a
>small number up to a maximum. Also, a proper patch would probably
>release the blocks back as free when the tape was rewound.
>
>As long as the number of block numbers to preallocate is not too large,
>I don't think we need to change the API. It seems fine for sort to do
>the same thing, even though there's not any benefit.
>

I gave it a try on the machine with the temp tablespace on SSD, and I can
confirm it improves performance. I've tried different work_mem values and
I've also increased the number of pre-allocated blocks to 64 and 128, and
the numbers (timings in seconds) look like this:

master

                sort      hash
    ----------------------------
       4MB       335      1331
     128MB       220      1208


patched (32)

                sort      hash
    ----------------------------
       4MB       344       685
     128MB       217       641


patched (64)

                 sort      hash
    ----------------------------
       4MB       329       545
     128MB       214       493

patched (128)

                sort      hash
    ----------------------------
       4MB       331       478
     128MB       222       434


I agree that's pretty nice. I wonder how far we would need to go before
reaching a plateau. I'll try this on the other machine with the temporary
tablespace on SATA, but that'll take longer.

The I/O pattern changed significantly - it's not visible on the charts,
so I'm not attaching them. But the statistics of block sizes and "gaps"
are pretty clear.


size of I/O requests
--------------------

a) master

      type |  bytes  |  count  |  pct   
     ------+---------+---------+--------
      RA   |    8192 | 2905948 |  95.83
      RA   |   24576 |   63470 |   2.09
      RA   |   16384 |   40155 |   1.32
      W    |    8192 |  149295 |  52.85
      W    |   16384 |   51781 |  18.33
      W    |   24576 |   22247 |   7.88
      W    | 1310720 |   15493 |   5.48
      W    |   32768 |   11856 |   4.20

b) patched, 32 blocks

      type |  bytes  | count  |  pct
     ------+---------+--------+--------
      RA   |  131072 | 247686 |  41.75
      RA   |    8192 |  95746 |  16.14
      RA   |   16384 |  82314 |  13.87
      RA   |   32768 |  82146 |  13.85
      RA   |   65536 |  82126 |  13.84
      W    | 1310720 |  16815 |  52.19
      W    |  262144 |   3628 |  11.26
      W    |  524288 |   2791 |   8.66

c) patched, 64 blocks

      type |  bytes  | count  |  pct
     ------+---------+--------+--------
      RA   |  131072 | 213556 |  56.18
      RA   |    8192 |  47663 |  12.54
      RA   |   16384 |  39358 |  10.35
      RA   |   32768 |  39308 |  10.34
      RA   |   65536 |  39304 |  10.34
      W    | 1310720 |  18132 |  65.27
      W    |  524288 |   3722 |  13.40
      W    |  262144 |    581 |   2.09
      W    | 1048576 |    405 |   1.46
      W    |    8192 |    284 |   1.02

d) patched, 128 blocks

      type |  bytes  | count  |  pct
     ------+---------+--------+--------
      RA   |  131072 | 200816 |  70.93
      RA   |    8192 |  23640 |   8.35
      RA   |   16384 |  19324 |   6.83
      RA   |   32768 |  19279 |   6.81
      RA   |   65536 |  19273 |   6.81
      W    | 1310720 |  18000 |  65.91
      W    |  524288 |   2074 |   7.59
      W    | 1048576 |    660 |   2.42
      W    |    8192 |    409 |   1.50
      W    |  786432 |    354 |   1.30

Clearly, the I/O requests are much larger - both reads and writes
shifted from 8kB to much larger ones, and the larger the number of
blocks the more significant the shift is. This means the workload is
getting more "sequential" and the write combining / read-ahead becomes
more effective.


deltas between I/O requests
---------------------------

I'll only show reads to save space, it's about the same for writes.

a) master

      type | block_delta | count  |  pct  
     ------+-------------+--------+-------
      RA   |         256 | 569237 | 18.77
      RA   |         240 | 475182 | 15.67
      RA   |         272 | 437260 | 14.42
      RA   |         224 | 328604 | 10.84
      RA   |         288 | 293628 |  9.68
      RA   |         208 | 199530 |  6.58
      RA   |         304 | 181695 |  5.99
      RA   |         192 | 109472 |  3.61
      RA   |         320 | 105211 |  3.47
      RA   |         336 |  57423 |  1.89

b) patched, 32 blocks

      type | block_delta | count  |  pct  
     ------+-------------+--------+-------
      RA   |         256 | 165071 | 27.82
      RA   |          32 |  82129 | 13.84
      RA   |          64 |  82122 | 13.84
      RA   |         128 |  82077 | 13.83
      RA   |          16 |  82042 | 13.83
      RA   |        7440 |  45168 |  7.61
      RA   |        7952 |   9838 |  1.66

c) patched, 64 blocks

      type | block_delta | count  |  pct
     ------+-------------+--------+-------
      RA   |         256 | 173737 | 45.70
      RA   |          32 |  39301 | 10.34
      RA   |          64 |  39299 | 10.34
      RA   |         128 |  39291 | 10.34
      RA   |          16 |  39250 | 10.32
      RA   |       15120 |  21202 |  5.58
      RA   |       15376 |   4448 |  1.17

d) patched, 128 blocks

      type | block_delta | count  |  pct
     ------+-------------+--------+-------
      RA   |         256 | 180955 | 63.91
      RA   |          32 |  19274 |  6.81
      RA   |          64 |  19273 |  6.81
      RA   |         128 |  19264 |  6.80
      RA   |          16 |  19203 |  6.78
      RA   |       30480 |   9835 |  3.47

The way I understand it, this needs to be interpreted together with
block size stats - in a perfectly sequential workload the two stats
would match. For master that's clearly not the case - the most common
read request size is 8kB, but the most common delta is 128kB (256
sectors, which is the read-ahead for the SSD device). The patched
results are much closer, mostly thanks to switching to 128kB reads.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Trouble with hashagg spill I/O pattern and costing

From
Tomas Vondra
Date:
On Tue, May 19, 2020 at 05:12:02PM +0200, Tomas Vondra wrote:
>
> ...
>
>The problem is that the hashagg plan runs in ~1400 seconds, while the
>groupagg only takes ~360. And per explain analyze, the difference really
>is in the aggregation - if we subtract the seqscan, the sort+groupagg
>takes about 310s:
>
>    ->  GroupAggregate  (cost=41772791.17..43305665.51 rows=6206695 width=36) (actual time=283378.004..335611.192 rows=6398981 loops=1)
>          Group Key: lineitem_1.l_partkey
>          ->  Sort  (cost=41772791.17..42252715.81 rows=191969856 width=9) (actual time=283377.977..306182.393 rows=191969841 loops=1)
>                Sort Key: lineitem_1.l_partkey
>                Sort Method: external merge  Disk: 3569544kB
>                ->  Seq Scan on lineitem lineitem_1  (cost=0.00..5519079.56 rows=191969856 width=9) (actual time=0.019..28253.076 rows=192000551 loops=1)
>
>while the hashagg takes ~1330s:
>
>    ->  HashAggregate  (cost=13977751.34..15945557.39 rows=6206695 width=36) (actual time=202952.170..1354546.897 rows=6400000 loops=1)
>          Group Key: lineitem_1.l_partkey
>          Planned Partitions: 128
>          Peak Memory Usage: 4249 kB
>          Disk Usage: 26321840 kB
>          HashAgg Batches: 16512
>          ->  Seq Scan on lineitem lineitem_1  (cost=0.00..5519079.56 rows=191969856 width=9) (actual time=0.007..22205.617 rows=192000551 loops=1)
>
>And that's while only writing 26GB, compared to 35GB in the sorted plan,
>and with cost being ~16M vs. ~43M (so roughly inverse).
>

I've noticed I've actually made a mistake here - it's not 26GB vs. 35GB
in hash vs. sort, it's 26GB vs. 3.5GB. That is, the sort-based plan
writes out *way less* data to the temp file.

The reason is revealed by explain verbose:

   ->  GroupAggregate
         Output: lineitem_1.l_partkey, (0.2 * avg(lineitem_1.l_quantity))
         Group Key: lineitem_1.l_partkey
         ->  Sort
               Output: lineitem_1.l_partkey, lineitem_1.l_quantity
               Sort Key: lineitem_1.l_partkey
               ->  Seq Scan on public.lineitem lineitem_1
                     Output: lineitem_1.l_partkey, lineitem_1.l_quantity

   ->  HashAggregate
         Output: lineitem_1.l_partkey, (0.2 * avg(lineitem_1.l_quantity))
         Group Key: lineitem_1.l_partkey
         ->  Seq Scan on public.lineitem lineitem_1
               Output: lineitem_1.l_orderkey, lineitem_1.l_partkey,
                       lineitem_1.l_suppkey, lineitem_1.l_linenumber,
                       lineitem_1.l_quantity, lineitem_1.l_extendedprice,
                       lineitem_1.l_discount, lineitem_1.l_tax,
                       lineitem_1.l_returnflag, lineitem_1.l_linestatus,
                       lineitem_1.l_shipdate, lineitem_1.l_commitdate,
                       lineitem_1.l_receiptdate, lineitem_1.l_shipinstruct,
                       lineitem_1.l_shipmode, lineitem_1.l_comment

It seems that in the hashagg case we're not applying projection in the
seqscan, forcing us to serialize way too much data (the whole lineitem
table, essentially).

It's probably still worth tweaking the I/O pattern, I think.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Trouble with hashagg spill I/O pattern and costing

From
Tomas Vondra
Date:
On Thu, May 21, 2020 at 03:41:22PM +0200, Tomas Vondra wrote:
>On Tue, May 19, 2020 at 05:12:02PM +0200, Tomas Vondra wrote:
>>
>>...
>>
>>The problem is that the hashagg plan runs in ~1400 seconds, while the
>>groupagg only takes ~360. And per explain analyze, the difference really
>>is in the aggregation - if we subtract the seqscan, the sort+groupagg
>>takes about 310s:
>>
>>   ->  GroupAggregate  (cost=41772791.17..43305665.51 rows=6206695 width=36) (actual time=283378.004..335611.192 rows=6398981 loops=1)
>>         Group Key: lineitem_1.l_partkey
>>         ->  Sort  (cost=41772791.17..42252715.81 rows=191969856 width=9) (actual time=283377.977..306182.393 rows=191969841 loops=1)
>>               Sort Key: lineitem_1.l_partkey
>>               Sort Method: external merge  Disk: 3569544kB
>>               ->  Seq Scan on lineitem lineitem_1  (cost=0.00..5519079.56 rows=191969856 width=9) (actual time=0.019..28253.076 rows=192000551 loops=1)
>>
>>while the hashagg takes ~1330s:
>>
>>   ->  HashAggregate  (cost=13977751.34..15945557.39 rows=6206695 width=36) (actual time=202952.170..1354546.897 rows=6400000 loops=1)
>>         Group Key: lineitem_1.l_partkey
>>         Planned Partitions: 128
>>         Peak Memory Usage: 4249 kB
>>         Disk Usage: 26321840 kB
>>         HashAgg Batches: 16512
>>         ->  Seq Scan on lineitem lineitem_1  (cost=0.00..5519079.56 rows=191969856 width=9) (actual time=0.007..22205.617 rows=192000551 loops=1)
>>
>>And that's while only writing 26GB, compared to 35GB in the sorted plan,
>>and with cost being ~16M vs. ~43M (so roughly inverse).
>>
>
>I've noticed I've actually made a mistake here - it's not 26GB vs. 35GB
>in hash vs. sort, it's 26GB vs. 3.5GB. That is, the sort-based plan
>writes out *way less* data to the temp file.
>
>The reason is revealed by explain verbose:
>
>  ->  GroupAggregate
>        Output: lineitem_1.l_partkey, (0.2 * avg(lineitem_1.l_quantity))
>        Group Key: lineitem_1.l_partkey
>        ->  Sort
>              Output: lineitem_1.l_partkey, lineitem_1.l_quantity
>              Sort Key: lineitem_1.l_partkey
>              ->  Seq Scan on public.lineitem lineitem_1
>                    Output: lineitem_1.l_partkey, lineitem_1.l_quantity
>
>  ->  HashAggregate
>        Output: lineitem_1.l_partkey, (0.2 * avg(lineitem_1.l_quantity))
>        Group Key: lineitem_1.l_partkey
>        ->  Seq Scan on public.lineitem lineitem_1
>              Output: lineitem_1.l_orderkey, lineitem_1.l_partkey,
>                      lineitem_1.l_suppkey, lineitem_1.l_linenumber,
>                      lineitem_1.l_quantity, lineitem_1.l_extendedprice,
>                      lineitem_1.l_discount, lineitem_1.l_tax,
>                      lineitem_1.l_returnflag, lineitem_1.l_linestatus,
>                      lineitem_1.l_shipdate, lineitem_1.l_commitdate,
>                      lineitem_1.l_receiptdate, lineitem_1.l_shipinstruct,
>                      lineitem_1.l_shipmode, lineitem_1.l_comment
>
>It seems that in the hashagg case we're not applying projection in the
>seqscan, forcing us to serialize way too much data (the whole lineitem
>table, essentially).
>
>It's probably still worth tweaking the I/O pattern, I think.
>

OK, it seems the attached trivial fix (simply changing CP_LABEL_TLIST to
CP_SMALL_TLIST) addresses this for me. I've only tried it on the patched
version that pre-allocates 128 blocks, and the results seem pretty nice:

                sort      hash      hash+tlist
    ------------------------------------------
       4MB       331       478            188
     128MB       222       434            210

which I guess is what we wanted ...

I'll give it a try on the other machine (temp on SATA), but I don't see
why it would not behave similarly nicely.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Trouble with hashagg spill I/O pattern and costing

From
Tomas Vondra
Date:
On Thu, May 21, 2020 at 02:12:55AM +0200, Tomas Vondra wrote:
>
> ...
>
>I agree that's pretty nice. I wonder how far would we need to go before
>reaching a plateau. I'll try this on the other machine with temporary
>tablespace on SATA, but that'll take longer.
>

OK, I've managed to get some numbers from the other machine, with the 75GB
data set and temp tablespace on SATA RAID. I haven't collected I/O data
using iosnoop this time, because we already know from the other machine how
that changes. I've also only done this with 128MB work_mem, because of how
long a single run takes, and with 128-block pre-allocation.

The patched+tlist column means both the pre-allocation patch and the tlist
tweak I posted to this thread a couple of minutes ago:

                master       patched       patched+tlist
    -----------------------------------------------------
     sort         485            472                 462
     hash       24686           3060                 559

So the pre-allocation makes it 10x faster, and the tlist tweak makes it
5x faster. Not bad, I guess.

Note: I've slightly tweaked read-ahead on the RAID device(s) on those
patched runs, but the effect was pretty negligible (compared to other
patched runs with the old read-ahead setting).


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Trouble with hashagg spill I/O pattern and costing

From
Robert Haas
Date:
On Thu, May 21, 2020 at 10:45 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> So the pre-allocation makes it 10x faster, and the tlist tweak makes it
> 5x faster. Not bad, I guess.

That is pretty great stuff, Tomas.

FWIW, I agree that CP_SMALL_TLIST seems like the right thing here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Trouble with hashagg spill I/O pattern and costing

From
Jeff Davis
Date:
On Thu, 2020-05-21 at 16:30 +0200, Tomas Vondra wrote:
> OK, it seems the attached trivial fix (simply changing CP_LABEL_TLIST
> to
> CP_SMALL_TLIST) addresses this for me.

Great!

There were a couple plan changes where it introduced a Subquery Scan.
I'm not sure that I understand why it's doing that, can you verify that
it is a reasonable thing to do?

Aside from that, feel free to commit it.

Regards,
    Jeff Davis





Re: Trouble with hashagg spill I/O pattern and costing

From
Tomas Vondra
Date:
On Thu, May 21, 2020 at 11:19:01AM -0700, Jeff Davis wrote:
>On Thu, 2020-05-21 at 16:30 +0200, Tomas Vondra wrote:
>> OK, it seems the attached trivial fix (simply changing CP_LABEL_TLIST
>> to
>> CP_SMALL_TLIST) addresses this for me.
>
>Great!
>
>There were a couple plan changes where it introduced a Subquery Scan.
>I'm not sure that I understand why it's doing that, can you verify that
>it is a reasonable thing to do?
>
>Aside from that, feel free to commit it.
>

It's doing that because we're doing projection everywhere, even in cases
when it may not be necessary - but I think that's actually OK.

At first I thought we might only do it conditionally when we expect to
spill to disk, but that wouldn't work for cases where we only realize we
need to spill during execution.

So I think the plan changes are correct and expected.

I think we should do the pre-allocation patch too. I haven't tried it yet,
but I believe the tlist fix alone won't do nearly as well.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Trouble with hashagg spill I/O pattern and costing

From
Tomas Vondra
Date:
On Thu, May 21, 2020 at 08:34:05PM +0200, Tomas Vondra wrote:
>On Thu, May 21, 2020 at 11:19:01AM -0700, Jeff Davis wrote:
>
> ...
>
>I think we should do the pre-allocation patch too. I haven't tried yet
>but I believe the tlist fix alone won't do nearly as well.
>

I've done some measurements on the smaller (SSD) machine, and the
comparison looks like this:

             sort   hash   hash+prealloc+tlist   hash+tlist
    --------------------------------------------------------
       4MB    331    478                   188          330
     128MB    222    434                   210          350


The last column is master with the tlist tweak alone - it's better than
hashagg on master alone, but it's not nearly as good as with both tlist
and prealloc patches.

I can't test this on the larger box with SATA temporary tablespace at
the moment (other tests are running), but I believe the difference will
be even more pronounced there.

I don't think we're under a lot of pressure - beta1 is out anyway, so we
have time to do proper testing first.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Trouble with hashagg spill I/O pattern and costing

From
Jeff Davis
Date:
On Thu, 2020-05-21 at 20:54 +0200, Tomas Vondra wrote:
> The last column is master with the tlist tweak alone - it's better
> than
> hashagg on master alone, but it's not nearly as good as with both
> tlist
> and prealloc patches.

Right, I certainly think we should do the prealloc change, as well.

I'm tweaking the patch to be a bit more flexible. I'm thinking we
should start the preallocation list size ~8 and then double it up to
~128 (depending on your results). That would reduce the waste in case
we have a large number of small partitions.

Regards,
    Jeff Davis





Re: Trouble with hashagg spill I/O pattern and costing

From
Tomas Vondra
Date:
On Tue, May 19, 2020 at 09:15:40PM -0700, Jeff Davis wrote:
>On Tue, 2020-05-19 at 19:53 +0200, Tomas Vondra wrote:
>>
>> And is there a way to pre-allocate larger chunks? Presumably we could
>> assign the blocks to a tape in larger chunks (e.g. 128kB, i.e. 16 x
>> 8kB)
>> instead of just a single block. I haven't seen anything like that in
>> logtape.c, though ...
>
>It turned out to be simple (at least a POC) so I threw together a
>patch. I just added a 32-element array of block numbers to each tape.
>When we need a new block, we retrieve a block number from that array;
>or if it's empty, we fill it by calling ltsGetFreeBlock() 32 times.
>

I think the PoC patch goes in the right direction. I have two ideas how
to improve it a bit:

1) Instead of assigning the pages one by one, we can easily extend the
API to allow getting a range of blocks, so that we don't need to call
ltsGetFreeBlock in a loop. Instead we could call ltsGetFreeBlockRange
with the requested number of blocks. And we could keep just a min/max
of free blocks, not an array with a fixed number of elements.

2) We could make it self-tuning, by increasing the number of blocks
we pre-allocate. So every time we exhaust the range, we double the
number of blocks (with a reasonable maximum, like 1024 or so). Or we
might just increment it by 32, or something.
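
Just to be explicit about the growth policy I have in mind, something like
this (a toy sketch - the constants are only examples, starting from the 32
blocks the current PoC uses):

#include <stdio.h>

#define PREALLOC_START  32      /* what the current PoC pre-allocates */
#define PREALLOC_MAX    1024    /* "a reasonable maximum" */

/* next batch size, after the current pre-allocated range is exhausted */
static int
next_prealloc_size(int current)
{
    if (current == 0)
        return PREALLOC_START;
    if (current * 2 > PREALLOC_MAX)
        return PREALLOC_MAX;
    return current * 2;
}

int
main(void)
{
    int     size = 0;

    /* growth sequence for a tape that keeps spilling */
    for (int i = 0; i < 8; i++)
    {
        size = next_prealloc_size(size);
        printf("batch %d: pre-allocate %d blocks\n", i + 1, size);
    }

    return 0;
}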

IIUC the danger of pre-allocating blocks is that we might not fill them,
resulting in a temp file much larger than necessary. It might be harmless
on some (most?) current filesystems that don't actually allocate space
for blocks that are never written, but it also confuses our accounting
of temporary file sizes. So we should try to limit that, and growing the
number of pre-allocated blocks over time seems reasonable.

Both (1) and (2) seem fairly simple, not much more complex than the
current PoC patch.

I also wonder if we could collect / report useful statistics about I/O
on the temporary file, not just the size. I mean, how many pages we've
written/read, how sequential it was, etc. But some of that is probably
only visible at the OS level (e.g. we have no insight into how the
kernel combines writes in the page cache, etc.). This is clearly a matter
for v14, though.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Trouble with hashagg spill I/O pattern and costing

From
Tomas Vondra
Date:
On Thu, May 21, 2020 at 12:04:19PM -0700, Jeff Davis wrote:
>On Thu, 2020-05-21 at 20:54 +0200, Tomas Vondra wrote:
>> The last column is master with the tlist tweak alone - it's better
>> than
>> hashagg on master alone, but it's not nearly as good as with both
>> tlist
>> and prealloc patches.
>
>Right, I certainly think we should do the prealloc change, as well.
>
>I'm tweaking the patch to be a bit more flexible. I'm thinking we
>should start the preallocation list size ~8 and then double it up to
>~128 (depending on your results). That would reduce the waste in case
>we have a large number of small partitions.
>

You're reading my mind ;-)

I don't think 128 is necessarily the maximum we should use - it's just
that I haven't tested higher values. I wouldn't be surprised if higher
values made it a bit faster. But we can test and tune that; I agree with
growing the number of pre-allocated blocks over time.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Trouble with hashagg spill I/O pattern and costing

From
Jeff Davis
Date:
On Thu, 2020-05-21 at 21:13 +0200, Tomas Vondra wrote:
> 1) Instead of assigning the pages one by one, we can easily extend
> the
> API to allow getting a range of blocks, so that we don't need to call
> ltsGetFreeBlock in a loop. Instead we could call ltsGetFreeBlockRange
> with the requested number of blocks.

ltsGetFreeBlock() just draws one element from a minheap. Is there some
more efficient way to get many elements from a minheap at once?

>  And we could keep just a min/max
> of free blocks, not an array with a fixed number of elements.

I don't quite know what you mean. Can you elaborate?

Regards,
    Jeff Davis





Re: Trouble with hashagg spill I/O pattern and costing

From
Tomas Vondra
Date:
On Thu, May 21, 2020 at 12:40:23PM -0700, Jeff Davis wrote:
>On Thu, 2020-05-21 at 21:13 +0200, Tomas Vondra wrote:
>> 1) Instead of assigning the pages one by one, we can easily extend
>> the
>> API to allow getting a range of blocks, so that we don't need to call
>> ltsGetFreeBlock in a loop. Instead we could call ltsGetFreeBlockRange
>> with the requested number of blocks.
>
>ltsGetFreeBlock() just draws one element from a minheap. Is there some
>more efficient way to get many elements from a minheap at once?
>
>>  And we could keep just a min/max
>> of free blocks, not an array with a fixed number of elements.
>
>I don't quite know what you mean. Can you elaborate?
>

Ah, I forgot there's an internal minheap thing - I thought we were just
incrementing some internal counter or something like that, but with the
minheap we can't just get a range of blocks. So just disregard that,
you're right that we need the array.



regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Trouble with hashagg spill I/O pattern and costing

From
Jeff Davis
Date:
On Thu, 2020-05-21 at 21:13 +0200, Tomas Vondra wrote:
> 2) We could make it self-tuning, by increasing the number of blocks
> we pre-allocate. So every time we exhaust the range, we double the
> number of blocks (with a reasonable maximum, like 1024 or so). Or we
> might just increment it by 32, or something.

Attached a new version that uses the doubling behavior, and cleans it
up a bit. It also returns the unused prealloc blocks back to
lts->freeBlocks when the tape is rewound for reading.

> IIUC the danger of pre-allocating blocks is that we might not fill
> them,
> resulting in temp file much larger than necessary. It might be
> harmless
> on some (most?) current filesystems that don't actually allocate
> space
> for blocks that are never written, but it also confuses our
> accounting
> of temporary file sizes. So we should try to limit that, and growing
> the
> number of pre-allocated blocks over time seems reasonable.

There's another danger here: it doesn't matter how well the filesystem
deals with sparse writes, because ltsWriteBlock fills in the holes with
zeros anyway. That's potentially a significant amount of wasted IO
effort if we aren't careful.

Regards,
    Jeff Davis


Attachment

Re: Trouble with hashagg spill I/O pattern and costing

From
Tomas Vondra
Date:
On Thu, May 21, 2020 at 02:16:37PM -0700, Jeff Davis wrote:
>On Thu, 2020-05-21 at 21:13 +0200, Tomas Vondra wrote:
>> 2) We could make it self-tuning, by increasing the number of blocks
>> we pre-allocate. So every time we exhaust the range, we double the
>> number of blocks (with a reasonable maximum, like 1024 or so). Or we
>> might just increment it by 32, or something.
>
>Attached a new version that uses the doubling behavior, and cleans it
>up a bit. It also returns the unused prealloc blocks back to lts-
>freeBlocks when the tape is rewound for reading.
>

Ah, the returning is a nice idea, that should limit the overhead quite a
bit, I think.

>> IIUC the danger of pre-allocating blocks is that we might not fill
>> them,
>> resulting in temp file much larger than necessary. It might be
>> harmless
>> on some (most?) current filesystems that don't actually allocate
>> space
>> for blocks that are never written, but it also confuses our
>> accounting
>> of temporary file sizes. So we should try to limit that, and growing
>> the
>> number of pre-allocated blocks over time seems reasonable.
>
>There's another danger here: it doesn't matter how well the filesystem
>deals with sparse writes, because ltsWriteBlock fills in the holes with
>zeros anyway. That's potentially a significant amount of wasted IO
>effort if we aren't careful.
>

True. I'll give it a try on both machines and report some numbers. Might
take a couple of days.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Trouble with hashagg spill I/O pattern and costing

From
Tomas Vondra
Date:
On Thu, May 21, 2020 at 11:41:22PM +0200, Tomas Vondra wrote:
>On Thu, May 21, 2020 at 02:16:37PM -0700, Jeff Davis wrote:
>>On Thu, 2020-05-21 at 21:13 +0200, Tomas Vondra wrote:
>>>2) We could make it self-tuning, by increasing the number of blocks
>>>we pre-allocate. So every time we exhaust the range, we double the
>>>number of blocks (with a reasonable maximum, like 1024 or so). Or we
>>>might just increment it by 32, or something.
>>
>>Attached a new version that uses the doubling behavior, and cleans it
>>up a bit. It also returns the unused prealloc blocks back to lts-
>>freeBlocks when the tape is rewound for reading.
>>
>
>Ah, the returning is a nice idea, that should limit the overhead quite a
>bit, I think.
>
>>>IIUC the danger of pre-allocating blocks is that we might not fill
>>>them,
>>>resulting in temp file much larger than necessary. It might be
>>>harmless
>>>on some (most?) current filesystems that don't actually allocate
>>>space
>>>for blocks that are never written, but it also confuses our
>>>accounting
>>>of temporary file sizes. So we should try to limit that, and growing
>>>the
>>>number of pre-allocated blocks over time seems reasonable.
>>
>>There's another danger here: it doesn't matter how well the filesystem
>>deals with sparse writes, because ltsWriteBlock fills in the holes with
>>zeros anyway. That's potentially a significant amount of wasted IO
>>effort if we aren't careful.
>>
>
>True. I'll give it a try on both machines and report some numbers. Might
>take a couple of days.
>

OK, so I do have some numbers to share. I think there's a clear
conclusion that the two patches are a huge improvement, but there's also
something fishy about planning of parallel queries.

Firstly, I have two machines that I used for testing:

1) small one: i5-2500k (4 cores), 8GB RAM, SSD RAID for data, SSD for
temporary tablespace, using TPC-H 32GB data set

2) big one: 2x xeon e5-2620v3 (8 cores), 64GB RAM, NVME SSD for data,
temporary tablespace on SATA RAID0 (3 x 7.2k), using TPC-H 75GB


serial queries (no parallelism)
===============================

Results with parallel query disabled on the two machines look like this:

1) small one (SSD)

     algorithm  master  prealloc  tlist  prealloc-tlist
     --------------------------------------------------
          hash    1365       437    368             213
          sort     226       214    224             215

The sort row simply means "enable_hashagg = off", and AFAIK the patches
should not have much influence here - the prealloc has some effect, but
it's fairly negligible.

It's not always exactly on par - I've seen cases where hash or sort was
a bit faster (probably depending on work_mem) - but I think we can ignore
that for now.


2) big one (SATA)

     algorithm  master  tlist  prealloc  prealloc+tlist
     --------------------------------------------------
          hash   25534   5120      2402             540
          sort     460    460       465             485

The effect is even more pronounced, thanks to poor handling of random
I/O by the SATA RAID device. It's not exactly on par with sort, but it's
close enough ...


parallel queries
================

And now the fun begins ...


1) small one (SSD, max_parallel_workers_per_gather = 2)

     algorithm  master  tlist  prealloc  prealloc+tlist
     --------------------------------------------------
          hash   693      390       177             128
          sort   103       99       101              99

This looks pretty nice - the patches have the expected effect, it got
faster than with just a single CPU etc.


2) big one (SATA, max_parallel_workers_per_gather = 16)

     algorithm  master  tlist  prealloc  prealloc+tlist
     --------------------------------------------------
          hash       ?  25000         ?            3132
          sort     248    234       216             200

Well, not that nice :-( The hash queries take so much time that I've
decided not to wait for them and the two numbers are actually just
estimates (after processing just a couple of logical tapes).

Plus it actually gets slower than with serial execution, so what's the
problem here? Especially considering it worked OK on the small machine?

At first I thought it was something about SSD vs. SATA, but it seems to be
more about how we construct the plans, because the plans on the two
machines are very different. And it seems to depend on the number of
workers per gather - for a low number of workers the plan looks like this
(the plans are attached in plans.txt in case the formatting gets broken
by your client):


                                                       QUERY PLAN
     ---------------------------------------------------------------------------------------------------------------
      Limit
        ->  Aggregate
              ->  Hash Join
                    Hash Cond: (part.p_partkey = lineitem_1.l_partkey)
                    Join Filter: (lineitem.l_quantity < ((0.2 * avg(lineitem_1.l_quantity))))
                    ->  Gather
                          Workers Planned: 2
                          ->  Nested Loop
                                ->  Parallel Seq Scan on part
                                      Filter: ((p_brand = 'Brand#22'::bpchar) AND (p_container = 'LG BOX'::bpchar))
                                ->  Index Scan using idx_lineitem_part_supp on lineitem
                                      Index Cond: (l_partkey = part.p_partkey)
                    ->  Hash
                          ->  Finalize HashAggregate
                                Group Key: lineitem_1.l_partkey
                                ->  Gather
                                      Workers Planned: 2
                                      ->  Partial HashAggregate
                                            Group Key: lineitem_1.l_partkey
                                            ->  Parallel Seq Scan on lineitem lineitem_1
     (20 rows)

but then if I crank the number of workers up, it switches to this:

                                                          QUERY PLAN
     ---------------------------------------------------------------------------------------------------------------------
      Limit
        ->  Finalize Aggregate
              ->  Gather
                    Workers Planned: 5
                    ->  Partial Aggregate
                          ->  Nested Loop
                                Join Filter: (part.p_partkey = lineitem.l_partkey)
                                ->  Hash Join
                                      Hash Cond: (part.p_partkey = lineitem_1.l_partkey)
                                      ->  Parallel Seq Scan on part
                                            Filter: ((p_brand = 'Brand#22'::bpchar) AND (p_container = 'LG BOX'::bpchar))
                                      ->  Hash
                                            ->  HashAggregate
                                                  Group Key: lineitem_1.l_partkey
                                                  ->  Seq Scan on lineitem lineitem_1
                                ->  Index Scan using idx_lineitem_part_supp on lineitem
                                      Index Cond: (l_partkey = lineitem_1.l_partkey)
                                      Filter: (l_quantity < ((0.2 * avg(lineitem_1.l_quantity))))
     (18 rows)


Notice that in the first plan, the hashagg is on top of a parallel-aware
path - so each worker builds a hashagg only on a subset of the data, and
also spills only a fraction of the input rows (so that all workers combined
spill roughly the "whole" table).

In the second plan, the hashagg is on the non-partitioned side of the
join, so each worker builds a hash aggregate on the *whole* set of
input rows. Which means that (a) we need much more disk space for temp
files, making it unlikely to fit into page cache, and (b) there's a lot
of contention for I/O, making it much more random.
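
To make this easy to reproduce, the flip needs nothing more than the
session-level GUC - a rough sketch, with the query being the usual TPC-H
Q17 as reconstructed from the plans above (so treat the exact text as
approximate):

     -- low worker count - the "good" shape
     SET max_parallel_workers_per_gather = 2;

     EXPLAIN (COSTS OFF)
     SELECT sum(l_extendedprice) / 7.0 AS avg_yearly
       FROM lineitem, part
      WHERE p_partkey = l_partkey
        AND p_brand = 'Brand#22'
        AND p_container = 'LG BOX'
        AND l_quantity < (SELECT 0.2 * avg(l_quantity)
                            FROM lineitem
                           WHERE l_partkey = p_partkey);

     -- cranked up - switches to the "bad" shape
     SET max_parallel_workers_per_gather = 16;
     -- (same EXPLAIN again)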

Now, I haven't seen the second plan with sort-based aggregation - no
matter how I set the number of workers, it always looks like this:

                                                          QUERY PLAN
     ---------------------------------------------------------------------------------------------------------------------
      Limit
        ->  Aggregate
              ->  Merge Join
                    Merge Cond: (lineitem_1.l_partkey = part.p_partkey)
                    Join Filter: (lineitem.l_quantity < ((0.2 * avg(lineitem_1.l_quantity))))
                    ->  Finalize GroupAggregate
                          Group Key: lineitem_1.l_partkey
                          ->  Gather Merge
                                Workers Planned: 8
                                ->  Partial GroupAggregate
                                      Group Key: lineitem_1.l_partkey
                                      ->  Sort
                                            Sort Key: lineitem_1.l_partkey
                                            ->  Parallel Seq Scan on lineitem lineitem_1
                    ->  Materialize
                          ->  Gather Merge
                                Workers Planned: 6
                                ->  Nested Loop
                                      ->  Parallel Index Scan using part_pkey on part
                                            Filter: ((p_brand = 'Brand#22'::bpchar) AND (p_container = 'LG BOX'::bpchar))
                                      ->  Index Scan using idx_lineitem_part_supp on lineitem
                                            Index Cond: (l_partkey = part.p_partkey)
     (22 rows)

How come we don't have the same issue here? Is there something in the
optimizer that prevents us from creating the "silly" plans with groupagg,
and should we do the same thing for hashagg?


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Trouble with hashagg spill I/O pattern and costing

From
Tomas Vondra
Date:
On Mon, May 25, 2020 at 04:10:45AM +0200, Tomas Vondra wrote:
>
> ...
>
>parallel queries
>================
>
>And now the fun begins ...
>
>
>1) small one (SSD, max_parallel_workers_per_gather = 2)
>
>    algorithm  master  tlist  prealloc  prealloc+tlist
>    --------------------------------------------------
>         hash   693      390       177             128
>         sort   103       99       101              99
>
>This looks pretty nice - the patches have the expected effect, it got
>faster than with just a single CPU etc.
>
>
>2) big one (SATA, max_parallel_workers_per_gather = 16)
>
>    algorithm  master  tlist  prealloc  prealloc+tlist
>    --------------------------------------------------
>         hash       ?  25000         ?            3132
>         sort     248    234       216             200
>
>Well, not that nice :-( The hash queries take so much time that I've
>decided not to wait for them and the two numbers are actually just
>estimates (after processing just a couple of logical tapes).
>
>Plus it actually gets slower than with serial execution, so what's the
>problem here? Especially considering it worked OK on the small machine?
>
>At first I thought it was something about SSD vs. SATA, but it seems to be
>more about how we construct the plans, because the plans on the two
>machines are very different. And it seems to depend on the number of
>workers per gather - for a low number of workers the plan looks like this
>(the plans are attached in plans.txt in case the formatting gets broken
>by your client):
>
>
>                                                      QUERY PLAN
>    ---------------------------------------------------------------------------------------------------------------
>     Limit
>       ->  Aggregate
>             ->  Hash Join
>                   Hash Cond: (part.p_partkey = lineitem_1.l_partkey)
>                   Join Filter: (lineitem.l_quantity < ((0.2 * avg(lineitem_1.l_quantity))))
>                   ->  Gather
>                         Workers Planned: 2
>                         ->  Nested Loop
>                               ->  Parallel Seq Scan on part
>                                     Filter: ((p_brand = 'Brand#22'::bpchar) AND (p_container = 'LG BOX'::bpchar))
>                               ->  Index Scan using idx_lineitem_part_supp on lineitem
>                                     Index Cond: (l_partkey = part.p_partkey)
>                   ->  Hash
>                         ->  Finalize HashAggregate
>                               Group Key: lineitem_1.l_partkey
>                               ->  Gather
>                                     Workers Planned: 2
>                                     ->  Partial HashAggregate
>                                           Group Key: lineitem_1.l_partkey
>                                           ->  Parallel Seq Scan on lineitem lineitem_1
>    (20 rows)
>
>but then if I crank the number of workers up, it switches to this:
>
>                                                         QUERY PLAN
>    ---------------------------------------------------------------------------------------------------------------------
>     Limit
>       ->  Finalize Aggregate
>             ->  Gather
>                   Workers Planned: 5
>                   ->  Partial Aggregate
>                         ->  Nested Loop
>                               Join Filter: (part.p_partkey = lineitem.l_partkey)
>                               ->  Hash Join
>                                     Hash Cond: (part.p_partkey = lineitem_1.l_partkey)
>                                     ->  Parallel Seq Scan on part
>                                           Filter: ((p_brand = 'Brand#22'::bpchar) AND (p_container = 'LG BOX'::bpchar))
>                                     ->  Hash
>                                           ->  HashAggregate
>                                                 Group Key: lineitem_1.l_partkey
>                                                 ->  Seq Scan on lineitem lineitem_1
>                               ->  Index Scan using idx_lineitem_part_supp on lineitem
>                                     Index Cond: (l_partkey = lineitem_1.l_partkey)
>                                     Filter: (l_quantity < ((0.2 * avg(lineitem_1.l_quantity))))
>    (18 rows)
>
>
>Notice that in the first plan, the hashagg is on top of a parallel-aware
>path - so each worker builds a hashagg only on a subset of the data, and
>also spills only a fraction of the input rows (so that all workers
>combined spill roughly the "whole" table).
>

OK, I've done an experiment and re-ran the test with

     max_parallel_workers_per_gather = 5

which is the highest value still giving the "good" plan, and the results
look like this:

            master    tlist    prealloc    prealloc+tlist
     ----------------------------------------------------
     hash    10535     1044        1723               407
     sort      198      196         192               219

which is obviously *way* better than the numbers with more workers:

>    algorithm  master  tlist  prealloc  prealloc+tlist
>    --------------------------------------------------
>         hash       ?  25000         ?            3132
>         sort     248    234       216             200

It's still ~2x slower than the sort, so presumably we'll need to tweak
the costing somehow. I do believe this is still due to differences in I/O
patterns, with parallel hashagg probably being a bit more random (I'm
deducing that from SSD not being affected by this).

I'd imagine this is because, given the same work_mem value, sort tends to
create "sorted chunks" that are then merged into larger runs, making the
I/O more sequential. OTOH hashagg likely gets more random with smaller
work_mem values - more batches make the I/O more interleaved / random.
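
The spill behavior itself is easy to observe directly on the inner
aggregate - a rough sketch (the work_mem value is just an example, not
what the benchmarks used):

     SET work_mem = '4MB';

     -- hash aggregation: the HashAggregate node reports the number of
     -- batches and the disk usage in EXPLAIN ANALYZE
     SET enable_hashagg = on;
     EXPLAIN (ANALYZE, COSTS OFF)
     SELECT l_partkey, 0.2 * avg(l_quantity)
       FROM lineitem GROUP BY l_partkey;

     -- sort-based aggregation: the same query goes through Sort +
     -- GroupAggregate and reports an external merge sort instead
     SET enable_hashagg = off;
     EXPLAIN (ANALYZE, COSTS OFF)
     SELECT l_partkey, 0.2 * avg(l_quantity)
       FROM lineitem GROUP BY l_partkey;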


This does not explain why we end up with the "bad" plans, though.

Attached are two files showing how the plan changes for different numbers
of workers per gather, both for groupagg and hashagg. For groupagg the
plan shape does not change at all, while for hashagg it starts as "good"
and then switches to the "bad" one somewhere between 5 and 6 workers.

There's another interesting thing I just noticed - as we increase the
number of workers, the cost estimate actually starts to grow at some
point:

     workers | plan cost
     --------+-----------
           0 |  23594267
           1 |  20155545
           2 |  19785306
           5 |  22718176  <-
           6 |  23063639
          10 |  22990594
          12 |  22972363

AFAIK this happens because we pick the number of workers simply based on
the size of the input relation, which ignores the cost of sending data
from the workers to the leader (parallel_tuple_cost). Which in this case
is quite significant, because each worker produces a large number of
groups. I don't think this is causing the issue, though, because the sort
plans behave the same way. (I wonder if we could/should consider a
different number of workers, somehow.)

We probably can't see these plans on 12 simply because hashagg would
need more memory than work_mem (especially in parallel mode), so we
simply reject them.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Trouble with hashagg spill I/O pattern and costing

From
Jeff Davis
Date:
On Mon, 2020-05-25 at 04:10 +0200, Tomas Vondra wrote:
>      algorithm  master  prealloc  tlist  prealloc-tlist
>      --------------------------------------------------
>           hash    1365       437    368             213
>           sort     226       214    224             215
> 
> The sort row simply means "enable_hashagg = off" and AFAIK the
> patches
> should not have a lot of influence here - the prealloc does, but it's
> fairly negligible.

I also saw a small speedup from the prealloc patch for Sort. I wrote it
off initially, but I'm wondering if there's something going on there.
Perhaps drawing K elements from the minheap at once is better for
caching? If so, that's good news, because it means the prealloc list is
a win-win.

>                           ->  Finalize HashAggregate
>                                 Group Key: lineitem_1.l_partkey
>                                 ->  Gather
>                                       Workers Planned: 2
>                                       ->  Partial HashAggregate
>                                             Group Key:
> lineitem_1.l_partkey
>                                             ->  Parallel Seq Scan on
> lineitem lineitem_1
>      (20 rows)

Although each worker here only gets half the tuples, it will get
(approximately) all of the *groups*. This may partly explain why the
planner moves away from this plan when there are more workers: the
number of hashagg batches doesn't go down much with more workers.

It also might be interesting to know the estimate for the number of
groups relative to the size of the table. If those two are close, it
might look to the planner like processing the whole input in each
worker isn't much worse than processing all of the groups in each
worker.
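
Something like this shows how close those two estimates are (just a
sketch that reads the planner statistics, assuming the stock TPC-H table
and column names):

     -- planner's ndistinct estimate for the grouping column
     -- (a negative value means a fraction of reltuples)
     SELECT n_distinct
       FROM pg_stats
      WHERE tablename = 'lineitem' AND attname = 'l_partkey';

     -- estimated number of rows in the table
     SELECT reltuples FROM pg_class WHERE relname = 'lineitem';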

Regards,
    Jeff Davis





Re: Trouble with hashagg spill I/O pattern and costing

From
Jeff Davis
Date:
On Mon, 2020-05-25 at 14:17 +0200, Tomas Vondra wrote:
> It's still ~2x slower than the sort, so presumably we'll need to
> tweak
> the costing somehow.

One thing to think about is that the default random_page_cost is only
4X seq_page_cost. We know that's complete fiction, but it's meant to
paper over the OS caching effects. It seems like that shortcut may be
what's hurting us now.

HashAgg counts 1/2 of the page accesses as random, whereas Sort only
counts 1/4 as random. If the random_page_cost were closer to reality,
HashAgg would already be penalized substantially. It might be
interesting to test with higher values of random_page_cost and see what
the planner does.
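
For example (a sketch - 40 is just an arbitrary "closer to reality for
slow-seek media" value, the default being 4.0):

     SET random_page_cost = 40;
     -- re-run the EXPLAIN for the query in question and compare the plan
     -- shape, then put things back:
     RESET random_page_cost;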

If we want to be a bit more conservative, I'm fine with adding a
general penalty against a HashAgg that we expect to spill (multiply the
disk costs by some factor). We can consider removing the penalty in
v14.

> I do believe this is still due to differences in I/O
> patterns, with parallel hashagg probably being a bit more random (I'm
> deducing that from SSD not being affected by this).

Do you think the difference in IO patterns is due to a difference in
handling reads vs. writes in the kernel? Or do you think that 128
blocks is not enough to amortize the cost of a seek for that device?

Regards,
    Jeff Davis





Re: Trouble with hashagg spill I/O pattern and costing

From
Tomas Vondra
Date:
On Mon, May 25, 2020 at 11:36:42AM -0700, Jeff Davis wrote:
>On Mon, 2020-05-25 at 04:10 +0200, Tomas Vondra wrote:
>>      algorithm  master  prealloc  tlist  prealloc-tlist
>>      --------------------------------------------------
>>           hash    1365       437    368             213
>>           sort     226       214    224             215
>>
>> The sort row simply means "enable_hashagg = off" and AFAIK the
>> patches
>> should not have a lot of influence here - the prealloc does, but it's
>> fairly negligible.
>
>I also saw a small speedup from the prealloc patch for Sort. I wrote it
>off initially, but I'm wondering if there's something going on there.
>Perhaps drawing K elements from the minheap at once is better for
>caching? If so, that's good news, because it means the prealloc list is
>a win-win.
>

True.


>>                           ->  Finalize HashAggregate
>>                                 Group Key: lineitem_1.l_partkey
>>                                 ->  Gather
>>                                       Workers Planned: 2
>>                                       ->  Partial HashAggregate
>>                                             Group Key:
>> lineitem_1.l_partkey
>>                                             ->  Parallel Seq Scan on
>> lineitem lineitem_1
>>      (20 rows)
>
>Although each worker here only gets half the tuples, it will get
>(approximately) all of the *groups*. This may partly explain why the
>planner moves away from this plan when there are more workers: the
>number of hashagg batches doesn't go down much with more workers.
>
>It also might be interesting to know the estimate for the number of
>groups relative to the size of the table. If those two are close, it
>might look to the planner like processing the whole input in each
>worker isn't much worse than processing all of the groups in each
>worker.
>

Hmmm, yeah. The number of groups per worker is another moving part. But
isn't the number of groups per worker pretty much the same (and equal to
the total number of groups) in all plans? I mean all the plans (both
hash and sort based) have this:

   ->  Finalize HashAggregate  (cost=18313351.98..18654393.43 rows=14949762 width=36)
         Group Key: lineitem_1.l_partkey
         Planned Partitions: 64
         ->  Gather  (cost=14993231.96..17967638.74 rows=14949762 width=36)
               Workers Planned: 1
               ->  Partial HashAggregate  (cost=14992231.96..16471662.54 rows=14949762 width=36)
                     Group Key: lineitem_1.l_partkey
                     Planned Partitions: 64
                     ->  Parallel Seq Scan on lineitem lineitem_1  (cost=0.00..11083534.91 rows=264715991 width=9)

I think it's rather that we actually expect this number of rows from
each worker, so the total cost is

     parallel_tuple_cost * num_of_workers * num_of_groups

But we just ignore this inherent cost when picking the number of
workers, don't we? Because we don't know how many rows will be produced
and passed to the Gather in the end.
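
Just to put a very rough number on that - assuming the default
parallel_tuple_cost of 0.1, two workers each expected to emit the ~15M
groups estimated above would contribute something like

     0.1 * 2 * 15,000,000 = 3,000,000

to the plan cost, which is in the same ballpark as the cost differences I
posted earlier for different worker counts - yet that term plays no role
in how many workers we pick.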


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Trouble with hashagg spill I/O pattern and costing

From
Tomas Vondra
Date:
On Mon, May 25, 2020 at 12:49:45PM -0700, Jeff Davis wrote:
>On Mon, 2020-05-25 at 14:17 +0200, Tomas Vondra wrote:
>> It's still ~2x slower than the sort, so presumably we'll need to
>> tweak
>> the costing somehow.
>
>One thing to think about is that the default random_page_cost is only
>4X seq_page_cost. We know that's complete fiction, but it's meant to
>paper over the OS caching effects. It seems like that shortcut may be
>what's hurting us now.
>
>HashAgg counts 1/2 of the page accesses as random, whereas Sort only
>counts 1/4 as random. If the random_page_cost were closer to reality,
>HashAgg would already be penalized substantially. It might be
>interesting to test with higher values of random_page_cost and see what
>the planner does.
>

Hmmm, good point. I've tried bumping the random_page_cost up (to 40) and
the bogus plans disappeared - instead, the parallel plans switch from
hashagg to groupagg at 4 workers. There are a lot of parameters affecting
this, though (e.g. higher work_mem -> more workers use hashagg).

One thing annoys me, though. The costing just ignores costs set on the
(temporary) tablespace, which confused me at first because

     ALTER TABLESPACE temp SET (random_page_cost = 40);

had absolutely no effect. It's not the fault of this patch and it's
actually understandable (we can have multiple temp tablespaces, and we 
don't know which one will end up being used).

I wonder if it would be a good idea to have a "temp" version of those cost
variables, applied to all temporary tablespaces ... Currently I have to
do the inverse thing - tweak this for all the regular tablespaces.
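
To spell that out (a sketch, with a made-up tablespace name - and AFAICS
the spill costing looks at the plain GUC, not at any tablespace options):

     -- what I'd like to do, but it has no effect on the temp file costing:
     ALTER TABLESPACE temp SET (random_page_cost = 40);

     -- the inverse workaround: raise the GUC globally and pin the regular
     -- tablespaces holding the data back to the old value
     SET random_page_cost = 40;
     ALTER TABLESPACE data_ts SET (random_page_cost = 4);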

>If we want to be a bit more conservative, I'm fine with adding a
>general penalty against a HashAgg that we expect to spill (multiply the
>disk costs by some factor). We can consider removing the penalty in
>v14.
>

Not sure, I need to look at the costing a bit.

I see all page writes are costed as random while all page reads are
sequential. Shouldn't this consider the number of tapes and the fact that
we pre-allocate blocks?  Essentially, we split the whole file into chunks
of 128 blocks that can be read sequentially, and only the switches
between chunks are random.

Not sure if that's necessary, though. If increasing random_page_cost
pushes the plans to groupagg, then maybe that's good enough.

Let's push the two fixes that we already have. These extra questions
clearly need more investigation and testing, and I'm not even sure it's
something we should pursue for v13.

>> I do believe this is still due to differences in I/O
>> patterns, with parallel hashagg probably being a bit more random (I'm
>> deducing that from SSD not being affected by this).
>
>Do you think the difference in IO patterns is due to a difference in
>handling reads vs. writes in the kernel? Or do you think that 128
>blocks is not enough to amortize the cost of a seek for that device?
>

I don't know. I kinda imagined it was due to the workers interfering
with each other, but that should affect the sort the same way, right?
I don't have any data to support this, at the moment - I can repeat
the iosnoop tests and analyze the data, of course.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Trouble with hashagg spill I/O pattern and costing

From
Thomas Munro
Date:
On Tue, May 26, 2020 at 10:59 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On Mon, May 25, 2020 at 12:49:45PM -0700, Jeff Davis wrote:
> >Do you think the difference in IO patterns is due to a difference in
> >handling reads vs. writes in the kernel? Or do you think that 128
> >blocks is not enough to amortize the cost of a seek for that device?
>
> I don't know. I kinda imagined it was due to the workers interfering
> with each other, but that should affect the sort the same way, right?
> I don't have any data to support this, at the moment - I can repeat
> the iosnoop tests and analyze the data, of course.

About the reads vs writes question:  I know that reading and writing
two interleaved sequential "streams" through the same fd confuses the
read-ahead/write-behind heuristics on FreeBSD UFS (I mean: w(1),
r(42), w(2), r(43), w(3), r(44), ...) so the performance is terrible
on spinning media.  Andrew Gierth reported that as a problem for
sequential scans that are also writing back hint bits, and vacuum.
However, in a quick test on a Linux 4.19 XFS system, using a program
to generate interleaving read and write streams 1MB apart, I could see
that it was still happily generating larger clustered I/Os.  I have no
clue for other operating systems.  That said, even on Linux, reads and
writes still have to compete for scant IOPS on slow-seek media (albeit
hopefully in larger clustered I/Os)...

Jumping over large interleaving chunks with no prefetching from other
tapes *must* produce stalls though... and if you crank up the read
ahead size to be a decent percentage of the contiguous chunk size, I
guess you must also waste I/O bandwidth on unwanted data past the end
of each chunk, no?

In an off-list chat with Jeff about whether Hash Join should use
logtape.c for its partitions too, the first thought I had was that to
be competitive with separate files, perhaps you'd need to write out a
list of block ranges for each tape (rather than just next pointers on
each block), so that you have the visibility required to control
prefetching explicitly.  I guess that would be a bit like the list of
physical extents that Linux commands like filefrag(8) and xfs_bmap(8)
can show you for regular files.  (Other thoughts included worrying
about how to make it allocate and stream blocks in parallel queries,
...!?#$)



Re: Trouble with hashagg spill I/O pattern and costing

From
Tomas Vondra
Date:
On Tue, May 26, 2020 at 05:02:41PM +1200, Thomas Munro wrote:
>On Tue, May 26, 2020 at 10:59 AM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>> On Mon, May 25, 2020 at 12:49:45PM -0700, Jeff Davis wrote:
>> >Do you think the difference in IO patterns is due to a difference in
>> >handling reads vs. writes in the kernel? Or do you think that 128
>> >blocks is not enough to amortize the cost of a seek for that device?
>>
>> I don't know. I kinda imagined it was due to the workers interfering
>> with each other, but that should affect the sort the same way, right?
>> I don't have any data to support this, at the moment - I can repeat
>> the iosnoop tests and analyze the data, of course.
>
>About the reads vs writes question:  I know that reading and writing
>two interleaved sequential "streams" through the same fd confuses the
>read-ahead/write-behind heuristics on FreeBSD UFS (I mean: w(1),
>r(42), w(2), r(43), w(3), r(44), ...) so the performance is terrible
>on spinning media.  Andrew Gierth reported that as a problem for
>sequential scans that are also writing back hint bits, and vacuum.
>However, in a quick test on a Linux 4.19 XFS system, using a program
>to generate interleaving read and write streams 1MB apart, I could see
>that it was still happily generating larger clustered I/Os.  I have no
>clue for other operating systems.  That said, even on Linux, reads and
>writes still have to compete for scant IOPS on slow-seek media (albeit
>hopefully in larger clustered I/Os)...
>

True. I've repeated the tests while collecting iosnoop stats, both for
the good (partitioned hashagg) and bad (complete hashagg in each worker)
plans. Ignoring the fact that the bad plan does much more I/O in general
(about 32GB write + 35GB reads vs. 5.4GB + 7.6GB), request size stats
are almost exactly the same:

1) good plan (partitioned hashagg, ~180 seconds)

      type |  bytes  | count |  pct
     ------+---------+-------+-------
      RA   |  131072 | 39392 | 71.62
      RA   |    8192 |  5666 | 10.30
      RA   |   16384 |  3080 |  5.60
      RA   |   32768 |  2888 |  5.25
      RA   |   65536 |  2870 |  5.22
      RA   |  262144 |   710 |  1.29
      W    | 1310720 |  3138 | 32.01
      W    |  360448 |   633 |  6.46
      W    |  688128 |   538 |  5.49
      W    |  692224 |   301 |  3.07
      W    |  364544 |   247 |  2.52
      W    |  696320 |   182 |  1.86
      W    |    8192 |   164 |  1.67
      W    |  700416 |   116 |  1.18
      W    |  368640 |   102 |  1.04

2) bad plan (complete hashagg, ~500 seconds)

      type |  bytes  | count  |  pct
     ------+---------+--------+--------
      RA   |  131072 | 258924 |  68.54
      RA   |    8192 |  31357 |   8.30
      RA   |   16384 |  27596 |   7.31
      RA   |   32768 |  26434 |   7.00
      RA   |   65536 |  26415 |   6.99
      RM   |    4096 |    532 | 100.00
      W    | 1310720 |  15346 |  34.64
      W    |    8192 |    911 |   2.06
      W    |  360448 |    816 |   1.84
      W    |   16384 |    726 |   1.64
      W    |  688128 |    545 |   1.23
      W    |   32768 |    544 |   1.23
      W    |   40960 |    486 |   1.10
      W    |  524288 |    457 |   1.03

So in both cases, the majority of read requests (~70%) are 128kB, with an
additional ~25% happening in requests larger than 8kB. In terms of
volume, that's more than 90% of the read I/O.

There is some difference in the "I/O delta" stats, showing how far apart
the queued I/O requests are. The write stats look almost exactly the
same, but for reads it looks like this:

1) good plan

      type | block_delta | count |  pct  
     ------+-------------+-------+-------
      RA   |         256 |  7555 | 13.74
      RA   |          64 |  2297 |  4.18
      RA   |          32 |   685 |  1.25
      RA   |         128 |   613 |  1.11
      RA   |          16 |   612 |  1.11

2) bad plan

      type | block_delta | count |  pct
     ------+-------------+-------+-------
      RA   |          64 | 18817 |  4.98
      RA   |       30480 |  9778 |  2.59
      RA   |         256 |  9437 |  2.50

Ideally this should match the block size stats (the delta is in sectors,
so 256 is 128kB). Unfortunately the match is not all that great - even
for the "good" plan it's only about 14% vs. 70% (of 128kB requests). In
the serial plan (with parallelism disabled) it was ~70% vs. 75%, much
closer.

Anyway, I think this shows that the read-ahead works pretty well even
with multiple workers - otherwise there wouldn't be that many 128kB
requests. The poor correlation with 128kB deltas is unfortunate, but I
don't think we can really fix that.

This was on Linux (5.6.0) with ext4, but I don't think the filesystem
matters that much - the read-ahead happens at the page cache level, I think.

>Jumping over large interleaving chunks with no prefetching from other
>tapes *must* produce stalls though... and if you crank up the read
>ahead size to be a decent percentage of the contiguous chunk size, I
>guess you must also waste I/O bandwidth on unwanted data past the end
>of each chunk, no?
>
>In an off-list chat with Jeff about whether Hash Join should use
>logtape.c for its partitions too, the first thought I had was that to
>be competitive with separate files, perhaps you'd need to write out a
>list of block ranges for each tape (rather than just next pointers on
>each block), so that you have the visibility required to control
>prefetching explicitly.  I guess that would be a bit like the list of
>physical extents that Linux commands like filefrag(8) and xfs_bmap(8)
>can show you for regular files.  (Other thoughts included worrying
>about how to make it allocate and stream blocks in parallel queries,
>...!?#$)

I was wondering how useful would it be to do explicit prefetch too.

I'm not familiar with logtape internals but IIRC the blocks are linked
by each block having a pointer to the prev/next block, which means we
can't prefetch more than one block ahead I think. But maybe I'm wrong,
or maybe fetching even just one block ahead would help ...

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Trouble with hashagg spill I/O pattern and costing

From
Jeff Davis
Date:
On Tue, 2020-05-26 at 16:15 +0200, Tomas Vondra wrote:
> I'm not familiar with logtape internals but IIRC the blocks are
> linked
> by each block having a pointer to the prev/next block, which means we
> can't prefetch more than one block ahead I think. But maybe I'm
> wrong,
> or maybe fetching even just one block ahead would help ...

We'd have to get creative. Keeping a directory in the LogicalTape
structure might work, but I'm worried the memory requirements would be
too high.

One idea is to add a "prefetch block" to the TapeBlockTrailer (perhaps
only in the forward direction?). We could modify the prealloc list so
that we always know the next K blocks that will be allocated to the
tape. All for v14, of course, but I'd be happy to hack together a
prototype to collect data.

Do you have any other thoughts on the current prealloc patch for v13,
or is it about ready for commit?

Regards,
    Jeff Davis





Re: Trouble with hashagg spill I/O pattern and costing

From
Tomas Vondra
Date:
On Tue, May 26, 2020 at 11:40:07AM -0700, Jeff Davis wrote:
>On Tue, 2020-05-26 at 16:15 +0200, Tomas Vondra wrote:
>> I'm not familiar with logtape internals but IIRC the blocks are
>> linked
>> by each block having a pointer to the prev/next block, which means we
>> can't prefetch more than one block ahead I think. But maybe I'm
>> wrong,
>> or maybe fetching even just one block ahead would help ...
>
>We'd have to get creative. Keeping a directory in the LogicalTape
>structure might work, but I'm worried the memory requirements would be
>too high.
>
>One idea is to add a "prefetch block" to the TapeBlockTrailer (perhaps
>only in the forward direction?). We could modify the prealloc list so
>that we always know the next K blocks that will be allocated to the
>tape. All for v14, of course, but I'd be happy to hack together a
>prototype to collect data.
>

Yeah. I agree prefetching is definitely out of v13 scope. It might be
interesting to try how useful would it be, if you're willing to spend
some time on a prototype.

>
>Do you have any other thoughts on the current prealloc patch for v13,
>or is it about ready for commit?
>

I think it's pretty much ready to go.

I have some doubts about the maximum value (128 probably means read-ahead
values above 256 are pointless, although I have not tested that). But
it's still a huge improvement with 128, so let's get that committed.

I've been thinking about actually computing the expected number of
blocks per tape, and tying the maximum to that, somehow. But that's
something we can look at in the future.

As for the tlist fix, I think that's mostly ready too - the one thing we
should do is probably only doing it for AGG_HASHED. For AGG_SORTED it's
not really necessary.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Trouble with hashagg spill I/O pattern and costing

From
Jeff Davis
Date:
On Tue, 2020-05-26 at 21:15 +0200, Tomas Vondra wrote:
> Yeah. I agree prefetching is definitely out of v13 scope. It might be
> interesting to try how useful would it be, if you're willing to spend
> some time on a prototype.

I think a POC would be pretty quick; I'll see if I can hack something
together.

> I think it's pretty much ready to go.

Committed with a max of 128 preallocated blocks, with minor revisions.

> 
> As for the tlist fix, I think that's mostly ready too - the one thing
> we
> should do is probably only doing it for AGG_HASHED. For AGG_SORTED
> it's
> not really necessary.

Melanie previously posted a patch to avoid spilling unneeded columns,
but it introduced more code:


https://www.postgresql.org/message-id/CAAKRu_aefEsv+UkQWqu+ioEnoiL2LJu9Diuh9BR8MbyXuZ0j4A@mail.gmail.com

and it seems that Heikki also looked at it. Perhaps we should get an
acknowledgement from one of them that your one-line change is the right
approach?

Regards,
    Jeff Davis





Re: Trouble with hashagg spill I/O pattern and costing

From
Melanie Plageman
Date:

On Tue, May 26, 2020 at 5:40 PM Jeff Davis <pgsql@j-davis.com> wrote:
On Tue, 2020-05-26 at 21:15 +0200, Tomas Vondra wrote:
>
> As for the tlist fix, I think that's mostly ready too - the one thing
> we
> should do is probably only doing it for AGG_HASHED. For AGG_SORTED
> it's
> not really necessary.

Melanie previously posted a patch to avoid spilling unneeded columns,
but it introduced more code:


https://www.postgresql.org/message-id/CAAKRu_aefEsv+UkQWqu+ioEnoiL2LJu9Diuh9BR8MbyXuZ0j4A@mail.gmail.com

and it seems that Heikki also looked at it. Perhaps we should get an
acknowledgement from one of them that your one-line change is the right
approach?


I spent some time looking at it today, and, it turns out I was wrong.

I thought that there was a case I had found where CP_SMALL_TLIST did not
eliminate as many columns as could be eliminated for the purposes of
spilling, but, that turned out not to be the case.

I changed CP_LABEL_TLIST to CP_SMALL_TLIST in
create_groupingsets_plan(), create_agg_plan(), etc and tried a bunch of
different queries and this 2-3 line change worked for all the cases I
tried. Is that where you made the change?
And then are you proposing to set it based on the aggstrategy to either
CP_LABEL_TLIST or CP_SMALL_TLIST here?

--
Melanie Plageman

Re: Trouble with hashagg spill I/O pattern and costing

From
Tomas Vondra
Date:
On Tue, May 26, 2020 at 06:42:50PM -0700, Melanie Plageman wrote:
>On Tue, May 26, 2020 at 5:40 PM Jeff Davis <pgsql@j-davis.com> wrote:
>
>> On Tue, 2020-05-26 at 21:15 +0200, Tomas Vondra wrote:
>> >
>> > As for the tlist fix, I think that's mostly ready too - the one thing
>> > we
>> > should do is probably only doing it for AGG_HASHED. For AGG_SORTED
>> > it's
>> > not really necessary.
>>
>> Melanie previously posted a patch to avoid spilling unneeded columns,
>> but it introduced more code:
>>
>>
>>
>> https://www.postgresql.org/message-id/CAAKRu_aefEsv+UkQWqu+ioEnoiL2LJu9Diuh9BR8MbyXuZ0j4A@mail.gmail.com
>>
>> and it seems that Heikki also looked at it. Perhaps we should get an
>> acknowledgement from one of them that your one-line change is the right
>> approach?
>>
>>
>I spent some time looking at it today, and, it turns out I was wrong.
>
>I thought that there was a case I had found where CP_SMALL_TLIST did not
>eliminate as many columns as could be eliminated for the purposes of
>spilling, but, that turned out not to be the case.
>
>I changed CP_LABEL_TLIST to CP_SMALL_TLIST in
>create_groupingsets_plan(), create_agg_plan(), etc and tried a bunch of
>different queries and this 2-3 line change worked for all the cases I
>tried. Is that where you made the change?

I've only made the change in create_agg_plan, because that's what was in
the query plan I was investigating. You may be right that the same fix
is needed in additional places, though.

>And then are you proposing to set it based on the aggstrategy to either
>CP_LABEL_TLIST or CP_SMALL_TLIST here?
>

Yes, something like that. The patch I shared on 5/21 just changed
that, but I'm wondering if that could add overhead for sorted
aggregation, which already does the projection thanks to the sort.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: Trouble with hashagg spill I/O pattern and costing

From
Tomas Vondra
Date:
On Wed, May 27, 2020 at 11:07:04AM +0200, Tomas Vondra wrote:
>On Tue, May 26, 2020 at 06:42:50PM -0700, Melanie Plageman wrote:
>>On Tue, May 26, 2020 at 5:40 PM Jeff Davis <pgsql@j-davis.com> wrote:
>>
>>>On Tue, 2020-05-26 at 21:15 +0200, Tomas Vondra wrote:
>>>>
>>>> As for the tlist fix, I think that's mostly ready too - the one thing
>>>> we
>>>> should do is probably only doing it for AGG_HASHED. For AGG_SORTED
>>>> it's
>>>> not really necessary.
>>>
>>>Melanie previously posted a patch to avoid spilling unneeded columns,
>>>but it introduced more code:
>>>
>>>
>>>
>>>https://www.postgresql.org/message-id/CAAKRu_aefEsv+UkQWqu+ioEnoiL2LJu9Diuh9BR8MbyXuZ0j4A@mail.gmail.com
>>>
>>>and it seems that Heikki also looked at it. Perhaps we should get an
>>>acknowledgement from one of them that your one-line change is the right
>>>approach?
>>>
>>>
>>I spent some time looking at it today, and, it turns out I was wrong.
>>
>>I thought that there was a case I had found where CP_SMALL_TLIST did not
>>eliminate as many columns as could be eliminated for the purposes of
>>spilling, but, that turned out not to be the case.
>>
>>I changed CP_LABEL_TLIST to CP_SMALL_TLIST in
>>create_groupingsets_plan(), create_agg_plan(), etc and tried a bunch of
>>different queries and this 2-3 line change worked for all the cases I
>>tried. Is that where you made the change?
>
>I've only made the change in create_agg_plan, because that's what was in
>the query plan I was investigating. You may be right that the same fix
>is needed in additional places, though.
>

Attached is a patch adding CP_SMALL_TLIST to create_agg_plan and
create_groupingsets_plan. I've looked at the other places that seem like
they might benefit from it (create_upper_unique_plan, create_group_plan)
but I don't think we need to modify those - there'll either be a Sort or
a HashAgg, which will take care of the projection.

Or do you think some other places need CP_SMALL_TLIST?


>>And then are you proposing to set it based on the aggstrategy to either
>>CP_LABEL_TLIST or CP_SMALL_TLIST here?
>>
>
>Yes, something like that. The patch I shared on 5/21 just changed
>that, but I'm wondering if that could add overhead for sorted
>aggregation, which already does the projection thanks to the sort.
>

I ended up tweaking the tlist only for AGG_MIXED and AGG_HASHED. We
clearly don't need it for AGG_PLAIN or AGG_SORTED. This way we don't
break regression tests by adding unnecessary "Subquery Scan" nodes.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Trouble with hashagg spill I/O pattern and costing

From
Jeff Davis
Date:
On Tue, 2020-05-26 at 17:40 -0700, Jeff Davis wrote:
> On Tue, 2020-05-26 at 21:15 +0200, Tomas Vondra wrote:
> > Yeah. I agree prefetching is definitely out of v13 scope. It might
> > be
> > interesting to try how useful would it be, if you're willing to
> > spend
> > some time on a prototype.
> 
> I think a POC would be pretty quick; I'll see if I can hack something
> together.

Attached (intended for v14).

I changed the list from a simple array to a circular buffer so that we
can keep enough preallocated block numbers in it to do prefetching.

On SSD I didn't see any improvement, but perhaps it will do better on
magnetic storage.

Regards,
    Jeff Davis


Attachment

Re: Trouble with hashagg spill I/O pattern and costing

From
Jeff Davis
Date:
On Thu, 2020-05-28 at 20:57 +0200, Tomas Vondra wrote:
> Attached is a patch adding CP_SMALL_TLIST to create_agg_plan and
> create_groupingsets_plan.

Looks good, except one question:

Why would aggstrategy ever be MIXED when in create_agg_path? Wouldn't
that only happen in create_groupingsets_path?

Regards,
    Jeff Davis





Re: Trouble with hashagg spill I/O pattern and costing

From
Tomas Vondra
Date:
On Thu, May 28, 2020 at 06:14:55PM -0700, Jeff Davis wrote:
>On Thu, 2020-05-28 at 20:57 +0200, Tomas Vondra wrote:
>> Attached is a patch adding CP_SMALL_TLIST to create_agg_plan and
>> create_groupingsets_plan.
>
>Looks good, except one question:
>
>Why would aggstrategy ever be MIXED when in create_agg_path? Wouldn't
>that only happen in create_groupingsets_path?
>

Ah, right. Yeah, we only need to check for AGG_HASHED here. Moreover,
AGG_MIXED probably does not need the tlist tweak, because the input
should be pre-sorted as with AGG_SORTED.

And we should probably do a similar check in create_groupingsets_path,
I guess. At first I thought we couldn't do that before inspecting the
rollups, which only happens later in the function, but now I see there's
an aggstrategy field in GroupingSetsPath too.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

Re: Trouble with hashagg spill I/O pattern and costing

From
Jeff Davis
Date:
On Fri, 2020-05-29 at 15:04 +0200, Tomas Vondra wrote:
> Ah, right. Yeah, we only need to check for AGG_HASH here. Moreover,
> AGG_MIXED probably does not need the tlist tweak, because the input
> should be pre-sorted as with AGG_SORTED.
> 
> And we should probably do similar check in the
> create_groupinsets_path,
> I guess. At first I thought we can't do that before inspecting
> rollups,
> which only happens later in the function, but now I see there's
> aggstrategy in GroupingSetsPath too.

Looks good.

    Jeff





Re: Trouble with hashagg spill I/O pattern and costing

From
Alvaro Herrera
Date:
Hello

Is this patch the only thing missing before this open item can be
considered closed?

Thanks

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Trouble with hashagg spill I/O pattern and costing

From
Tomas Vondra
Date:
On Fri, Jun 05, 2020 at 05:19:43PM -0400, Alvaro Herrera wrote:
>Hello
>
>Is this patch the only thing missing before this open item can be
>considered closed?
>

I've already pushed this as 4cad2534da6d17067d98cf04be2dfc1bda8f2cd0,
sorry for not mentioning it in this thread explicitly.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services 



Re: Trouble with hashagg spill I/O pattern and costing

From
Alvaro Herrera
Date:
On 2020-Jun-06, Tomas Vondra wrote:

> On Fri, Jun 05, 2020 at 05:19:43PM -0400, Alvaro Herrera wrote:
>
> > Is this patch the only thing missing before this open item can be
> > considered closed?
> 
> I've already pushed this as 4cad2534da6d17067d98cf04be2dfc1bda8f2cd0,
> sorry for not mentioning it in this thread explicitly.

That's great to know, thanks.  The other bit necessary to answer my
question is whether we need to do anything else in this area -- if
not, then we can mark the open item as closed:
https://wiki.postgresql.org/wiki/PostgreSQL_13_Open_Items#Open_Issues

Thanks

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Trouble with hashagg spill I/O pattern and costing

From
Tomas Vondra
Date:
On Fri, Jun 05, 2020 at 06:51:34PM -0400, Alvaro Herrera wrote:
>On 2020-Jun-06, Tomas Vondra wrote:
>
>> On Fri, Jun 05, 2020 at 05:19:43PM -0400, Alvaro Herrera wrote:
>>
>> > Is this patch the only thing missing before this open item can be
>> > considered closed?
>>
>> I've already pushed this as 4cad2534da6d17067d98cf04be2dfc1bda8f2cd0,
>> sorry for not mentioning it in this thread explicitly.
>
>That's great to know, thanks.  The other bit necessary to answer my
>question is whether we need to do anything else in this area -- if
>not, then we can mark the open item as closed:
>https://wiki.postgresql.org/wiki/PostgreSQL_13_Open_Items#Open_Issues
>

Hmmm, good question.

There was some discussion about maybe tweaking the costing model to make
it a bit more pessimistic (assuming more random I/O or something like
that), but I'm not sure it's still needed. Increasing random_page_cost
for the temp tablespace did the trick for me.

So I'd say we can mark it as closed ...


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services