Re: Trouble with hashagg spill I/O pattern and costing - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: Trouble with hashagg spill I/O pattern and costing
Date
Msg-id 20200525225909.aifxjm53hw4xyzxd@development
In response to Re: Trouble with hashagg spill I/O pattern and costing  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers
On Mon, May 25, 2020 at 12:49:45PM -0700, Jeff Davis wrote:
>On Mon, 2020-05-25 at 14:17 +0200, Tomas Vondra wrote:
>> It's still ~2x slower than the sort, so presumably we'll need to
>> tweak
>> the costing somehow.
>
>One thing to think about is that the default random_page_cost is only
>4X seq_page_cost. We know that's complete fiction, but it's meant to
>paper over the OS caching effects. It seems like that shortcut may be
>what's hurting us now.
>
>HashAgg counts 1/2 of the page accesses as random, whereas Sort only
>counts 1/4 as random. If the random_page_cost were closer to reality,
>HashAgg would already be penalized substantially. It might be
>interesting to test with higher values of random_page_cost and see what
>the planner does.
>

Hmmm, good point. I've tried bumping random_page_cost up (to 40) and
the bogus plans disappeared - instead, the parallel plans switch from
hashagg to groupagg at 4 workers. There are a lot of parameters
affecting this, though (e.g. higher work_mem -> more workers use
hashagg).
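For reference, the experiment boils down to something like this (the
table and query are placeholders, not the actual benchmark):

```sql
-- Hypothetical reproduction sketch; table and query are placeholders.
SET random_page_cost = 40;
SET work_mem = '4MB';   -- small work_mem to force the aggregate to spill
EXPLAIN (ANALYZE, BUFFERS)
SELECT a, count(*) FROM t GROUP BY a;
```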

One thing annoys me, though. The costing just ignores costs set on the
(temporary) tablespace, which confused me at first because

     ALTER TABLESPACE temp SET (random_page_cost = 40);

had absolutely no effect. It's not the fault of this patch and it's
actually understandable (we can have multiple temp tablespaces, and we 
don't know which one will end up being used).

I wonder if it would be a good idea to have a "temp" version of those
cost variables, applied to all temporary tablespaces ... Currently I
have to do the inverse thing - tweak the costs on all regular
tablespaces instead.
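The workaround looks roughly like this (pg_default shown as an example;
in practice it has to be repeated for every regular tablespace):

```sql
-- Sketch of the inverse workaround: pin the old cost on each regular
-- tablespace, then raise the global value so it effectively only
-- affects temp file access.
ALTER TABLESPACE pg_default SET (random_page_cost = 4);
SET random_page_cost = 40;
```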

>If we want to be a bit more conservative, I'm fine with adding a
>general penalty against a HashAgg that we expect to spill (multiply the
>disk costs by some factor). We can consider removing the penalty in
>v14.
>

Not sure, I need to look at the costing a bit.

I see all page writes are costed as random while all page reads are
sequential. Shouldn't this consider the number of tapes and the fact
that we pre-allocate blocks? Essentially, we split the whole file into
chunks of 128 blocks that can be read sequentially, and only the
switches between chunks are random.
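As a back-of-the-envelope illustration (plain SQL, not the actual
costing code; 4.0 and 1.0 are the default random/sequential page
costs), the chunked model would charge one random access per 128-block
chunk and sequential access for the rest:

```sql
-- Illustrative only: cost of reading npages of spill data laid out in
-- 128-block pre-allocated chunks (one random access per chunk switch,
-- sequential within a chunk), vs. treating all reads as sequential.
SELECT npages,
       ceil(npages / 128.0) * 4.0
         + (npages - ceil(npages / 128.0)) * 1.0 AS chunked_cost,
       npages * 1.0                              AS all_seq_cost
FROM (VALUES (128), (131072)) AS v(npages);
```

For large spill files the two models stay fairly close, which may be
why treating the reads as sequential works well enough in practice.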

Not sure if that's necessary, though. If increasing random_page_cost
pushes the plans to groupagg, then maybe that's good enough.

Let's push the two fixes that we already have. These extra questions
clearly need more investigation and testing, and I'm not even sure it's
something we should pursue for v13.

>> I do believe this is still due to differences in I/O
>> patterns, with parallel hashagg probably being a bit more random (I'm
>> deducing that from SSD not being affected by this).
>
>Do you think the difference in IO patterns is due to a difference in
>handling reads vs. writes in the kernel? Or do you think that 128
>blocks is not enough to amortize the cost of a seek for that device?
>

I don't know. I kinda imagined it was due to the workers interfering
with each other, but that should affect the sort the same way, right?
I don't have any data to support this, at the moment - I can repeat
the iosnoop tests and analyze the data, of course.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


