Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H
Date
Msg-id 55857F04.9020007@2ndquadrant.com
Whole thread Raw
In response to Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H  (Feng Tian <ftian@vitessedata.com>)
Responses Re: pretty bad n_distinct estimate, causing HashAgg OOM on TPC-H  (Feng Tian <ftian@vitessedata.com>)
List pgsql-hackers
Hi,

On 06/20/2015 08:54 AM, Feng Tian wrote:
>
> While better sample/stats is important for choosing a good plan, in
> this query, hash agg is really the right plan. If a sort agg is
> chosen, the performance will be really really bad. The patch that
> Jeff is working on is critical for a decent TPCH number (unless you
> have unlimited amount of memory).

I do agree that Jeff's memory-bounded hashagg patch is very important 
feature, and in fact we spent a fair amount of time discussing it in 
Ottawa. So I'm looking forward to getting that soon ;-)

But I don't think hashagg is going to be very good in this particular 
case. With a 3TB dataset, the query runs out of memory on a machine with 
256GB of RAM. So let's assume a complete hash table has ~256GB. With 
work_mem=1GB that means only ~1/256 of the table can be processed in one 
batch, so we'll process the first 1/256 of the table, and write out the 
remaining 99% into batches. Then we'll read the batches one by one, and 
process those. The table has ~2.5TB, so we'll read 2.5TB, write out 
~2.49TB into batches, and then read those ~2.49TB again. At least that's 
how I understand Jeff's memory-bounded hashagg proposal.

The sort may perform worse in the general case, but in this case there's 
an index on the column, and the table is almost perfectly correlated by 
that column (due to generating the orders one by one, but it seems 
plausible it'd be the same in reality, assuming the orders are numbered 
using a sequence). So doing the sort by an indexscan seems rather cheap, 
and you only need to scan the table once.

regards

--
Tomas Vondra                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: pg_stat_*_columns?
Next
From: Tom Lane
Date:
Subject: Re: pg_stat_*_columns?