Thread: [GENERAL] How does Postgres estimate the memory needed for sorting/aggregating

[GENERAL] How does Postgres estimate the memory needed for sorting/aggregating

From

Thomas Kellerer

Date:

25 January 2017, 23:59:25

There was a question on dba.stackexchange recently:

    http://dba.stackexchange.com/a/162117/1822

That question (and the answer) deals with performance difference of a query caused by the _declared_ length of a
VARCHARcolumn in SQL Server (everything else being equal - especially the actual data length) 

For the curios: it does make a (big) difference in performance if you declare varchar(100) or varchar(2000) in SQL
Server- something that really surprised me. 

The difference in performance in SQL Servers seems to be caused by SQL Server's optimizer that uses the _declared_
lengthof a column to estimate the memory needed for the aggregation (or sorting). 

Now, we all know that there is no performance difference whatsoever for varchar columns regardless of the declared
length.

In one of the comments, to that answer the question was asked how Postgres knows how much memory it needs to allocate
todo the aggregation. 

I guess this is based on the column statistics stored in pg_stats, but I am not sure:

So here is my question: how does Postgres estimate/know the memory needed for the aggregation? Or does it dynamically
resizethe memory if the initial assumption was wrong? 

Thomas

Re: [GENERAL] How does Postgres estimate the memory needed forsorting/aggregating

From

Tomas Vondra

Date:

26 January 2017, 00:46:56

On 01/25/2017 09:59 PM, Thomas Kellerer wrote:
> There was a question on dba.stackexchange recently:
>
>    http://dba.stackexchange.com/a/162117/1822
>
> That question (and the answer) deals with performance difference of a
> query caused by the _declared_ length of a VARCHAR column in SQL Server
> (everything else being equal - especially the actual data length)
>
> For the curios: it does make a (big) difference in performance if you
> declare varchar(100) or varchar(2000) in SQL Server - something that
> really surprised me.
>
> The difference in performance in SQL Servers seems to be caused by SQL
> Server's optimizer that uses the _declared_ length of a column to
> estimate the memory needed for the aggregation (or sorting).
>
> Now, we all know that there is no performance difference whatsoever for
> varchar columns regardless of the declared length.
>
> In one of the comments, to that answer the question was asked how
> Postgres knows how much memory it needs to allocate to do the aggregation.
>
> I guess this is based on the column statistics stored in pg_stats, but I
> am not sure:
>

It is based on the average length of values in that column, yes.

We estimate the number of distinct groups produced by the aggregation,
and multiply it by average length of the key(s). The declared maximum
length of a column does not matter.

So if the grouping is expected to produce 1000 groups, and each key
column is 100B on average, 100kB should be enough - but only for the
keys. The estimate also has to include the aggregate states, which is a
different thing.

 >
> So here is my question: how does Postgres estimate/know the memory
> needed for the aggregation? Or does it dynamically resize the memory if
> the initial assumption was wrong?
>

I'm not sure what you mean by 'dynamically resize'. The above decision
is pretty much how planner decides whether to use hash aggregate or
group aggregate. If we estimate that the hash aggregate will fit into
work_mem, the planner will consider both possibilities. If the estimate
says hash aggregate would not fit into work_mem, we'll only consider
group aggregate, because that can work with very little memory.

At execution time we'll only use as much memory as actually needed. The
trouble is that if we under-estimated the amount of memory, there's no
way back.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [GENERAL] How does Postgres estimate the memory needed forsorting/aggregating

From

John R Pierce

Date:

26 January 2017, 00:47:41

On 1/25/2017 12:59 PM, Thomas Kellerer wrote:
> So here is my question: how does Postgres estimate/know the memory
> needed for the aggregation? Or does it dynamically resize the memory
> if the initial assumption was wrong?

my understanding is it fits as much as it can into a work_mem sized
allocation, and if thats not enough uses temporary files and multiple
passes.

--
john r pierce, recycling bits in santa cruz

[GENERAL] Re: How does Postgres estimate the memory needed forsorting/aggregating

From

Thomas Kellerer

Date:

26 January 2017, 01:00:11

Tomas Vondra schrieb am 25.01.2017 um 22:46:
>> I guess this is based on the column statistics stored in pg_stats, but I
>> am not sure:
>>
>
> It is based on the average length of values in that column, yes.

Thanks for confirming that.

I assume this is taken from pg_stats.avg_width ?

> I'm not sure what you mean by 'dynamically resize'. The above
> decision is pretty much how planner decides whether to use hash
> aggregate or group aggregate. If we estimate that the hash aggregate
> will fit into work_mem, the planner will consider both possibilities.
> If the estimate says hash aggregate would not fit into work_mem,
> we'll only consider group aggregate, because that can work with very
> little memory.
>
> At execution time we'll only use as much memory as actually needed.
> The trouble is that if we under-estimated the amount of memory,
> there's no way back.

The "under-estimation" is what I am referring to with "dynamically resize".

What happens if the planner assumes 100kb but in reality it needs 100MB?

Thomas

Re: [GENERAL] How does Postgres estimate the memory needed forsorting/aggregating

From

Tomas Vondra

Date:

26 January 2017, 01:06:44

On 01/25/2017 10:47 PM, John R Pierce wrote:
> On 1/25/2017 12:59 PM, Thomas Kellerer wrote:
>> So here is my question: how does Postgres estimate/know the memory
>> needed for the aggregation? Or does it dynamically resize the memory
>> if the initial assumption was wrong?
>
> my understanding is it fits as much as it can into a work_mem sized
> allocation, and if thats not enough uses temporary files and multiple
> passes.
>

That only works for hash joins, not for hash aggregates. Hash aggregate
is about the only operation in PostgreSQL that can cause OOM because of
under-estimation.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services