Thread: HashAggregate slower than sort?

HashAggregate slower than sort?

From: "Jatinder Sangha"
Hi,

I've noticed something that I find strange with the hash-aggregate feature of Postgres. I'm currently running Postgres v8.4.1 on Debian Linux 64-bit.

I have a simple query that when planned either uses hash-aggregates or a sort depending on the amount of working memory available. The problem is that when it uses the hash-aggregates, the query runs 25% slower than when using the sort method.

The table in question contains about 60 columns, many of which are booleans or 32-bit integers, and some of which are 64-bit integers. Many fields are text - and some of these can be quite long (e.g. 32 kB).

The SQL is as follows:

explain analyse
select distinct T1.*
  from role T1
 where T1.endDate is null and T1.latest=true and T1.active=true and
       T1.deceased=false and T1.desk in (BIG LIST OF INTEGERS);

select version() --> "PostgreSQL 8.4.1 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 3.4.6 20060404 (Red Hat 3.4.6-10), 64-bit"

show enable_hashagg --> "on"
set work_mem='8MB'
show work_mem --> "8MB"

Explain analyse of the SQL above:
Unique  (cost=47033.71..48410.27 rows=8881 width=1057) (actual time=18.803..38.969 rows=6449 loops=1)
  ->  Sort  (cost=47033.71..47055.91 rows=8881 width=1057) (actual time=18.801..20.560 rows=6449 loops=1)
        Sort Key: id, version, latest, active, deceased, person, formalnotes, informalnotes, description, desk, rolelevel, roletype, promotiondate, primaryrole, headofplace, careergrading, startdate, enddate, percentsalary, deskf, rolelevelf, roletypef, promotiondatef, primaryrolef, headofplacef, careergradingf, startdatef, enddatef, percentsalaryf, descriptionf, deskmv, rolelevelmv, roletypemv, promotiondatemv, primaryrolemv, headofplacemv, careergradingmv, startdatemv, enddatemv, percentsalarymv, descriptionmv, hasattachments, hasrelationships, hasprojects, audwho, audwhen, audcreated, costcentre, reportsto, manages, startdateest, enddateest, hasstarperformers, projectnames, sourcefrom, sourceto, checkedwho, checkedwhen, checkednotes, hasqueries, querytitles

        Sort Method:  quicksort  Memory: 2001kB
        ->  Bitmap Heap Scan on role t1  (cost=4888.59..42321.27 rows=8881 width=1057) (actual time=7.041..12.504 rows=6449 loops=1)

              Recheck Cond: (desk = ANY ('BIG LIST OF INTEGERS'::bigint[]))
              Filter: ((enddate IS NULL) AND latest AND active AND (NOT deceased))
              ->  Bitmap Index Scan on role_ix2  (cost=0.00..4886.37 rows=10984 width=0) (actual time=6.948..6.948 rows=9296 loops=1)

                    Index Cond: ((latest = true) AND (active = true) AND (deceased = false) AND (desk = ANY ('BIG LIST OF INTEGERS'::bigint[])))

Total runtime: 40.777 ms

This execution of the query used a sort to perform the "distinct".

Now for the second run:

select version() --> "PostgreSQL 8.4.1 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 3.4.6 20060404 (Red Hat 3.4.6-10), 64-bit"

show enable_hashagg --> "on"
set work_mem='64MB'
show work_mem --> "64MB"

Explain analyse of the SQL above:
HashAggregate  (cost=43675.63..43764.44 rows=8881 width=1057) (actual time=46.556..55.694 rows=6449 loops=1)
  ->  Bitmap Heap Scan on role t1  (cost=4888.59..42321.27 rows=8881 width=1057) (actual time=7.179..13.023 rows=6449 loops=1)

        Recheck Cond: (desk = ANY ('BIG LIST OF INTEGERS'::bigint[]))
        Filter: ((enddate IS NULL) AND latest AND active AND (NOT deceased))
        ->  Bitmap Index Scan on role_ix2  (cost=0.00..4886.37 rows=10984 width=0) (actual time=7.086..7.086 rows=9296 loops=1)

              Index Cond: ((latest = true) AND (active = true) AND (deceased = false) AND (desk = ANY ('BIG LIST OF INTEGERS'::bigint[])))

Total runtime: 57.536 ms


I've tested this with v8.4.4 as well, with the same results. I also tested the same query against our previous production version of Postgres (v8.3.8), and that version only ever appears to use sorting, not hash-aggregates.

Obviously, I can re-write the query to use a "distinct on (...)" clause to improve performance - which is what I've done - but my question is: why is the hash-aggregate slower than the sort?
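The rewrite amounts to something like the following sketch (assuming, as I note below, that (id, version) is enough to identify a row):

```sql
-- Sketch of the rewrite: distinct on a narrow key instead of all
-- 60 columns. Assumes (id, version) uniquely identifies a row.
select distinct on (T1.id, T1.version) T1.*
  from role T1
 where T1.endDate is null and T1.latest=true and T1.active=true and
       T1.deceased=false and T1.desk in (BIG LIST OF INTEGERS);
```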

Is it something to do with the number of columns? I.e. when sorting, the first few columns defined on the table (id, version) are enough to make a row unique - but when using the hash-aggregate feature, presumably every column needs to be hashed, which takes longer, especially for the long text fields?
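As a toy illustration of that suspicion (plain Python, nothing to do with PostgreSQL internals; the CountingStr class is invented purely to count operations): tuple comparison can stop at the first differing column, while hashing a row has to consume every column, including a 32 kB text field.

```python
class CountingStr(str):
    """A str subclass that counts how many times it gets hashed."""
    hash_calls = 0

    def __hash__(self):
        CountingStr.hash_calls += 1
        return super().__hash__()

big_text = "x" * 32_000  # stands in for one long (32 kB) text column

# 1,000 rows whose first column (a unique id) already decides ordering,
# much like (id, version) deciding uniqueness on the role table.
rows = [(i, True, CountingStr(big_text)) for i in range(1_000)]

sorted(rows)  # sort-based path: order is decided by the int id alone
hashes_during_sort = CountingStr.hash_calls  # the big text is never hashed

set(rows)     # hash-based path: every column of every row is hashed
hashes_during_set = CountingStr.hash_calls   # one hash of big_text per row
```

In the sort-based path the comparisons short-circuit on the unique first column, so the long string is never touched; in the hash-based path all 32 kB of every row's text column goes through the hash function.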

Thanks,
--Jatinder

Coalition Development Ltd 1st  Floor, One Newhams Row, London, United Kingdom, SE1 3UZ

Registration Number - 04328897 Registered Office - Direct Control 3rd Floor, Marvic House, Bishops Road, London, United Kingdom, SW6 7AD

Re: HashAggregate slower than sort?

From: "Kevin Grittner"
"Jatinder Sangha" <js@coalition.com> wrote:

> I have a simple query that when planned either uses hash-
> aggregates or a sort depending on the amount of working memory
> available. The problem is that when it uses the hash-aggregates,
> the query runs 25% slower than when using the sort method.
>
> The table in question contains about 60 columns, many of which are
> boolean, 32-bit integers and some are 64-bit integers. Many fields
> are text - and some of these can be quite long (eg 32Kb).

> Obviously, I can re-write the query to use a "distinct on (...)"
> clause

Yeah, that seems prudent, to say the least.

> Why is the hash-aggregate slower than the sort?
>
> Is it something to do with the number of columns? ie. When
> sorting, the first few columns defined on the table (id, version)
> make the row unique - but when using the hash-aggregate feature,
> presumably every column needs to be hashed which takes longer
> especially for long text fields?

Sounds like a reasonable guess to me.  But since you're apparently
retrieving about 9,000 wide rows in (worst case) 56 ms, it would
seem that your active data set may be fully cached.  If so, you
could try reducing both random_page_cost and seq_page_cost to
something in the 0.1 to 0.005 range and see if it improves the
accuracy of the cost estimates.  Not that you should go back to
using DISTINCT on all 60 columns, including big text columns; but
these cost factors might help other queries pick faster plans.
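For example (values purely illustrative; they would need testing against your own workload):

```sql
-- Illustrative only: cost factors for a fully cached data set.
-- Try values in the 0.1 to 0.005 range and re-check plan choices.
set random_page_cost = 0.1;
set seq_page_cost = 0.1;
```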

-Kevin

Re: HashAggregate slower than sort?

From: "Jatinder Sangha"
Hi Kevin,

Thanks for the suggestions.

I've already converted all of my SQL to use "distinct on (...)", and the
hash-aggregate plans are now always faster than the sorting ones. The
queries now only fall back to sorting when the hash table would take up
too much memory.

Thanks,
--Jatinder






Re: HashAggregate slower than sort?

From: "Kevin Grittner"
"Jatinder Sangha" <js@coalition.com> wrote:

> I've already converted all of my SQL to use "distinct on (...)"
> and this is now always faster using the hash-aggregates than when
> using sorting.  The queries now only use sorting if the hashing
> would take up too much memory.

It's great that you have a solution to your immediate problem, but
if the active portion of your database is really as fully cached as
your problem case indicates, you should probably still tweak the
costing factors.  Doing so will help the optimizer pick good plans
for any arbitrary query you choose to run.  If the caching was
unusual, and was just showing up because you were repeatedly running
that one test case, then never mind.  :-)

-Kevin