Thread: BUG #5231: SELECT DISTINCT poorly implemented vs SELECT ... GROUP BY

BUG #5231: SELECT DISTINCT poorly implemented vs SELECT ... GROUP BY

From

"Thomas Hamilton"

Date:

03 December 2009, 12:20:03

The following bug has been logged online:

Bug reference:      5231
Logged by:          Thomas Hamilton
Email address:      thomashamilton76@yahoo.com
PostgreSQL version: 8.3.8
Operating system:   Ubuntu 4.2.4
Description:        SELECT DISTINCT poorly implemented vs SELECT ... GROUP
BY
Details:

SELECT DISTINCT does a Sort followed by Unique.

SELECT ... GROUP BY, which is logically equivalent, performs a
HashAggregate.

When run against a large dataset with a small number of distinct results
HashAggregate is an order of magnitude more efficient!

Since the spec does not require DISTINCT to return sorted results, I don't
believe Sort ... Unique will ever be more efficient than HashAggregate.

Therefore, in order to maximize performance, DISTINCT should always be
implemented as HashAggregate.

Re: BUG #5231: SELECT DISTINCT poorly implemented vs SELECT ... GROUP BY

From

Joshua Tolley

Date:

03 December 2009, 12:25:12

On Thu, Dec 03, 2009 at 03:56:05PM +0000, Thomas Hamilton wrote:
>=20
> The following bug has been logged online:
>=20
> Bug reference:      5231
> Logged by:          Thomas Hamilton
> Email address:      thomashamilton76@yahoo.com
> PostgreSQL version: 8.3.8
> Operating system:   Ubuntu 4.2.4
> Description:        SELECT DISTINCT poorly implemented vs SELECT ... GROUP
> BY
> Details:=20
>=20
> SELECT DISTINCT does a Sort followed by Unique.
>=20
> SELECT ... GROUP BY, which is logically equivalent, performs a
> HashAggregate.
>=20
> When run against a large dataset with a small number of distinct results
> HashAggregate is an order of magnitude more efficient!
>=20
> Since the spec does not require DISTINCT to return sorted results, I don't
> believe Sort ... Unique will ever be more efficient than HashAggregate.
>=20
> Therefore, in order to maximize performance, DISTINCT should always be
> implemented as HashAggregate.

In 8.4 and above, SELECT DISTINCT can use HashAggregates rather than
Sort/Unique -- though I'm not sure it always does. Anyway, an upgrade should
handle this for you. However, this isn't really a bug, it's a feature, so
don't expect to see it in an 8.3.x version.

-- Josh / eggyknap

Re: BUG #5231: SELECT DISTINCT poorly implemented vs SELECT ... GROUP BY

From

Alvaro Herrera

Date:

03 December 2009, 13:50:03

Thomas Hamilton wrote:

> When run against a large dataset with a small number of distinct results
> HashAggregate is an order of magnitude more efficient!
>
> Since the spec does not require DISTINCT to return sorted results, I don't
> believe Sort ... Unique will ever be more efficient than HashAggregate.

This is not a bug -- merely a missing optimizer feature, which as it
turns out was implemented in 8.4.

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.