Re: [GENERAL] DISTINCT vs GROUP BY - was Re: is (not) distinct from - Mailing list pgsql-general

From George Neuner
Subject Re: [GENERAL] DISTINCT vs GROUP BY - was Re: is (not) distinct from
Date
Msg-id k0kkbctm5k1j3ui2jmjaftrspj3erjilk1@4ax.com
Whole thread Raw
In response to [GENERAL] is (not) distinct from  (Johann Spies <johann.spies@gmail.com>)
List pgsql-general
Hi David,

On Sat, 4 Mar 2017 02:32:48 +1300, David Rowley
<david.rowley@2ndquadrant.com> wrote:

>On 3 March 2017 at 18:26, George Neuner <gneuner2@comcast.net> wrote:
>> I know most people here don't pay much - or any - attention to
>> SQLServer, however there was an interesting article recently regarding
>> significant performance differences between DISTINCT and GROUP BY as
>> used to remove duplicates.
>>
>> https://sqlperformance.com/2017/01/t-sql-queries/surprises-assumptions-group-by-distinct
>>
>>
>> Now I'm wondering if something similar might be lurking in Postgresql?
>
>Yes things lurk there in PostgreSQL too. But to be honest I find the
>examples in the URL you included a bit strange. There's almost
>certainly going to be a table called "orders" that you'd use for the
>outer part of the query. In that case the orderid would already be
>unique.  To do the same in PostgreSQL you'd just use: select orderid,
>string_agg(description,'|') from orderitems group by orderid; assuming
>all orders had at least one line, you'd get the same result.


The author mentioned at the beginning that the simple queries:

  SELECT DISTINCT Description FROM Sales.OrderLines
  SELECT Description FROM Sales.OrderLines GROUP BY Description;

wouldn't display the subject behavior.

Of course, analyzing the much more complex queries is much more
difficult.  It begs the question: what actually is going on there?


But I don't use SQL Server ... my interest is in how Postgresql deals
with a similar situation.


>Assuming that parts.partcode is the PRIMARY KEY of parts, this query
>is legal in PostgreSQL. In some other databases, and I believe SQL
>Server might be one of them, you would have been forced to include
>part.description in the GROUP BY clause. Since PostgreSQL 9.6, if
>you'd have done the same with that, internally the database would
>ignore the parts.description in the GROUP BY clause, as its smart
>enough to know that including parts.description in the clause is not
>going to change anything as the description is always the same for
>each parts.partcode, and no two records can share the same partcode.
>
>There's no such optimisation when it comes to DISTINCT. In PostgreSQL
>as of today DISTINCT is a bit naive, and will just uniquify the
>results on each column in the select clause.  Although quite possibly
>the same optimisation could apply to DISTINCT too, just nobody has
>thought to add it yet.
>
>In short, the main difference is going to be the fewer columns you're
>using to identify the groups the better. If you included all of the
>columns in the GROUP BY clause as you put in the select list with the
>DISTINCT query then in most cases the performance would be the same. I
>believe the only exception to this is in regards to parallel query, as
>currently only GROUP BYs may be parallelised, not DISTINCT.
>
>Historically with older now unsupported versions of PostgreSQL (pre
>8.4) you may have also preferred to use GROUP BY over DISTINCT as
>GROUP BY could be implemented internally by sorting or hashing the
>results, whereas DISTINCT used to only be implemented by Sorting the
>results. Although this has long since been the case.


I often have occasion to use multiple mapping relations: e.g.,
  A{1}->B{N}
  C{1}->B{N}
together in a query where C is provided and I need to find the
corresponding A(s).  Frequently these queries result in the same A
being found multiple times.

Although the mapping tuples are small [usually just a pair of keys],
the number of rows in the mapping tables may be very large, and a
given query may need to join/work its way through several such
mappings.

Typically in such situations, I divide the query using CTEs and (try
to) minimize the volume of data at each step by filtering duplicates
from any results that might include them.

I have always used DISTINCT to filter duplication, reserving GROUP BY
for aggregations (counting, etc.).  But if I understand correctly, you
are saying that GROUP BY should be preferred even for the simpler use.


George

pgsql-general by date:

Previous
From: Adrian Klaver
Date:
Subject: Re: [GENERAL] Querying JSON Lists
Next
From: "Peter J. Holzer"
Date:
Subject: Re: [GENERAL] Autoanalyze oddity