Home > mailing lists

Re: queries with DISTINCT / GROUP BY giving different plans - Mailing list pgsql-performance

From	Tom Lane
Subject	Re: queries with DISTINCT / GROUP BY giving different plans
Date	August 20, 2013 16:25:07
Msg-id	7059.1377015896@sss.pgh.pa.us Whole thread Raw
In response to	Re: queries with DISTINCT / GROUP BY giving different plans (Tomas Vondra <tv@fuzzy.cz>)
Responses	Re: queries with DISTINCT / GROUP BY giving different plans
List	pgsql-performance

Tree view

Tomas Vondra <tv@fuzzy.cz> writes:
> I've managed to get the data to a different machine, and I've spent some
> time on debugging it.

Great, thanks for looking into it!

> It seems that the difference is in evaluating hashentrysize
> [ choose_hashed_distinct omits hash_agg_entry_size() ]
> but the hashentrysize size is 24 (choose_hashed_distinct) or 56
> (choose_hashed_grouping). This causes that while _distinct evaluates the
> condition as false, and _grouping as true (and thus returns false).

Hah.

> Now, the difference between 24 and 56 is caused by hash_agg_entry_size.
> It's called with numAggs=0 but returns 32. I'm wondering if it should
> return 0 in such cases, i.e. something like this:

No, I don't think so.  I'm pretty sure the reason choose_hashed_distinct
is like that is that I subconsciously assumed hash_agg_entry_size would
produce zero for numAggs = 0; but in fact it does not and should not,
because there's still some overhead for the per-group hash entry whether
or not there's any aggregates.  So the right fix is that
choose_hashed_distinct should add hash_agg_entry_size(0) onto its
hashentrysize estimate.

A separate issue is that the use of numAggs-1 in hash_agg_entry_size's
calculations seems a bit risky if numAggs can be zero - I'm not sure we
can rely on compilers to get that right.  I'm inclined to replace that
with use of offsetof.  Likewise in build_hash_table.

> I've tested that after this both queries use HashAggregate (which is the
> right choice), but I haven't done any extensive checking so maybe I'm
> missing something.

It might be the preferable choice in this example, but you're looking at
an edge case.  If you want the thing to be using a hash aggregate for
this size of problem, you should increase work_mem.

            regards, tom lane

pgsql-performance by date:

From: Tomas Vondra
Date: 20 August 2013, 15:13:18
Subject: Re: queries with DISTINCT / GROUP BY giving different plans

From: Tomas Vondra
Date: 20 August 2013, 17:56:09
Subject: Re: queries with DISTINCT / GROUP BY giving different plans

Re: queries with DISTINCT / GROUP BY giving different plans - Mailing list pgsql-performance

Previous

Next