Thread: queries with DISTINCT / GROUP BY giving different plans

From: "Tomas Vondra"
Hi all,

I've run into a strange plan difference on 9.1.9 - the first query does
"DISTINCT" by doing a GROUP BY on the columns (both INT).

SELECT
  "f_account"."name_id" AS "a_1550",
  "f_message"."text_id" AS "a_1562"
FROM "f_accountmessagefact"
   INNER JOIN "f_message" ON ( "f_accountmessagefact"."message_id" = "f_message"."id" )
   INNER JOIN "f_account" ON ( "f_accountmessagefact"."account_id" = "f_account"."id" )
GROUP BY 1, 2;

                                 QUERY PLAN
-----------------------------------------------------------------------------------------
 Group  (cost=3575011.59..3721066.43 rows=19473978 width=8)
   ->  Sort  (cost=3575011.59..3623696.54 rows=19473978 width=8)
         Sort Key: f_account.name_id, f_message.text_id
         ->  Hash Join  (cost=51718.44..1217195.39 rows=19473978 width=8)
               Hash Cond: (f_accountmessagefact.account_id = f_account.id)
               ->  Hash Join  (cost=51699.42..949409.18 rows=19473978 width=8)
                     Hash Cond: (f_accountmessagefact.message_id = f_message.id)
                     ->  Seq Scan on f_accountmessagefact  (cost=0.00..435202.78 rows=19473978 width=8)
                     ->  Hash  (cost=37002.52..37002.52 rows=1175752 width=8)
                           ->  Seq Scan on f_message  (cost=0.00..37002.52 rows=1175752 width=8)
               ->  Hash  (cost=11.23..11.23 rows=623 width=8)
                     ->  Seq Scan on f_account  (cost=0.00..11.23 rows=623 width=8)
(12 rows)

Now, this takes ~45 seconds to execute, but after rewriting the query to
use the regular DISTINCT it suddenly switches to HashAggregate with ~1/3
the cost (although it produces the same output, AFAIK), and it executes in
~15 seconds.

SELECT DISTINCT
  "f_account"."name_id" AS "a_1550",
  "f_message"."text_id" AS "a_1562"
FROM "f_accountmessagefact"
   INNER JOIN "f_message" ON ( "f_accountmessagefact"."message_id" = "f_message"."id" )
   INNER JOIN "f_account" ON ( "f_accountmessagefact"."account_id" = "f_account"."id" );

                                  QUERY PLAN
------------------------------------------------------------------------------------------
 HashAggregate  (cost=1314565.28..1509305.06 rows=19473978 width=8)
   ->  Hash Join  (cost=51718.44..1217195.39 rows=19473978 width=8)
         Hash Cond: (f_accountmessagefact.account_id = f_account.id)
         ->  Hash Join  (cost=51699.42..949409.18 rows=19473978 width=8)
               Hash Cond: (f_accountmessagefact.message_id = f_message.id)
                ->  Seq Scan on f_accountmessagefact  (cost=0.00..435202.78 rows=19473978 width=8)
                ->  Hash  (cost=37002.52..37002.52 rows=1175752 width=8)
                      ->  Seq Scan on f_message  (cost=0.00..37002.52 rows=1175752 width=8)
         ->  Hash  (cost=11.23..11.23 rows=623 width=8)
               ->  Seq Scan on f_account  (cost=0.00..11.23 rows=623 width=8)
(10 rows)

I've tested this with other queries and those actually behave as expected
(both using HashAggregate), so I'm wondering what's wrong with this one
and why it's discarding a plan with much lower cost. Any ideas?

The estimates are quite exact (and exactly the same for both queries).

BTW I can't test this on 9.2 or 9.3 easily, as this is our production
environment and I can't just export the data. I've tried to simulate this
but so far no luck.

regards
Tomas



Re: queries with DISTINCT / GROUP BY giving different plans

From: Tom Lane
"Tomas Vondra" <tv@fuzzy.cz> writes:
> I've run into a strange plan difference on 9.1.9 - the first query does
> "DISTINCT" by doing a GROUP BY on the columns (both INT). ...
> Now, this takes ~45 seconds to execute, but after rewriting the query to
> use the regular DISTINCT it suddenly switches to HashAggregate with ~1/3
> the cost (although it produces the same output, AFAIK), and it executes in
> ~15 seconds.

[ scratches head... ]  I guess you're running into some corner case where
choose_hashed_grouping and choose_hashed_distinct make different choices.
It's going to be tough to debug without a test case though.  I couldn't
reproduce the behavior in a few tries here.

> BTW I can't test this on 9.2 or 9.3 easily, as this is our production
> environment and I can't just export the data. I've tried to simulate this
> but so far no luck.

I suppose they won't let you step through those two functions with a
debugger either ...

            regards, tom lane


Re: queries with DISTINCT / GROUP BY giving different plans

From: Tomas Vondra
On 14.8.2013 20:35, Tom Lane wrote:
> "Tomas Vondra" <tv@fuzzy.cz> writes:
>> I've run into a strange plan difference on 9.1.9 - the first query
>> does "DISTINCT" by doing a GROUP BY on the columns (both INT). ...
>> Now, this takes ~45 seconds to execute, but after rewriting the
>> query to use the regular DISTINCT it suddenly switches to
>> HashAggregate with ~1/3 the cost (although it produces the same
>> output, AFAIK), and it executes in ~15 seconds.
>
> [ scratches head... ]  I guess you're running into some corner case
> where choose_hashed_grouping and choose_hashed_distinct make
> different choices. It's going to be tough to debug without a test
> case though.  I couldn't reproduce the behavior in a few tries here.
>
>> BTW I can't test this on 9.2 or 9.3 easily, as this is our
>> production environment and I can't just export the data. I've tried
>> to simulate this but so far no luck.
>
> I suppose they won't let you step through those two functions with a
> debugger either ...

I've managed to get the data to a different machine, and I've spent some
time on debugging it. It seems that the difference is in evaluating
hashentrysize - while




Re: queries with DISTINCT / GROUP BY giving different plans

From: Tomas Vondra
On 14.8.2013 20:35, Tom Lane wrote:
> "Tomas Vondra" <tv@fuzzy.cz> writes:
>> I've run into a strange plan difference on 9.1.9 - the first query
>> does "DISTINCT" by doing a GROUP BY on the columns (both INT). ...
>> Now, this takes ~45 seconds to execute, but after rewriting the
>> query to use the regular DISTINCT it suddenly switches to
>> HashAggregate with ~1/3 the cost (although it produces the same
>> output, AFAIK), and it executes in ~15 seconds.
>
> [ scratches head... ]  I guess you're running into some corner case
> where choose_hashed_grouping and choose_hashed_distinct make
> different choices. It's going to be tough to debug without a test
> case though.  I couldn't reproduce the behavior in a few tries here.
>
>> BTW I can't test this on 9.2 or 9.3 easily, as this is our
>> production environment and I can't just export the data. I've tried
>> to simulate this but so far no luck.
>
> I suppose they won't let you step through those two functions with a
> debugger either ...

OK, this time the complete message ...

I've managed to get the data to a different machine, and I've spent some
time on debugging it. It seems that the difference is in evaluating
hashentrysize - while choose_hashed_distinct does this:

  /*
   * Don't do it if it doesn't look like the hashtable will fit into
   * work_mem.
   */
  hashentrysize = MAXALIGN(path_width)
                + MAXALIGN(sizeof(MinimalTupleData));

  if (hashentrysize * dNumDistinctRows > work_mem * 1024L)
      return false;

while choose_hashed_grouping does this:

  /* Estimate per-hash-entry space at tuple width... */
  hashentrysize = MAXALIGN(path_width)
                + MAXALIGN(sizeof(MinimalTupleData));
  /* plus space for pass-by-ref transition values... */
  hashentrysize += agg_costs->transitionSpace;
  /* plus the per-hash-entry overhead */
  hashentrysize += hash_agg_entry_size(agg_costs->numAggs);

  if (hashentrysize * dNumGroups > work_mem * 1024L)
    return false;

In both cases the common parameter values are

  dNumGroups = dNumDistinctRows = 20451018
  work_mem = 819200

but hashentrysize is 24 (choose_hashed_distinct) or 56
(choose_hashed_grouping). As a result, _distinct evaluates the condition
as false (and keeps considering the hash plan), while _grouping evaluates
it as true (and thus returns false).

Now, the difference between 24 and 56 is caused by hash_agg_entry_size.
It's called with numAggs=0 but returns 32. I'm wondering if it should
return 0 in such cases, i.e. something like this:

    Size
    hash_agg_entry_size(int numAggs)
    {
        Size        entrysize;

        if (numAggs == 0)
            return 0;

        /* This must match build_hash_table */
        entrysize = sizeof(AggHashEntryData) +
            (numAggs - 1) * sizeof(AggStatePerGroupData);
        entrysize = MAXALIGN(entrysize);
        /* Account for hashtable overhead (assuming fill factor = 1) */
        entrysize += 3 * sizeof(void *);
        return entrysize;
    }

I've tested that after this both queries use HashAggregate (which is the
right choice), but I haven't done any extensive checking so maybe I'm
missing something.

Tomas


Re: queries with DISTINCT / GROUP BY giving different plans

From: Tomas Vondra
On 16.8.2013 21:36, Tomas Vondra wrote:
>
> Now, the difference between 24 and 56 is caused by hash_agg_entry_size.
> It's called with numAggs=0 but returns 32. I'm wondering if it should
> return 0 in such cases, i.e. something like this:
>
>     Size
>     hash_agg_entry_size(int numAggs)
>     {
>         Size        entrysize;
>
>         if (numAggs == 0)
>             return 0;
>
>         /* This must match build_hash_table */
>         entrysize = sizeof(AggHashEntryData) +
>             (numAggs - 1) * sizeof(AggStatePerGroupData);
>         entrysize = MAXALIGN(entrysize);
>         /* Account for hashtable overhead (assuming fill factor = 1) */
>         entrysize += 3 * sizeof(void *);
>         return entrysize;
>     }
>
> I've tested that after this both queries use HashAggregate (which is the
> right choice), but I haven't done any extensive checking so maybe I'm
> missing something.

So, is this a sufficient / correct explanation? Any comments about the
fix I suggested? Or should I try to get a permission to provide the data
so that you can reproduce the issue on your own? That might take a few
days to get through.

Tomas


Re: queries with DISTINCT / GROUP BY giving different plans

From: Tom Lane
Tomas Vondra <tv@fuzzy.cz> writes:
> I've managed to get the data to a different machine, and I've spent some
> time on debugging it.

Great, thanks for looking into it!

> It seems that the difference is in evaluating hashentrysize
> [ choose_hashed_distinct omits hash_agg_entry_size() ]
> but hashentrysize is 24 (choose_hashed_distinct) or 56
> (choose_hashed_grouping). As a result, _distinct evaluates the
> condition as false, while _grouping evaluates it as true (and thus
> returns false).

Hah.

> Now, the difference between 24 and 56 is caused by hash_agg_entry_size.
> It's called with numAggs=0 but returns 32. I'm wondering if it should
> return 0 in such cases, i.e. something like this:

No, I don't think so.  I'm pretty sure the reason choose_hashed_distinct
is like that is that I subconsciously assumed hash_agg_entry_size would
produce zero for numAggs = 0; but in fact it does not and should not,
because there's still some overhead for the per-group hash entry whether
or not there's any aggregates.  So the right fix is that
choose_hashed_distinct should add hash_agg_entry_size(0) onto its
hashentrysize estimate.

A separate issue is that the use of numAggs-1 in hash_agg_entry_size's
calculations seems a bit risky if numAggs can be zero - I'm not sure we
can rely on compilers to get that right.  I'm inclined to replace that
with use of offsetof.  Likewise in build_hash_table.

> I've tested that after this both queries use HashAggregate (which is the
> right choice), but I haven't done any extensive checking so maybe I'm
> missing something.

It might be the preferable choice in this example, but you're looking at
an edge case.  If you want the thing to be using a hash aggregate for
this size of problem, you should increase work_mem.

            regards, tom lane


Re: queries with DISTINCT / GROUP BY giving different plans

From: Tomas Vondra
On 20.8.2013 18:24, Tom Lane wrote:
> Tomas Vondra <tv@fuzzy.cz> writes:
>> I've managed to get the data to a different machine, and I've spent
>> some time on debugging it.
>
> Great, thanks for looking into it!
>
>> It seems that the difference is in evaluating hashentrysize [
>> choose_hashed_distinct omits hash_agg_entry_size() ] but the
>> hashentrysize is 24 (choose_hashed_distinct) or 56
>> (choose_hashed_grouping). As a result, _distinct evaluates the
>> condition as false, while _grouping evaluates it as true (and thus
>> returns false).
>
> Hah.
>
>> Now, the difference between 24 and 56 is caused by
>> hash_agg_entry_size. It's called with numAggs=0 but returns 32. I'm
>> wondering if it should return 0 in such cases, i.e. something like
>> this:
>
> No, I don't think so.  I'm pretty sure the reason
> choose_hashed_distinct is like that is that I subconsciously assumed
> hash_agg_entry_size would produce zero for numAggs = 0; but in fact
> it does not and should not, because there's still some overhead for
> the per-group hash entry whether or not there's any aggregates.  So
> the right fix is that choose_hashed_distinct should add
> hash_agg_entry_size(0) onto its hashentrysize estimate.
>
> A separate issue is that the use of numAggs-1 in
> hash_agg_entry_size's calculations seems a bit risky if numAggs can
> be zero - I'm not sure we can rely on compilers to get that right.
> I'm inclined to replace that with use of offsetof.  Likewise in
> build_hash_table.
>
>> I've tested that after this both queries use HashAggregate (which
>> is the right choice), but I haven't done any extensive checking so
>> maybe I'm missing something.
>
> It might be the preferable choice in this example, but you're looking
> at an edge case.  If you want the thing to be using a hash aggregate
> for this size of problem, you should increase work_mem.

Hmmm. I think the main 'issue' here is that the queries behave quite
differently although it seems like they should do the same thing (well,
I understand they're not the same).

We're already using work_mem='800MB' so there's not much room to increase
it. That's probably the main reason we haven't seen this issue more
often: the other datasets are smaller (but that won't last long, given
the steady growth).

A complete explain analyze for the HashAggregate plan is available here:
http://explain.depesz.com/s/jCO The estimates seem to be pretty exact,
except for the very last step:

 HashAggregate  (cost=1399795.00..1604305.06 rows=20451006 width=8)
                (actual time=13985.580..14106.708 rows=355600 loops=1)

So the estimate is ~60x higher than the actual value, which happens to
work out for choose_hashed_distinct (because it uses a much lower value
for hashentrysize), but for choose_hashed_grouping it lands above the
threshold.

But then again, the actual number of rows is much lower than the
estimate, so the memory actually used stays well within work_mem and
there's no trouble with OOM.

So I don't think increasing work_mem is a good long-term solution here,
because the main problem is the estimate. It's another sign that I should
probably start working on the multi-column indexes, as I've planned to
for a long time ...

Anyway, I still don't understand why the same logic around
hash_agg_entry_size should not apply to choose_hashed_grouping as well?
Well, it would make it slower in this particular corner case, but
wouldn't it be more correct?

Tomas


Re: queries with DISTINCT / GROUP BY giving different plans

From: Tom Lane
Tomas Vondra <tv@fuzzy.cz> writes:
> On 20.8.2013 18:24, Tom Lane wrote:
>> No, I don't think so.  I'm pretty sure the reason
>> choose_hashed_distinct is like that is that I subconsciously assumed
>> hash_agg_entry_size would produce zero for numAggs = 0; but in fact
>> it does not and should not, because there's still some overhead for
>> the per-group hash entry whether or not there's any aggregates.  So
>> the right fix is that choose_hashed_distinct should add
>> hash_agg_entry_size(0) onto its hashentrysize estimate.

> Hmmm. I think the main 'issue' here is that the queries behave quite
> differently although it seems like they should do the same thing (well,
> I understand they're not the same).

They are the same, assuming we choose to use hashed grouping for both.

It's somewhat unfortunate that the planner treats work_mem as a hard
boundary; if the estimated memory requirement is 1 byte over the limit,
it will not consider a hashed aggregation, period.  It might be better if
we applied some kind of sliding penalty.  On the other hand, given that
the calculation depends on an ndistinct estimate that's frequently pretty
bad, it doesn't seem wise to me to have a calculation that's encouraging
use of hashed aggregation by underestimating the space needed per row;
and the current code in choose_hashed_distinct is definitely doing that.
A quick experiment (grouping a single float8 row) says that the actual
space consumption per group is about 80 bytes, using HEAD on a 64-bit
machine.  This compares to choose_hashed_grouping's estimate of 56 bytes
and choose_hashed_distinct's estimate of 24.  I think the remaining
discrepancy is because the estimation code isn't allowing for palloc
overhead --- the actual space for each representative tuple is really
more than MAXALIGN(data_width) + MAXALIGN(sizeof(tuple_header)).
I'm not entirely sure if we should add in the palloc overhead, but
I am pretty sure that choose_hashed_distinct isn't doing anyone any
favors by being wrong by a factor of 3.

> Anyway, I still don't understand why the same logic around
> hash_agg_entry_size should not apply to choose_hashed_grouping as well?
> Well, it would make it slower in this particular corner case, but
> wouldn't it be more correct?

choose_hashed_grouping has it right, or at least more nearly right.
choose_hashed_distinct is simply failing to account for space that
will in fact be consumed.  Not fixing that is not a good way to
deal with inaccurate number-of-groups estimates; if that estimate
is low rather than high, the consequences will be a lot worse than
they are here.

            regards, tom lane


Re: queries with DISTINCT / GROUP BY giving different plans

From: Tomas Vondra
On 20.8.2013 23:02, Tom Lane wrote:
> Tomas Vondra <tv@fuzzy.cz> writes:
>
>> Anyway, I still don't understand why the same logic around
>> hash_agg_entry_size should not apply to choose_hashed_grouping as
>> well? Well, it would make it slower in this particular corner case,
>> but wouldn't it be more correct?

Meh, I meant it the other way around - applying the hashentrysize logic
from hashed_grouping to hashed_distinct. So that both use 56B.

> choose_hashed_grouping has it right, or at least more nearly right.
> choose_hashed_distinct is simply failing to account for space that
> will in fact be consumed.  Not fixing that is not a good way to deal
> with inaccurate number-of-groups estimates; if that estimate is low
> rather than high, the consequences will be a lot worse than they are
> here.

Not quite sure how to parse this (not a native speaker here, sorry).
Does that mean we want to keep it as it is now (because fixing it would
cause even worse errors with low estimates)? Or do we want to fix
hashed_distinct so that it behaves like hashed_grouping?

Tomas


Re: queries with DISTINCT / GROUP BY giving different plans

From: Tom Lane
Tomas Vondra <tv@fuzzy.cz> writes:
> Not quite sure how to parse this (not a native speaker here, sorry).
> Does that mean we want to keep it as it is now (because fixing it would
> cause even worse errors with low estimates)? Or do we want to fix
> hashed_distinct so that it behaves like hashed_grouping?

We need to fix hashed_distinct like this:

diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index bcc0d45..99284cb 100644
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** choose_hashed_distinct(PlannerInfo *root
*** 2848,2854 ****
--- 2848,2858 ----
       * Don't do it if it doesn't look like the hashtable will fit into
       * work_mem.
       */
+
+     /* Estimate per-hash-entry space at tuple width... */
      hashentrysize = MAXALIGN(path_width) + MAXALIGN(sizeof(MinimalTupleData));
+     /* plus the per-hash-entry overhead */
+     hashentrysize += hash_agg_entry_size(0);

      if (hashentrysize * dNumDistinctRows > work_mem * 1024L)
          return false;

I've started a thread over in -hackers about whether it's prudent to
back-patch this change or not.

            regards, tom lane