Thread: queries with DISTINCT / GROUP BY giving different plans
Hi all,

I've run into a strange plan difference on 9.1.9 - the first query does
"DISTINCT" by doing a GROUP BY on the columns (both INT):

SELECT "f_account"."name_id" AS "a_1550",
       "f_message"."text_id" AS "a_1562"
  FROM "f_accountmessagefact"
 INNER JOIN "f_message"
    ON ( "f_accountmessagefact"."message_id" = "f_message"."id" )
 INNER JOIN "f_account"
    ON ( "f_accountmessagefact"."account_id" = "f_account"."id" )
 GROUP BY 1, 2;

                                        QUERY PLAN
-----------------------------------------------------------------------------------------
 Group  (cost=3575011.59..3721066.43 rows=19473978 width=8)
   ->  Sort  (cost=3575011.59..3623696.54 rows=19473978 width=8)
         Sort Key: f_account.name_id, f_message.text_id
         ->  Hash Join  (cost=51718.44..1217195.39 rows=19473978 width=8)
               Hash Cond: (f_accountmessagefact.account_id = f_account.id)
               ->  Hash Join  (cost=51699.42..949409.18 rows=19473978 width=8)
                     Hash Cond: (f_accountmessagefact.message_id = f_message.id)
                     ->  Seq Scan on f_accountmessagefact  (cost=0.00..435202.78 rows=19473978 width=8)
                     ->  Hash  (cost=37002.52..37002.52 rows=1175752 width=8)
                           ->  Seq Scan on f_message  (cost=0.00..37002.52 rows=1175752 width=8)
               ->  Hash  (cost=11.23..11.23 rows=623 width=8)
                     ->  Seq Scan on f_account  (cost=0.00..11.23 rows=623 width=8)
(12 rows)

Now, this takes ~45 seconds to execute, but after rewriting the query to
use a regular DISTINCT it suddenly switches to HashAggregate with ~1/3 the
cost (although it produces the same output, AFAIK), and it executes in ~15
seconds.

SELECT DISTINCT "f_account"."name_id" AS "a_1550",
                "f_message"."text_id" AS "a_1562"
  FROM "f_accountmessagefact"
 INNER JOIN "f_message"
    ON ( "f_accountmessagefact"."message_id" = "f_message"."id" )
 INNER JOIN "f_account"
    ON ( "f_accountmessagefact"."account_id" = "f_account"."id" );

                                        QUERY PLAN
------------------------------------------------------------------------------------------
 HashAggregate  (cost=1314565.28..1509305.06 rows=19473978 width=8)
   ->  Hash Join  (cost=51718.44..1217195.39 rows=19473978 width=8)
         Hash Cond: (f_accountmessagefact.account_id = f_account.id)
         ->  Hash Join  (cost=51699.42..949409.18 rows=19473978 width=8)
               Hash Cond: (f_accountmessagefact.message_id = f_message.id)
               ->  Seq Scan on f_accountmessagefact  (cost=0.00..435202.78 rows=19473978 width=8)
               ->  Hash  (cost=37002.52..37002.52 rows=1175752 width=8)
                     ->  Seq Scan on f_message  (cost=0.00..37002.52 rows=1175752 width=8)
         ->  Hash  (cost=11.23..11.23 rows=623 width=8)
               ->  Seq Scan on f_account  (cost=0.00..11.23 rows=623 width=8)
(10 rows)

I've tested this with other queries and those actually behave as expected
(both using HashAggregate), so I'm wondering what's wrong with this one
and why the planner discards a plan with a much lower cost. Any ideas?
The estimates are quite exact (and exactly the same for both queries).

BTW I can't test this on 9.2 or 9.3 easily, as this is our production
environment and I can't just export the data. I've tried to simulate this
but so far no luck.

regards
Tomas
"Tomas Vondra" <tv@fuzzy.cz> writes: > I've run into a strange plan difference on 9.1.9 - the first query does > "DISTINCT" by doing a GROUP BY on the columns (both INT). ... > Now, this takes ~45 seconds to execute, but after rewriting the query to > use the regular DISTINCT it suddenly switches to HashAggregate with ~1/3 > the cost (although it produces the same output, AFAIK), and it executes in > ~15 seconds. [ scratches head... ] I guess you're running into some corner case where choose_hashed_grouping and choose_hashed_distinct make different choices. It's going to be tough to debug without a test case though. I couldn't reproduce the behavior in a few tries here. > BTW I can't test this on 9.2 or 9.3 easily, as this is our production > environment and I can't just export the data. I've tried to simulate this > but so far no luck. I suppose they won't yet you step through those two functions with a debugger either ... regards, tom lane
On 14.8.2013 20:35, Tom Lane wrote:
> "Tomas Vondra" <tv@fuzzy.cz> writes:
>> I've run into a strange plan difference on 9.1.9 - the first query
>> does "DISTINCT" by doing a GROUP BY on the columns (both INT). ...
>> Now, this takes ~45 seconds to execute, but after rewriting the
>> query to use the regular DISTINCT it suddenly switches to
>> HashAggregate with ~1/3 the cost (although it produces the same
>> output, AFAIK), and it executes in ~15 seconds.
>
> [ scratches head... ] I guess you're running into some corner case
> where choose_hashed_grouping and choose_hashed_distinct make
> different choices. It's going to be tough to debug without a test
> case though. I couldn't reproduce the behavior in a few tries here.
>
>> BTW I can't test this on 9.2 or 9.3 easily, as this is our
>> production environment and I can't just export the data. I've tried
>> to simulate this but so far no luck.
>
> I suppose they won't let you step through those two functions with a
> debugger either ...

I've managed to get the data to a different machine, and I've spent some
time on debugging it. It seems that the difference is in evaluating
hashentrysize - while
On 14.8.2013 20:35, Tom Lane wrote:
> "Tomas Vondra" <tv@fuzzy.cz> writes:
>> I've run into a strange plan difference on 9.1.9 - the first query
>> does "DISTINCT" by doing a GROUP BY on the columns (both INT). ...
>> Now, this takes ~45 seconds to execute, but after rewriting the
>> query to use the regular DISTINCT it suddenly switches to
>> HashAggregate with ~1/3 the cost (although it produces the same
>> output, AFAIK), and it executes in ~15 seconds.
>
> [ scratches head... ] I guess you're running into some corner case
> where choose_hashed_grouping and choose_hashed_distinct make
> different choices. It's going to be tough to debug without a test
> case though. I couldn't reproduce the behavior in a few tries here.
>
>> BTW I can't test this on 9.2 or 9.3 easily, as this is our
>> production environment and I can't just export the data. I've tried
>> to simulate this but so far no luck.
>
> I suppose they won't let you step through those two functions with a
> debugger either ...

OK, this time the complete message ...

I've managed to get the data to a different machine, and I've spent some
time on debugging it. It seems that the difference is in evaluating
hashentrysize - choose_hashed_distinct does this:

    /*
     * Don't do it if it doesn't look like the hashtable will fit into
     * work_mem.
     */
    hashentrysize = MAXALIGN(path_width) + MAXALIGN(sizeof(MinimalTupleData));

    if (hashentrysize * dNumDistinctRows > work_mem * 1024L)
        return false;

while choose_hashed_grouping does this:

    /* Estimate per-hash-entry space at tuple width... */
    hashentrysize = MAXALIGN(path_width) + MAXALIGN(sizeof(MinimalTupleData));
    /* plus space for pass-by-ref transition values... */
    hashentrysize += agg_costs->transitionSpace;
    /* plus the per-hash-entry overhead */
    hashentrysize += hash_agg_entry_size(agg_costs->numAggs);

    if (hashentrysize * dNumGroups > work_mem * 1024L)
        return false;

In both cases the common parameter values are

    dNumGroups = dNumDistinctRows = 20451018
    work_mem = 819200

but the resulting hashentrysize is 24 (choose_hashed_distinct) or 56
(choose_hashed_grouping). As a result, _distinct evaluates the condition
as false, while _grouping evaluates it as true (and thus returns false).

Now, the difference between 24 and 56 is caused by hash_agg_entry_size.
It's called with numAggs=0 but returns 32. I'm wondering if it should
return 0 in such cases, i.e. something like this:

    Size
    hash_agg_entry_size(int numAggs)
    {
        Size        entrysize;

        if (numAggs == 0)
            return 0;

        /* This must match build_hash_table */
        entrysize = sizeof(AggHashEntryData) +
            (numAggs - 1) * sizeof(AggStatePerGroupData);
        entrysize = MAXALIGN(entrysize);
        /* Account for hashtable overhead (assuming fill factor = 1) */
        entrysize += 3 * sizeof(void *);
        return entrysize;
    }

I've tested that after this both queries use HashAggregate (which is the
right choice), but I haven't done any extensive checking so maybe I'm
missing something.

Tomas
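To make the arithmetic above concrete, here is a small stand-alone C
sketch (not PostgreSQL code, just the same comparison replayed with the
constants reported in this message) showing why the two estimates land on
opposite sides of the work_mem limit:

    #include <stdio.h>

    int
    main(void)
    {
        /* values reported above for this query */
        double  dNumGroups = 20451018;          /* = dNumDistinctRows */
        long    work_mem   = 819200;            /* kB, i.e. 800MB */
        double  limit      = work_mem * 1024.0; /* bytes */

        double  distinct_entry = 24;    /* choose_hashed_distinct's estimate */
        double  grouping_entry = 56;    /* choose_hashed_grouping's estimate */

        /* ~468 MB - fits, so choose_hashed_distinct allows HashAggregate */
        printf("distinct: %.0f > %.0f ? %s\n", distinct_entry * dNumGroups,
               limit, distinct_entry * dNumGroups > limit ? "yes" : "no");

        /* ~1092 MB - over the limit, so choose_hashed_grouping returns false */
        printf("grouping: %.0f > %.0f ? %s\n", grouping_entry * dNumGroups,
               limit, grouping_entry * dNumGroups > limit ? "yes" : "no");

        return 0;
    }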
On 16.8.2013 21:36, Tomas Vondra wrote:
>
> Now, the difference between 24 and 56 is caused by hash_agg_entry_size.
> It's called with numAggs=0 but returns 32. I'm wondering if it should
> return 0 in such cases, i.e. something like this:
>
>     Size
>     hash_agg_entry_size(int numAggs)
>     {
>         Size        entrysize;
>
>         if (numAggs == 0)
>             return 0;
>
>         /* This must match build_hash_table */
>         entrysize = sizeof(AggHashEntryData) +
>             (numAggs - 1) * sizeof(AggStatePerGroupData);
>         entrysize = MAXALIGN(entrysize);
>         /* Account for hashtable overhead (assuming fill factor = 1) */
>         entrysize += 3 * sizeof(void *);
>         return entrysize;
>     }
>
> I've tested that after this both queries use HashAggregate (which is the
> right choice), but I haven't done any extensive checking so maybe I'm
> missing something.

So, is this a sufficient / correct explanation? Any comments about the fix
I suggested?

Or should I try to get permission to provide the data so that you can
reproduce the issue on your own? That might take a few days to get
through.

Tomas
Tomas Vondra <tv@fuzzy.cz> writes:
> I've managed to get the data to a different machine, and I've spent some
> time on debugging it.

Great, thanks for looking into it!

> It seems that the difference is in evaluating hashentrysize
> [ choose_hashed_distinct omits hash_agg_entry_size() ]
> but the resulting hashentrysize is 24 (choose_hashed_distinct) or 56
> (choose_hashed_grouping). As a result, _distinct evaluates the condition
> as false, while _grouping evaluates it as true (and thus returns false).

Hah.

> Now, the difference between 24 and 56 is caused by hash_agg_entry_size.
> It's called with numAggs=0 but returns 32. I'm wondering if it should
> return 0 in such cases, i.e. something like this:

No, I don't think so.  I'm pretty sure the reason choose_hashed_distinct
is like that is that I subconsciously assumed hash_agg_entry_size would
produce zero for numAggs = 0; but in fact it does not and should not,
because there's still some overhead for the per-group hash entry whether
or not there are any aggregates.  So the right fix is that
choose_hashed_distinct should add hash_agg_entry_size(0) onto its
hashentrysize estimate.

A separate issue is that the use of numAggs-1 in hash_agg_entry_size's
calculations seems a bit risky if numAggs can be zero - I'm not sure we
can rely on compilers to get that right.  I'm inclined to replace that
with use of offsetof.  Likewise in build_hash_table.

> I've tested that after this both queries use HashAggregate (which is the
> right choice), but I haven't done any extensive checking so maybe I'm
> missing something.

It might be the preferable choice in this example, but you're looking at
an edge case.  If you want the thing to be using a hash aggregate for
this size of problem, you should increase work_mem.

			regards, tom lane
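For readers without the sources at hand, the offsetof-based variant Tom
describes would presumably look something like the sketch below. This is
only an illustration of the idea, not the committed patch, and it assumes
the 9.1-era AggHashEntryData layout whose last member is a pergroup[1]
array:

    Size
    hash_agg_entry_size(int numAggs)
    {
        Size        entrysize;

        /* This must match build_hash_table */
        entrysize = offsetof(AggHashEntryData, pergroup) +
            numAggs * sizeof(AggStatePerGroupData);
        entrysize = MAXALIGN(entrysize);
        /* Account for hashtable overhead (assuming fill factor = 1) */
        entrysize += 3 * sizeof(void *);
        return entrysize;
    }

With this form, numAggs = 0 no longer relies on numAggs-1 arithmetic and
still returns a nonzero per-entry overhead, which is the behavior Tom says
choose_hashed_distinct should be accounting for via hash_agg_entry_size(0).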
On 20.8.2013 18:24, Tom Lane wrote:
> Tomas Vondra <tv@fuzzy.cz> writes:
>> I've managed to get the data to a different machine, and I've spent
>> some time on debugging it.
>
> Great, thanks for looking into it!
>
>> It seems that the difference is in evaluating hashentrysize
>> [ choose_hashed_distinct omits hash_agg_entry_size() ] but the
>> resulting hashentrysize is 24 (choose_hashed_distinct) or 56
>> (choose_hashed_grouping). As a result, _distinct evaluates the
>> condition as false, while _grouping evaluates it as true (and thus
>> returns false).
>
> Hah.
>
>> Now, the difference between 24 and 56 is caused by
>> hash_agg_entry_size. It's called with numAggs=0 but returns 32. I'm
>> wondering if it should return 0 in such cases, i.e. something like
>> this:
>
> No, I don't think so. I'm pretty sure the reason
> choose_hashed_distinct is like that is that I subconsciously assumed
> hash_agg_entry_size would produce zero for numAggs = 0; but in fact
> it does not and should not, because there's still some overhead for
> the per-group hash entry whether or not there are any aggregates. So
> the right fix is that choose_hashed_distinct should add
> hash_agg_entry_size(0) onto its hashentrysize estimate.
>
> A separate issue is that the use of numAggs-1 in
> hash_agg_entry_size's calculations seems a bit risky if numAggs can
> be zero - I'm not sure we can rely on compilers to get that right.
> I'm inclined to replace that with use of offsetof. Likewise in
> build_hash_table.
>
>> I've tested that after this both queries use HashAggregate (which
>> is the right choice), but I haven't done any extensive checking so
>> maybe I'm missing something.
>
> It might be the preferable choice in this example, but you're looking
> at an edge case. If you want the thing to be using a hash aggregate
> for this size of problem, you should increase work_mem.

Hmmm. I think the main 'issue' here is that the queries behave quite
differently although it seems like they should do the same thing (well,
I understand they're not the same).

We're already using work_mem='800MB', so there's not much room to increase
it. Actually, this is probably the main reason why we haven't seen this
issue more often: the other datasets are smaller (but that won't last for
long, because of steady growth).

A complete EXPLAIN ANALYZE for the HashAggregate plan is available here:

    http://explain.depesz.com/s/jCO

The estimates seem to be pretty exact, except for the very last step:

    HashAggregate  (cost=1399795.00..1604305.06 rows=20451006 width=8)
                   (actual time=13985.580..14106.708 rows=355600 loops=1)

So the estimate is ~60x higher than the actual value, which then happens
to work for choose_hashed_distinct (because it uses a much lower value for
hashentrysize), but for choose_hashed_grouping it is actually above the
threshold.

Then again, the actual number of rows is much lower than the estimate, so
the amount of memory actually used is well within work_mem and does not
cause any trouble with OOM. So I don't think increasing work_mem is a good
long-term solution here, because the main problem is the estimate.

Another sign that I should probably start working on the multi-column
indexes I've been planning for a long time ...

Anyway, I still don't understand why the same logic around
hash_agg_entry_size should not apply to choose_hashed_grouping as well?
Well, it would make it slower in this particular corner case, but wouldn't
it be more correct?

Tomas
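Putting rough numbers on "well within work_mem", using the 56-byte
per-entry estimate from choose_hashed_grouping discussed earlier (these
are back-of-the-envelope figures, not planner output):

    estimated:  20451006 groups * 56 B  ~ 1092 MB  (over work_mem = 800MB,
                                                    so the hash plan is refused)
    actual:       355600 groups * 56 B  ~   19 MB  (easily within work_mem)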
Tomas Vondra <tv@fuzzy.cz> writes:
> On 20.8.2013 18:24, Tom Lane wrote:
>> No, I don't think so. I'm pretty sure the reason
>> choose_hashed_distinct is like that is that I subconsciously assumed
>> hash_agg_entry_size would produce zero for numAggs = 0; but in fact
>> it does not and should not, because there's still some overhead for
>> the per-group hash entry whether or not there are any aggregates. So
>> the right fix is that choose_hashed_distinct should add
>> hash_agg_entry_size(0) onto its hashentrysize estimate.

> Hmmm. I think the main 'issue' here is that the queries behave quite
> differently although it seems like they should do the same thing (well,
> I understand they're not the same).

They are the same, assuming we choose to use hashed grouping for both.

It's somewhat unfortunate that the planner treats work_mem as a hard
boundary; if the estimated memory requirement is 1 byte over the limit,
it will not consider a hashed aggregation, period.  It might be better if
we applied some kind of sliding penalty.  On the other hand, given that
the calculation depends on an ndistinct estimate that's frequently pretty
bad, it doesn't seem wise to me to have a calculation that encourages use
of hashed aggregation by underestimating the space needed per row; and
the current code in choose_hashed_distinct is definitely doing that.

A quick experiment (grouping a single float8 column) says that the actual
space consumption per group is about 80 bytes, using HEAD on a 64-bit
machine.  This compares to choose_hashed_grouping's estimate of 56 bytes
and choose_hashed_distinct's estimate of 24.  I think the remaining
discrepancy is because the estimation code isn't allowing for palloc
overhead --- the actual space for each representative tuple is really more
than MAXALIGN(data_width) + MAXALIGN(sizeof(tuple_header)).  I'm not
entirely sure whether we should add in the palloc overhead, but I am
pretty sure that choose_hashed_distinct isn't doing anyone any favors by
being wrong by a factor of 3.

> Anyway, I still don't understand why the same logic around
> hash_agg_entry_size should not apply to choose_hashed_grouping as well?
> Well, it would make it slower in this particular corner case, but
> wouldn't it be more correct?

choose_hashed_grouping has it right, or at least more nearly right.
choose_hashed_distinct is simply failing to account for space that will
in fact be consumed.  Not fixing that is not a good way to deal with
inaccurate number-of-groups estimates; if that estimate is low rather
than high, the consequences will be a lot worse than they are here.

			regards, tom lane
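Collecting the per-entry numbers mentioned so far in one place (the
breakdown below is an interpretation of the figures quoted in this thread,
not planner output):

    MAXALIGN(path_width), two ints / one float8         8 B
    MAXALIGN(sizeof(MinimalTupleData))                  16 B  -> 24 B  (choose_hashed_distinct)
    hash_agg_entry_size(0)                              32 B  -> 56 B  (choose_hashed_grouping)
    measured consumption per group (HEAD, 64-bit)      ~80 B  -> ~24 B/group unaccounted for,
                                                                 attributed above to palloc overhead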
On 20.8.2013 23:02, Tom Lane wrote:
> Tomas Vondra <tv@fuzzy.cz> writes:
>
>> Anyway, I still don't understand why the same logic around
>> hash_agg_entry_size should not apply to choose_hashed_grouping as
>> well? Well, it would make it slower in this particular corner case,
>> but wouldn't it be more correct?

Meh, I meant it the other way around - applying the hashentrysize logic
from hashed_grouping to hashed_distinct, so that both use 56B.

> choose_hashed_grouping has it right, or at least more nearly right.
> choose_hashed_distinct is simply failing to account for space that
> will in fact be consumed. Not fixing that is not a good way to deal
> with inaccurate number-of-groups estimates; if that estimate is low
> rather than high, the consequences will be a lot worse than they are
> here.

Not quite sure how to parse this (not a native speaker here, sorry).
Does that mean we want to keep it as it is now (because fixing it would
cause even worse errors with low estimates)? Or do we want to fix
hashed_distinct so that it behaves like hashed_grouping?

Tomas
Tomas Vondra <tv@fuzzy.cz> writes:
> Not quite sure how to parse this (not a native speaker here, sorry).
> Does that mean we want to keep it as it is now (because fixing it would
> cause even worse errors with low estimates)? Or do we want to fix
> hashed_distinct so that it behaves like hashed_grouping?

We need to fix hashed_distinct like this:

diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index bcc0d45..99284cb 100644
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*************** choose_hashed_distinct(PlannerInfo *root
*** 2848,2854 ****
--- 2848,2858 ----
  	 * Don't do it if it doesn't look like the hashtable will fit into
  	 * work_mem.
  	 */
+
+ 	/* Estimate per-hash-entry space at tuple width... */
  	hashentrysize = MAXALIGN(path_width) + MAXALIGN(sizeof(MinimalTupleData));
+ 	/* plus the per-hash-entry overhead */
+ 	hashentrysize += hash_agg_entry_size(0);

  	if (hashentrysize * dNumDistinctRows > work_mem * 1024L)
  		return false;

I've started a thread over in -hackers about whether it's prudent to
back-patch this change or not.

			regards, tom lane
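For readability, this is what that section of choose_hashed_distinct reads
like with the hunk above applied (simply the excerpt quoted earlier in the
thread plus the added lines, reassembled):

    /*
     * Don't do it if it doesn't look like the hashtable will fit into
     * work_mem.
     */

    /* Estimate per-hash-entry space at tuple width... */
    hashentrysize = MAXALIGN(path_width) + MAXALIGN(sizeof(MinimalTupleData));
    /* plus the per-hash-entry overhead */
    hashentrysize += hash_agg_entry_size(0);

    if (hashentrysize * dNumDistinctRows > work_mem * 1024L)
        return false;

With hash_agg_entry_size(0) included, choose_hashed_distinct and
choose_hashed_grouping use the same per-entry figure for a plain DISTINCT
with no aggregates (56 bytes in the case reported here), so the two query
forms should again make the same hash-vs-sort decision.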