Thread: explain HashAggregate to report bucket and memory stats
On Sun, Feb 17, 2019 at 11:29:56AM -0500, Jeff Janes wrote:
https://www.postgresql.org/message-id/CAMkU%3D1zBJNVo2DGYBgLJqpu8fyjCE_ys%2Bmsr6pOEoiwA7y5jrA%40mail.gmail.com
> What would I find very useful is [...] if the HashAggregate node under
> "explain analyze" would report memory and bucket stats; and if the Aggregate
> node would report...anything.

Find attached my WIP attempt to implement this.

Jeff: can you suggest what details Aggregate should show?

Justin
Attachment
On Fri, Jan 03, 2020 at 10:19:25AM -0600, Justin Pryzby wrote:
> On Sun, Feb 17, 2019 at 11:29:56AM -0500, Jeff Janes wrote:
> https://www.postgresql.org/message-id/CAMkU%3D1zBJNVo2DGYBgLJqpu8fyjCE_ys%2Bmsr6pOEoiwA7y5jrA%40mail.gmail.com
> > What would I find very useful is [...] if the HashAggregate node under
> > "explain analyze" would report memory and bucket stats; and if the Aggregate
> > node would report...anything.
>
> Find attached my WIP attempt to implement this.
>
> Jeff: can you suggest what details Aggregate should show?

Rebased on top of 10013684970453a0ddc86050bba813c611114321
and added to the commitfest: https://commitfest.postgresql.org/27/2428/
On Sun, Jan 26, 2020 at 08:14:25AM -0600, Justin Pryzby wrote:
> On Fri, Jan 03, 2020 at 10:19:25AM -0600, Justin Pryzby wrote:
> > On Sun, Feb 17, 2019 at 11:29:56AM -0500, Jeff Janes wrote:
> > https://www.postgresql.org/message-id/CAMkU%3D1zBJNVo2DGYBgLJqpu8fyjCE_ys%2Bmsr6pOEoiwA7y5jrA%40mail.gmail.com
> > > What would I find very useful is [...] if the HashAggregate node under
> > > "explain analyze" would report memory and bucket stats; and if the Aggregate
> > > node would report...anything.
> >
> > Find attached my WIP attempt to implement this.
> >
> > Jeff: can you suggest what details Aggregate should show?
>
> Rebased on top of 10013684970453a0ddc86050bba813c611114321
> and added to the commitfest: https://commitfest.postgresql.org/27/2428/

Attached for real.
Attachment
Hi,

On 2020-01-03 10:19:26 -0600, Justin Pryzby wrote:
> On Sun, Feb 17, 2019 at 11:29:56AM -0500, Jeff Janes wrote:
> https://www.postgresql.org/message-id/CAMkU%3D1zBJNVo2DGYBgLJqpu8fyjCE_ys%2Bmsr6pOEoiwA7y5jrA%40mail.gmail.com
> > What would I find very useful is [...] if the HashAggregate node under
> > "explain analyze" would report memory and bucket stats; and if the Aggregate
> > node would report...anything.

Yea, that'd be amazing.  It probably should be something every
execGrouping.c-using node can opt into.

Justin: As far as I can tell, you're trying to share one instrumentation state
between hashagg and hashjoins.  I'm doubtful that's a good idea.  The cases
are different enough that that's probably just going to be complicated,
without actually simplifying anything.

> Jeff: can you suggest what details Aggregate should show?

Memory usage, most importantly.  It probably makes sense to differentiate
between the memory for the hashtable itself and the tuples in it (since
they're allocated separately, and just having an overly large hashtable
doesn't hurt that much if it's not filled).

> diff --git a/src/backend/executor/execGrouping.c b/src/backend/executor/execGrouping.c
> index 3603c58..cf0fe3c 100644
> --- a/src/backend/executor/execGrouping.c
> +++ b/src/backend/executor/execGrouping.c
> @@ -203,6 +203,11 @@ BuildTupleHashTableExt(PlanState *parent,
>  	hashtable->hash_iv = 0;
>
>  	hashtable->hashtab = tuplehash_create(metacxt, nbuckets, hashtable);
> +	hashtable->hinstrument.nbuckets_original = nbuckets;
> +	hashtable->hinstrument.nbuckets = nbuckets;
> +	hashtable->hinstrument.space_peak = entrysize * hashtable->hashtab->size;

That's not actually an accurate accounting of memory, because for filled
entries a lot of memory is used to store the actual tuples:

    static TupleHashEntryData *
    lookup_hash_entry(AggState *aggstate)
    ...
        /* find or create the hashtable entry using the filtered tuple */
        entry = LookupTupleHashEntry(perhash->hashtable, hashslot, &isnew);

        if (isnew)
        {
            AggStatePerGroup pergroup;
            int         transno;

            pergroup = (AggStatePerGroup)
                MemoryContextAlloc(perhash->hashtable->tablecxt,
                                   sizeof(AggStatePerGroupData) * aggstate->numtrans);
            entry->additional = pergroup;

Since the memory doesn't actually shrink unless the hashtable is destroyed or
reset, it'd probably be sensible to compute the memory usage either at reset,
or at the end of the query.

Greetings,

Andres Freund
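[Editor's note: a minimal sketch of the capture-at-reset idea described above.
It uses the existing simplehash size/members counters and a hook in
ResetTupleHashTable(); the instrument field names anticipate the hash/tuples
split Andres suggests and are illustrative, not part of the posted diff.]

    /*
     * Sketch only: fold the current usage into running peaks right before the
     * table is reset, because memory does not shrink until the table is
     * destroyed or reset.
     */
    void
    ResetTupleHashTable(TupleHashTable hashtable)
    {
        /* bucket array owned by the simplehash table itself */
        hashtable->instrument.space_peak_hash =
            Max(hashtable->instrument.space_peak_hash,
                hashtable->hashtab->size * sizeof(TupleHashEntryData));

        /* stored entries; entrysize includes the caller's additionalsize */
        hashtable->instrument.space_peak_tuples =
            Max(hashtable->instrument.space_peak_tuples,
                hashtable->hashtab->members * hashtable->entrysize);

        tuplehash_reset(hashtable->hashtab);
    }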
On Mon, Feb 03, 2020 at 06:53:01AM -0800, Andres Freund wrote:
> On 2020-01-03 10:19:26 -0600, Justin Pryzby wrote:
> > On Sun, Feb 17, 2019 at 11:29:56AM -0500, Jeff Janes wrote:
> > https://www.postgresql.org/message-id/CAMkU%3D1zBJNVo2DGYBgLJqpu8fyjCE_ys%2Bmsr6pOEoiwA7y5jrA%40mail.gmail.com
> > > What would I find very useful is [...] if the HashAggregate node under
> > > "explain analyze" would report memory and bucket stats; and if the Aggregate
> > > node would report...anything.
>
> Yea, that'd be amazing.  It probably should be something every
> execGrouping.c-using node can opt into.

Do you think it should be implemented in execGrouping/TupleHashTableData (as
I did)?  I also did an experiment moving it into the higher-level nodes, but
I guess that's not actually desirable.

There's currently different output from tests between the implementation
using execGrouping.c and the one outside it, so there's at least an issue
with grouping sets.

> > +	hashtable->hinstrument.nbuckets_original = nbuckets;
> > +	hashtable->hinstrument.nbuckets = nbuckets;
> > +	hashtable->hinstrument.space_peak = entrysize * hashtable->hashtab->size;
>
> That's not actually an accurate accounting of memory, because for filled
> entries a lot of memory is used to store the actual tuples:

Thanks - I think I finally understood this.

I updated some existing tests to show the new output.  I imagine that's a
throwaway commit, and we should eventually add new tests for each of these
node types under explain analyze.

I've been testing the various nodes like:

--heapscan:
DROP TABLE t;
CREATE TABLE t (i int unique) WITH (autovacuum_enabled=off);
INSERT INTO t SELECT generate_series(1,99999);
SET enable_seqscan=off; SET parallel_tuple_cost=0; SET parallel_setup_cost=0; SET enable_indexonlyscan=off;
explain analyze verbose SELECT * FROM t WHERE i BETWEEN 999 and 99999999;

--setop:
explain (analyze,verbose) SELECT * FROM generate_series(1,999) EXCEPT (SELECT NULL UNION ALL SELECT * FROM generate_series(1,99999));
   Buckets: 2048 (originally 256)  Memory Usage: hashtable: 48kB, tuples: 8Kb

--recursive union:
explain analyze verbose WITH RECURSIVE t(n) AS (
    SELECT 'foo'
    UNION
    SELECT n || ' bar' FROM t WHERE length(n) < 9999)
SELECT n, n IS OF (text) AS is_text FROM t;

--subplan:
explain analyze verbose SELECT i FROM generate_series(1,999) i WHERE (i,i) NOT IN (SELECT 1,1 UNION ALL SELECT j,j FROM generate_series(1,99999) j);
   Buckets: 262144 (originally 131072)  Memory Usage: hashtable: 6144kB, tuples: 782Kb
explain analyze verbose select i FROM generate_series(1,999) i WHERE (1,i) NOT IN (select i, null::int from t);

--Agg:
explain (analyze,verbose) SELECT a, COUNT(1) FROM generate_series(1,99999) a GROUP BY 1;
   Buckets: 262144 (originally 256)  Memory Usage: hashtable: 6144kB, tuples: 782Kb
explain (analyze, verbose) select i FROM generate_series(1,999) i WHERE (1,1) NOT IN (select a, null from (SELECT generate_series(1,99999) a) x);
explain analyze verbose select * from (SELECT a FROM generate_series(1,99) a) v
    left join lateral (select v.a, four, ten, count(*) from (SELECT b four, 2 ten, b FROM generate_series(1,999) b) x group by cube(four,ten)) s on true
    order by v.a, four, ten;

--Grouping sets:
explain analyze verbose select unique1, count(two), count(four), count(ten), count(hundred), count(thousand), count(twothousand), count(*)
    from tenk1 group by grouping sets (unique1,twothousand,thousand,hundred,ten,four,two);

-- Justin
Attachment
- v2-0001-Run-some-existing-tests-with-explain-ANALYZE.patch
- v2-0002-explain-to-show-tuplehash-bucket-and-memory-stats.patch
- v2-0003-Gross-hack-to-put-hash-stats-of-subplans-in-the-r.patch
- v2-0004-implement-hash-stats-for-bitmapHeapScan.patch
- v2-0005-Refactor-for-consistency-symmetry.patch
- v2-0006-TupleHashTable.entrysize-was-unused-except-for-in.patch
- v2-0007-Update-comment-obsolete-since-69c3936a.patch
Updated:

. remove from explain analyze those tests which would display sort
  Memory/Disk.  Oops.
. fix issue with the first patch showing zero "tuples" memory for some
  grouping sets.
. reclassify memory as "tuples" if it has to do with "members".  So the
  hashtable size is now redundant with nbuckets (if you know
  sizeof(TupleHashEntryData)).

-- Justin
Attachment
- v3-0001-Run-some-existing-tests-with-explain-ANALYZE.patch
- v3-0002-explain-to-show-tuplehash-bucket-and-memory-stats.patch
- v3-0003-Gross-hack-to-put-hash-stats-of-subplans-in-the-r.patch
- v3-0004-implement-hash-stats-for-bitmapHeapScan.patch
- v3-0005-Refactor-for-consistency-symmetry.patch
- v3-0006-TupleHashTable.entrysize-was-unused-except-for-in.patch
- v3-0007-Update-comment-obsolete-since-69c3936a.patch
On Sun, Feb 16, 2020 at 11:53:07AM -0600, Justin Pryzby wrote:
> Updated:
>
> . remove from explain analyze those tests which would display sort
>   Memory/Disk.  Oops.

. Rebased on top of 5b618e1f48aecc66e3a9f60289491da520faae19.
. Updated to avoid sort's Disk output, for real this time.
. And fixed a syntax error in an intermediate commit.

-- Justin
Attachment
- v4-0001-Run-some-existing-tests-with-explain-ANALYZE.patch
- v4-0002-explain-to-show-tuplehash-bucket-and-memory-stats.patch
- v4-0003-Gross-hack-to-put-hash-stats-of-subplans-in-the-r.patch
- v4-0004-implement-hash-stats-for-bitmapHeapScan.patch
- v4-0005-Refactor-for-consistency-symmetry.patch
- v4-0006-TupleHashTable.entrysize-was-unused-except-for-in.patch
- v4-0007-Update-comment-obsolete-since-69c3936a.patch
Hi,

I've started looking at this patch, because I've been long missing the
information about the hashagg hash table, so I'm pleased someone is working
on this.  In general I agree it may be useful to add similar information to
other nodes using a hashtable, but IMHO the hashagg bit is the most useful,
so maybe we should focus on it first.

A couple of comments after an initial review of the patches:

1) explain.c API

The functions in explain.c (even the static ones) follow the convention that
the last parameter is (ExplainState *es).  I think we should stick to this,
so the new parameters should be added before it.  For example, instead of

    static void ExplainNode(PlanState *planstate, List *ancestors,
                            const char *relationship, const char *plan_name,
                            ExplainState *es, SubPlanState *subplanstate);

we should do e.g.

    static void ExplainNode(PlanState *planstate, SubPlanState *subplanstate,
                            List *ancestors,
                            const char *relationship, const char *plan_name,
                            ExplainState *es);

Also, the first parameter of show_grouping_sets should be renamed to aggstate
to make it consistent with the type change.

2) The hash_instrumentation struct is a bit inconsistent with what we already
have.  We have Instrumentation, JitInstrumentation, WorkerInstrumentation and
HashInstrumentation, so I suggest we follow the same pattern and call this
HashTableInstrumentation or something like that.

3) Almost all executor nodes that are modified to include this new
instrumentation struct also include TupleHashTable, and the data are
essentially about the hash table.  So my question is why not include this in
TupleHashTable - that would mean we don't need to modify any executor nodes,
and it'd probably simplify the code in explain.c too, because we could simply
pass the hashtable.  It'd also simplify e.g. SubPlanState, where we have to
add two new fields and make sure to update the right one.

4) The one exception to (3) is BitmapHeapScanState, which includes TIDBitmap
and not TupleHashTable.  And then we have tbm_instrumentation, which "fakes"
the data based on the pagetable.  Maybe this is a sign that TIDBitmap needs a
slightly different struct?  Also, I'm not sure why we actually need
tbm_instrumentation()?  It just copies the instrumentation data from TIDBitmap
into the node level, but why couldn't we just look at the instrumentation data
in TIDBitmap directly?

5) I think the explain output for grouping sets needs a rethink.  The function
show_grouping_set_keys was originally meant to print just the keys, but now
it's also printing the hash table stats.  IMO we need a new function printing
grouping set info - calling show_grouping_set_keys to print the keys, but then
also printing the extra hashtable info.

6) subplan explain

I probably agree the hashtable info should be included in the subplan, not in
the parent node.  Otherwise it's confusing, particularly when the node has
multiple subplans.
The one thing I find a bit strange is this:

    explain (analyze, timing off, summary off, costs off)
    select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
                       QUERY PLAN
    ----------------------------------------------
     Result (actual rows=1 loops=1)
       SubPlan 1
       Buckets: 4 (originally 2)
       Null hashtable: Buckets: 2
       ->  Append (actual rows=2 loops=1)
             ->  Result (actual rows=1 loops=1)
             ->  Result (actual rows=1 loops=1)
    (7 rows)

That is, there's no indication of why this would use a hash table, because the
"hashed subplan" is included only in verbose mode:

    explain (analyze, verbose, timing off, summary off, costs off)
    select 'foo'::text in (select 'bar'::name union all select 'bar'::name);
                                       QUERY PLAN
    ----------------------------------------------------------------------------------
     Result (actual rows=1 loops=1)
       Output: (hashed SubPlan 1)
       SubPlan 1
       Buckets: 4 (originally 2)  Memory Usage Hash: 1kB  Memory Usage Tuples: 1kB
       Null hashtable: Buckets: 2  Memory Usage Hash: 1kB  Memory Usage Tuples: 0kB
       ->  Append (actual rows=2 loops=1)
             ->  Result (actual rows=1 loops=1)
                   Output: 'bar'::name
             ->  Result (actual rows=1 loops=1)
                   Output: 'bar'::name
    (10 rows)

Not sure if this is an issue, maybe it's fine.  But it's definitely strange
that we only print memory info in verbose mode - IMHO it's much more useful
info than the number of buckets etc.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
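[Editor's note: to make point (3) of the review above concrete, embedding the
stats in the hashtable itself would look roughly like the sketch below.  The
struct name follows the naming suggested in (2); the fields are illustrative
assumptions rather than text from the posted patches.]

    /* sketch: in execnodes.h, so every TupleHashTable user gets stats for free */
    typedef struct HashTableInstrumentation
    {
        size_t      nbuckets;           /* current number of buckets */
        size_t      nbuckets_original;  /* initial number of buckets */
        size_t      space_peak_hash;    /* peak memory for the bucket array */
        size_t      space_peak_tuples;  /* peak memory for stored entries */
    } HashTableInstrumentation;

    typedef struct TupleHashTableData
    {
        tuplehash_hash *hashtab;        /* underlying simplehash table */
        /* ... existing members (key columns, contexts, slots, ...) ... */
        HashTableInstrumentation instrument;    /* filled by execGrouping.c */
    }           TupleHashTableData;

With that, explain.c would only need the TupleHashTable (or a pointer to its
instrument field) for any node, instead of each node carrying its own copy.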
On Sat, Feb 22, 2020 at 10:53:35PM +0100, Tomas Vondra wrote:
> I've started looking at this patch, because I've been long missing the

Thanks for looking.  I have brief, initial comments before I revisit the
patch.

> 3) Almost all executor nodes that are modified to include this new
> instrumentation struct also include TupleHashTable, and the data are
> essentially about the hash table.  So my question is why not include this
> in TupleHashTable - that would mean we don't need to modify any executor
> nodes, and it'd probably simplify the code in explain.c too, because we
> could simply pass the hashtable.

I considered this.  From the 0004 commit message:

| Also, if instrumentation were implemented in simplehash.h, I think every
| insertion or deletion would need to check ->members and ->size (which isn't
| necessary for Agg, but is necessary in the general case, and specifically for
| tidbitmap, since it actually DELETEs hashtable entries).  Or else simplehash
| would need a new function like UpdateTupleHashStats, which the higher level
| nodes would need to call after filling the hashtable or before deleting
| tuples, which seems to defeat the purpose of implementing stats at a lower
| layer.

> 4) The one exception to (3) is BitmapHeapScanState, which includes
> TIDBitmap and not TupleHashTable.  And then we have tbm_instrumentation,
> which "fakes" the data based on the pagetable.  Maybe this is a sign that
> TIDBitmap needs a slightly different struct?

Hm, I'd say that it "collects" the data that's not immediately present,
rather than faking it.  But maybe I did it poorly.  Also, maybe TIDBitmap
shouldn't be included in the patch at all.

> Also, I'm not sure why we actually need tbm_instrumentation()?  It just
> copies the instrumentation data from TIDBitmap into the node level, but why
> couldn't we just look at the instrumentation data in TIDBitmap directly?

See the 0004 commit message:

| TIDBitmap is a private structure, so add an accessor function to return its
| instrumentation, and duplicate the instrumentation struct in BitmapHeapState.

Also, I don't know what anyone else thinks, but I think 0005 is a throwaway
commit.  It's implemented more nicely in execGrouping.c.

> But it's definitely strange that we only print memory info in verbose mode -
> IMHO it's much more useful info than the number of buckets etc.

Because I wanted to be able to put "explain analyze" into regression tests
(which can show: "Buckets: 4 (originally 2)").  But I cannot get stable
output for any plan which uses Sort, without hacks like explain_sq_limit and
explain_parallel_sort_stats.

Actually, I wish there were a way to control Sort nodes' Memory/Disk output,
too.  I'm sure most regression tests were meant to be run without analyze,
but it'd be much better if "explain analyze" were reasonably easy to use in
the general case, which might include a Sort.  If someone seconds that, I
will start a separate thread.

-- 
Justin Pryzby
On Sat, Feb 22, 2020 at 10:53:35PM +0100, Tomas Vondra wrote:
> 1) explain.c API
>
> The functions in explain.c (even the static ones) follow the convention
> that the last parameter is (ExplainState *es).  I think we should stick
> to this, so the new parameters should be added before it.

I found it weird to have the "constant" arguments at the end rather than at
the beginning.  (Also, these don't follow that convention: show_buffer_usage,
ExplainSaveGroup, ExplainRestoreGroup, ExplainOneQuery, ExplainPrintJIT.)
But done.

> Also, the first parameter of show_grouping_sets should be renamed to
> aggstate to make it consistent with the type change.

The prototype wasn't updated - fixed.

> 2) The hash_instrumentation struct is a bit inconsistent with what we
> already have ..HashTableInstrumentation..

Thanks for thinking of a better name.

> 5) I think the explain output for grouping sets needs a rethink.  The
> function show_grouping_set_keys was originally meant to print just the
> keys, but now it's also printing the hash table stats.  IMO we need a new
> function printing grouping set info - calling show_grouping_set_keys to
> print the keys, but then also printing the extra hashtable info.

I renamed it, and did the rest in a separate patch for now, since I'm only
partially convinced it's an improvement.

> 6) subplan explain
>
> That is, there's no indication of why this would use a hash table, because
> the "hashed subplan" is included only in verbose mode:

Need to think about that.

> Not sure if this is an issue, maybe it's fine.  But it's definitely
> strange that we only print memory info in verbose mode - IMHO it's much
> more useful info than the number of buckets etc.

You're right that verbose isn't right for this.  I wrote patches creating new
explain options to allow stable output of "explain analyze", by avoiding
Memory/Disk.  The only other way to handle it seems to be to avoid "explain
analyze" in regression tests, which is common practice anyway, so I did that
instead.

I also fixed wrong output and wrong non-text formatting for grouping sets,
tweaked the output for subplans, and broke style rules less often.

-- Justin
Attachment
- v5-0001-explain-to-show-tuplehash-bucket-and-memory-stats.patch
- v5-0002-refactor-show_grouping_set_keys.patch
- v5-0003-Gross-hack-to-put-hash-stats-of-subplans-in-the-r.patch
- v5-0004-implement-hash-stats-for-bitmapHeapScan.patch
- v5-0005-Refactor-for-consistency-symmetry.patch
- v5-0006-TupleHashTable.entrysize-was-unused-except-for-in.patch
- v5-0007-Update-comment-obsolete-since-69c3936a.patch
Updated for new tests in 58c47ccfff20b8c125903482725c1dbfd30beade and rebased.
Attachment
- v6-0001-explain-to-show-tuplehash-bucket-and-memory-stats.patch
- v6-0002-refactor-show_grouping_set_keys.patch
- v6-0003-Gross-hack-to-put-hash-stats-of-subplans-in-the-r.patch
- v6-0004-implement-hash-stats-for-bitmapHeapScan.patch
- v6-0005-Refactor-for-consistency-symmetry.patch
- v6-0006-TupleHashTable.entrysize-was-unused-except-for-in.patch
- v6-0007-Update-comment-obsolete-since-69c3936a.patch
Hi,

On 2020-03-01 09:45:40 -0600, Justin Pryzby wrote:
> +/*
> + * Show hash bucket stats and (optionally) memory.
> + */
> +static void
> +show_tuplehash_info(HashTableInstrumentation *inst, ExplainState *es)
> +{
> +	long	spacePeakKb_tuples = (inst->space_peak_tuples + 1023) / 1024,
> +			spacePeakKb_hash = (inst->space_peak_hash + 1023) / 1024;

Let's not add further uses of long.  It's terrible because it has different
widths on 64bit windows and everything else.  Specify the widths explicitly,
or use something like size_t.

> +	if (es->format != EXPLAIN_FORMAT_TEXT)
> +	{
> +		ExplainPropertyInteger("Hash Buckets", NULL,
> +							   inst->nbuckets, es);
> +		ExplainPropertyInteger("Original Hash Buckets", NULL,
> +							   inst->nbuckets_original, es);
> +		ExplainPropertyInteger("Peak Memory Usage (hashtable)", "kB",
> +							   spacePeakKb_hash, es);
> +		ExplainPropertyInteger("Peak Memory Usage (tuples)", "kB",
> +							   spacePeakKb_tuples, es);

And then you're passing the long to ExplainPropertyInteger, which accepts an
int64, making the use of long above particularly suspicious.

I wonder if it would make sense to add an ExplainPropertyBytes() that would
do the rounding etc. automatically.  It could then also switch units as
appropriate.

> +	}
> +	else if (!inst->nbuckets)
> +		; /* Do nothing */
> +	else
> +	{
> +		if (inst->nbuckets_original != inst->nbuckets)
> +		{
> +			ExplainIndentText(es);
> +			appendStringInfo(es->str,
> +						"Buckets: %ld (originally %ld)",
> +						inst->nbuckets,
> +						inst->nbuckets_original);
> +		}
> +		else
> +		{
> +			ExplainIndentText(es);
> +			appendStringInfo(es->str,
> +						"Buckets: %ld",
> +						inst->nbuckets);
> +		}
> +
> +		if (es->analyze)
> +			appendStringInfo(es->str,
> +					" Memory Usage: hashtable: %ldkB, tuples: %ldkB",
> +					spacePeakKb_hash, spacePeakKb_tuples);
> +		appendStringInfoChar(es->str, '\n');

I'm not sure I like the alternative output formats here.  All the other
fields are separated with a comma, but the original size is in parens.  I'd
probably just format it as "Buckets: %lld" and then add
", Original Buckets: %lld" when differing.

Also, %ld is problematic, because it's only 32bit wide on some platforms
(including 64bit windows), but 64bit on others.  The easiest way to deal with
that is to use %lld and cast the argument to long long - yes, that's a sad
workaround.

> +/* Update instrumentation stats */
> +void
> +UpdateTupleHashTableStats(TupleHashTable hashtable, bool initial)
> +{
> +	hashtable->instrument.nbuckets = hashtable->hashtab->size;
> +	if (initial)
> +	{
> +		hashtable->instrument.nbuckets_original = hashtable->hashtab->size;
> +		hashtable->instrument.space_peak_hash = hashtable->hashtab->size *
> +			sizeof(TupleHashEntryData);
> +		hashtable->instrument.space_peak_tuples = 0;
> +	}
> +	else
> +	{
> +#define maxself(a,b) a=Max(a,b)
> +		/* hashtable->entrysize includes additionalsize */
> +		maxself(hashtable->instrument.space_peak_hash,
> +				hashtable->hashtab->size * sizeof(TupleHashEntryData));
> +		maxself(hashtable->instrument.space_peak_tuples,
> +				hashtable->hashtab->members * hashtable->entrysize);
> +#undef maxself
> +	}
> +}

Not a fan of this macro.

I'm also not sure I understand what you're trying to do here?
> diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
> index f457b5b150..b173b32cab 100644
> --- a/src/test/regress/expected/aggregates.out
> +++ b/src/test/regress/expected/aggregates.out
> @@ -517,10 +517,11 @@ order by 1, 2;
>            ->  HashAggregate
>                  Output: s2.s2, sum((s1.s1 + s2.s2))
>                  Group Key: s2.s2
> +                Buckets: 4
>                  ->  Function Scan on pg_catalog.generate_series s2
>                        Output: s2.s2
>                        Function Call: generate_series(1, 3)
> -(14 rows)
> +(15 rows)

These tests probably won't be portable.  The number of hash buckets
calculated will e.g. depend on the size of the contained elements.  And
that'll e.g. depend on whether pointers are 4 or 8 bytes.

Greetings,

Andres Freund
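[Editor's note: the ExplainPropertyBytes() mentioned above is only an idea at
this point; as a rough, hypothetical illustration it could be as small as the
following, reusing the existing ExplainPropertyInteger machinery.]

    /*
     * Hypothetical helper: report a byte count, rounding up to kB, so callers
     * don't repeat the (bytes + 1023) / 1024 dance.  Switching to larger units
     * for big values, as suggested, could be layered on later.
     */
    static void
    ExplainPropertyBytes(const char *qlabel, uint64 bytes, ExplainState *es)
    {
        int64       kilobytes = (bytes + 1023) / 1024;

        ExplainPropertyInteger(qlabel, "kB", kilobytes, es);
    }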
On Fri, Mar 06, 2020 at 09:58:59AM -0800, Andres Freund wrote:
> ...
>> +	}
>> +	else if (!inst->nbuckets)
>> +		; /* Do nothing */
>> +	else
>> +	{
>> +		if (inst->nbuckets_original != inst->nbuckets)
>> +		{
>> +			ExplainIndentText(es);
>> +			appendStringInfo(es->str,
>> +						"Buckets: %ld (originally %ld)",
>> +						inst->nbuckets,
>> +						inst->nbuckets_original);
>> +		}
>> +		else
>> +		{
>> +			ExplainIndentText(es);
>> +			appendStringInfo(es->str,
>> +						"Buckets: %ld",
>> +						inst->nbuckets);
>> +		}
>> +
>> +		if (es->analyze)
>> +			appendStringInfo(es->str,
>> +					" Memory Usage: hashtable: %ldkB, tuples: %ldkB",
>> +					spacePeakKb_hash, spacePeakKb_tuples);
>> +		appendStringInfoChar(es->str, '\n');
>
> I'm not sure I like the alternative output formats here.  All the other
> fields are separated with a comma, but the original size is in parens.  I'd
> probably just format it as "Buckets: %lld" and then add
> ", Original Buckets: %lld" when differing.

FWIW this copies the hashjoin precedent, which does this:

    appendStringInfo(es->str,
                     "Buckets: %d (originally %d)  Batches: %d (originally %d)  Memory Usage: %ldkB\n",
                     hinstrument.nbuckets,
                     ...

I agree it's not ideal, but maybe let's not invent new ways to format the
same type of info.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Mar 06, 2020 at 09:58:59AM -0800, Andres Freund wrote:
> > +			ExplainIndentText(es);
> > +			appendStringInfo(es->str,
> > +						"Buckets: %ld (originally %ld)",
> > +						inst->nbuckets,
> > +						inst->nbuckets_original);
>
> I'm not sure I like the alternative output formats here.  All the other
> fields are separated with a comma, but the original size is in parens.  I'd
> probably just format it as "Buckets: %lld" and then add
> ", Original Buckets: %lld" when differing.

It's done that way for consistency with hash join in show_hash_info().

> > +/* Update instrumentation stats */
> > +void
> > +UpdateTupleHashTableStats(TupleHashTable hashtable, bool initial)
> > +{
> > +	hashtable->instrument.nbuckets = hashtable->hashtab->size;
> > +	if (initial)
> > +	{
> > +		hashtable->instrument.nbuckets_original = hashtable->hashtab->size;
> > +		hashtable->instrument.space_peak_hash = hashtable->hashtab->size *
> > +			sizeof(TupleHashEntryData);
> > +		hashtable->instrument.space_peak_tuples = 0;
> > +	}
> > +	else
> > +	{
> > +#define maxself(a,b) a=Max(a,b)
> > +		/* hashtable->entrysize includes additionalsize */
> > +		maxself(hashtable->instrument.space_peak_hash,
> > +				hashtable->hashtab->size * sizeof(TupleHashEntryData));
> > +		maxself(hashtable->instrument.space_peak_tuples,
> > +				hashtable->hashtab->members * hashtable->entrysize);
> > +#undef maxself
> > +	}
> > +}
>
> Not a fan of this macro.
>
> I'm also not sure I understand what you're trying to do here?

I have to call UpdateTupleHashTableStats() from the callers at deliberate
locations.  If the caller fills the hashtable all at once, I can populate the
stats immediately after that; but if it's populated incrementally, then I
need to update the stats right before it's destroyed or reset, otherwise we
would only show the tuple size of the hashtable since its most recent reset,
rather than that of a larger, previous incarnation.

> > diff --git a/src/test/regress/expected/aggregates.out b/src/test/regress/expected/aggregates.out
> > index f457b5b150..b173b32cab 100644
> > --- a/src/test/regress/expected/aggregates.out
> > +++ b/src/test/regress/expected/aggregates.out
> > @@ -517,10 +517,11 @@ order by 1, 2;
> >            ->  HashAggregate
> >                  Output: s2.s2, sum((s1.s1 + s2.s2))
> >                  Group Key: s2.s2
> > +                Buckets: 4
> >                  ->  Function Scan on pg_catalog.generate_series s2
> >                        Output: s2.s2
> >                        Function Call: generate_series(1, 3)
> > -(14 rows)
> > +(15 rows)
>
> These tests probably won't be portable.  The number of hash buckets
> calculated will e.g. depend on the size of the contained elements.  And
> that'll e.g. depend on whether pointers are 4 or 8 bytes.

I was aware and afraid of that.  Previously, I added this output only to
"explain analyze", and (as a quick, interim implementation) changed various
tests to use analyze, with memory only shown in "verbose" mode.  But as Tomas
pointed out, that's consistent with what's done elsewhere.

So is the solution to show stats only during explain ANALYZE?

Or ... I have a patch to create a new explain(MACHINE) option to allow more
stable output, by avoiding Memory/Disk.  That doesn't attempt to make all
"explain analyze" output stable - there are other issues, I think mostly
related to parallel workers (see 4ea03f3f, 13e8b2ee).  But it does allow
retiring explain_sq_limit and explain_parallel_sort_stats.  I'm including my
patch to show what I mean, but I didn't enable it for the hashtable
"Buckets:" line.  I guess in either case, the tests shouldn't be included.

-- Justin
Attachment
- v7-0001-explain-to-show-tuplehash-bucket-and-memory-stats.patch
- v7-0002-refactor-show_grouping_set_keys.patch
- v7-0003-Gross-hack-to-put-hash-stats-of-subplans-in-the-r.patch
- v7-0004-implement-hash-stats-for-bitmapHeapScan.patch
- v7-0005-Refactor-for-consistency-symmetry.patch
- v7-0006-TupleHashTable.entrysize-was-unused-except-for-in.patch
- v7-0007-Update-comment-obsolete-since-69c3936a.patch
- v7-0008-Add-explain-MACHINE.patch
- v7-0009-Add-explain-REGRESS.patch
Hi,

On 2020-03-06 15:33:10 -0600, Justin Pryzby wrote:
> On Fri, Mar 06, 2020 at 09:58:59AM -0800, Andres Freund wrote:
> > > +			ExplainIndentText(es);
> > > +			appendStringInfo(es->str,
> > > +						"Buckets: %ld (originally %ld)",
> > > +						inst->nbuckets,
> > > +						inst->nbuckets_original);
> >
> > I'm not sure I like the alternative output formats here.  All the other
> > fields are separated with a comma, but the original size is in parens.  I'd
> > probably just format it as "Buckets: %lld" and then add
> > ", Original Buckets: %lld" when differing.
>
> It's done that way for consistency with hash join in show_hash_info().

Fair.  I don't like it, but it's not this patch's fault.

> > These tests probably won't be portable.  The number of hash buckets
> > calculated will e.g. depend on the size of the contained elements.  And
> > that'll e.g. depend on whether pointers are 4 or 8 bytes.
>
> I was aware and afraid of that.  Previously, I added this output only to
> "explain analyze", and (as a quick, interim implementation) changed various
> tests to use analyze, with memory only shown in "verbose" mode.  But as Tomas
> pointed out, that's consistent with what's done elsewhere.
>
> So is the solution to show stats only during explain ANALYZE?
>
> Or ... I have a patch to create a new explain(MACHINE) option to allow more
> stable output, by avoiding Memory/Disk.  That doesn't attempt to make all
> "explain analyze" output stable - there are other issues, I think mostly
> related to parallel workers (see 4ea03f3f, 13e8b2ee).  But it does allow
> retiring explain_sq_limit and explain_parallel_sort_stats.  I'm including my
> patch to show what I mean, but I didn't enable it for the hashtable
> "Buckets:" line.  I guess in either case, the tests shouldn't be included.

Yea, there's been recent discussion about an argument like that.  See e.g.
https://www.postgresql.org/message-id/18494.1580079189%40sss.pgh.pa.us

Greetings,

Andres Freund
+		/* hashtable->entrysize includes additionalsize */
+		hashtable->instrument.space_peak_hash = Max(
+			hashtable->instrument.space_peak_hash,
+			hashtable->hashtab->size * sizeof(TupleHashEntryData));
+
+		hashtable->instrument.space_peak_tuples = Max(
+			hashtable->instrument.space_peak_tuples,
+			hashtable->hashtab->members * hashtable->entrysize);

I think, in general, we should avoid estimates/projections for reporting and
try to get at a real number, like MemoryContextMemAllocated().  (Aside: I may
want to tweak exactly what that function reports so that it doesn't count the
unused portion of the last block.)  For instance, the report is still not
accurate, because it doesn't account for pass-by-ref transition state values.

To use memory-context-based reporting, it's hard to make the stats a part of
the tuple hash table, because the tuple hash table doesn't own the memory
contexts (they are passed in).  It's also hard to make it per-hashtable (e.g.
for grouping sets), unless we put each grouping set in its own memory context.

Also, is there a reason you report two different memory values (hashtable and
tuples)?  I don't object, but it seems like a little too much detail.

Regards,

Jeff Davis
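[Editor's note: a rough sketch of the MemoryContextMemAllocated()-based
approach suggested above, assuming the caller knows which context holds the
entries (which, as noted, is exactly the awkward part).  The helper name and
the instrument fields are illustrative, not from the posted patches.]

    /*
     * Sketch: measure what the tuple context has actually allocated instead
     * of projecting members * entrysize.  Pass-by-ref transition values
     * allocated in the same context would then be counted as well.
     */
    static void
    hash_agg_update_space_peak(TupleHashTable hashtable)
    {
        int64       allocated = MemoryContextMemAllocated(hashtable->tablecxt, true);

        hashtable->instrument.space_peak_tuples =
            Max(hashtable->instrument.space_peak_tuples, allocated);
    }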
Hi,

On 2020-03-13 10:15:46 -0700, Jeff Davis wrote:
> Also, is there a reason you report two different memory values
> (hashtable and tuples)?  I don't object, but it seems like a little too
> much detail.

Seems useful to me - the hashtable is pre-allocated based on estimates,
whereas the tuples are allocated "on demand".  So seeing the difference will
allow investigating the more crucial issue...

Greetings,

Andres Freund
On Fri, 2020-03-13 at 10:27 -0700, Andres Freund wrote:
> On 2020-03-13 10:15:46 -0700, Jeff Davis wrote:
> > Also, is there a reason you report two different memory values
> > (hashtable and tuples)?  I don't object, but it seems like a little
> > too much detail.
>
> Seems useful to me - the hashtable is pre-allocated based on estimates,
> whereas the tuples are allocated "on demand".  So seeing the difference
> will allow investigating the more crucial issue...

Then do we also want to report separately on the by-ref transition values?
That could be useful if you are using ARRAY_AGG and the states grow larger
than you might expect.

Regards,

Jeff Davis
Hi,

On 2020-03-13 10:53:17 -0700, Jeff Davis wrote:
> On Fri, 2020-03-13 at 10:27 -0700, Andres Freund wrote:
> > On 2020-03-13 10:15:46 -0700, Jeff Davis wrote:
> > > Also, is there a reason you report two different memory values
> > > (hashtable and tuples)?  I don't object, but it seems like a little
> > > too much detail.
> >
> > Seems useful to me - the hashtable is pre-allocated based on estimates,
> > whereas the tuples are allocated "on demand".  So seeing the difference
> > will allow investigating the more crucial issue...
>
> Then do we also want to report separately on the by-ref transition
> values?  That could be useful if you are using ARRAY_AGG and the states
> grow larger than you might expect.

I can see that being valuable - I've had to debug cases with too much memory
being used due to aggregate transitions before.  Right now it'd be mixed in
with the tuples, I believe - and we'd need a separate context for tracking
the transition values?  Due to that I'm inclined not to report it separately
for now.

Greetings,

Andres Freund
On Fri, Mar 13, 2020 at 10:57:43AM -0700, Andres Freund wrote:
> On 2020-03-13 10:53:17 -0700, Jeff Davis wrote:
> > Then do we also want to report separately on the by-ref transition
> > values?  That could be useful if you are using ARRAY_AGG and the states
> > grow larger than you might expect.
>
> I can see that being valuable - I've had to debug cases with too much
> memory being used due to aggregate transitions before.  Right now it'd be
> mixed in with the tuples, I believe - and we'd need a separate context for
> tracking the transition values?  Due to that I'm inclined not to report it
> separately for now.

I think that's already in a separate context indexed by grouping set:

    src/include/nodes/execnodes.h:	ExprContext **aggcontexts;	/* econtexts for long-lived data (per GS) */

But the hashtable and tuples are combined.  I put them in separate contexts
and rebased on top of 1f39bce021540fde00990af55b4432c55ef4b3c7.  But I didn't
do anything yet with the aggcontexts.

Now I can get output like:

|template1=# explain analyze SELECT i,COUNT(1) FROM t GROUP BY 1;
| HashAggregate  (cost=4769.99..6769.98 rows=199999 width=12) (actual time=266.465..27020.333 rows=199999 loops=1)
|   Group Key: i
|   Buckets: 524288 (originally 262144)
|   Peak Memory Usage: hashtable: 12297kB, tuples: 24576kB
|   Disk Usage: 192 kB
|   HashAgg Batches: 3874
|   ->  Seq Scan on t  (cost=0.00..3769.99 rows=199999 width=4) (actual time=13.043..64.017 rows=199999 loops=1)

It looks somewhat funny next to hash join, which puts everything on one line:

|template1=# explain analyze SELECT i,COUNT(1) FROM t a JOIN t b USING(i) GROUP BY 1;
| HashAggregate  (cost=13789.95..15789.94 rows=199999 width=12) (actual time=657.733..27129.873 rows=199999 loops=1)
|   Group Key: a.i
|   Buckets: 524288 (originally 262144)
|   Peak Memory Usage: hashtable: 12297kB, tuples: 24576kB
|   Disk Usage: 192 kB
|   HashAgg Batches: 3874
|   ->  Hash Join  (cost=6269.98..12789.95 rows=199999 width=4) (actual time=135.932..426.071 rows=199999 loops=1)
|         Hash Cond: (a.i = b.i)
|         ->  Seq Scan on t a  (cost=0.00..3769.99 rows=199999 width=4) (actual time=3.265..47.598 rows=199999 loops=1)
|         ->  Hash  (cost=3769.99..3769.99 rows=199999 width=4) (actual time=131.881..131.882 rows=199999 loops=1)
|               Buckets: 262144  Batches: 1  Memory Usage: 9080kB
|               ->  Seq Scan on t b  (cost=0.00..3769.99 rows=199999 width=4) (actual time=3.273..40.163 rows=199999 loops=1)

-- Justin
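[Editor's note: a minimal sketch of the "separate contexts per hashtable"
arrangement described above, roughly the shape of the attached 0001 patch.
The context and field names here are assumptions, not necessarily what the
patch uses.]

    /* one pair of contexts per hashtable, so the two figures can be measured separately */
    perhash->hash_metacxt = AllocSetContextCreate(aggstate->ss.ps.state->es_query_cxt,
                                                  "HashAgg meta context",
                                                  ALLOCSET_DEFAULT_SIZES);
    perhash->hash_tablecxt = AllocSetContextCreate(aggstate->ss.ps.state->es_query_cxt,
                                                   "HashAgg table context",
                                                   ALLOCSET_DEFAULT_SIZES);

    /* ... passed to BuildTupleHashTableExt() as metacxt and tablecxt ... */

    /* at EXPLAIN time the two can then be reported independently */
    int64       space_hash = MemoryContextMemAllocated(perhash->hash_metacxt, true);
    int64       space_tuples = MemoryContextMemAllocated(perhash->hash_tablecxt, true);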
Attachment
- v8-0001-nodeAgg-separate-context-for-each-hashtable.patch
- v8-0002-explain-to-show-tuplehash-bucket-and-memory-stats.patch
- v8-0003-refactor-show_grouping_set_keys.patch
- v8-0004-Gross-hack-to-put-hash-stats-of-subplans-in-the-r.patch
- v8-0005-implement-hash-stats-for-bitmapHeapScan.patch
- v8-0006-Refactor-for-consistency-symmetry.patch
- v8-0007-TupleHashTable.entrysize-was-unused-except-for-in.patch
- v8-0008-Update-comment-obsolete-since-69c3936a.patch
On Fri, Mar 20, 2020 at 03:44:42AM -0500, Justin Pryzby wrote:
> On Fri, Mar 13, 2020 at 10:57:43AM -0700, Andres Freund wrote:
> > On 2020-03-13 10:53:17 -0700, Jeff Davis wrote:
> > > Then do we also want to report separately on the by-ref transition
> > > values?  That could be useful if you are using ARRAY_AGG and the states
> > > grow larger than you might expect.
> >
> > I can see that being valuable - I've had to debug cases with too much
> > memory being used due to aggregate transitions before.  Right now it'd be
> > mixed in with the tuples, I believe - and we'd need a separate context for
> > tracking the transition values?  Due to that I'm inclined not to report it
> > separately for now.
>
> I think that's already in a separate context indexed by grouping set:
>     src/include/nodes/execnodes.h:	ExprContext **aggcontexts;	/* econtexts for long-lived data (per GS) */
>
> But the hashtable and tuples are combined.  I put them in separate contexts
> and rebased on top of 1f39bce021540fde00990af55b4432c55ef4b3c7.

I forgot to say that I'd also switched to using memory-context-based
accounting.

90% of the initial goal of this patch was handled by the instrumentation
added by "hash spill to disk" (1f39bce02), but this *also* adds:

 - separate accounting for tuples vs hashtable;
 - number of hash buckets;
 - handling of other agg nodes, and bitmap scan.

Should I continue pursuing this patch?
Does it still serve any significant purpose?

template1=# explain (analyze, costs off, summary off) SELECT a, COUNT(1) FROM generate_series(1,999999) a GROUP BY 1;
 HashAggregate (actual time=1070.713..2287.011 rows=999999 loops=1)
   Group Key: a
   Buckets: 32768 (originally 512)
   Peak Memory Usage: hashtable: 777kB, tuples: 4096kB
   Disk Usage: 22888 kB
   HashAgg Batches: 84
   ->  Function Scan on generate_series a (actual time=238.270..519.832 rows=999999 loops=1)

template1=# explain analyze SELECT * FROM t WHERE a BETWEEN 999 AND 99999;
 Bitmap Heap Scan on t  (cost=4213.01..8066.67 rows=197911 width=4) (actual time=26.803..84.693 rows=198002 loops=1)
   Recheck Cond: ((a >= 999) AND (a <= 99999))
   Heap Blocks: exact=878
   Buckets: 1024 (originally 256)
   Peak Memory Usage: hashtable: 48kB, tuples: 4kB

template1=# explain analyze SELECT generate_series(1,99999) EXCEPT SELECT generate_series(1,999);
 HashSetOp Except  (cost=0.00..2272.49 rows=99999 width=8) (actual time=135.986..174.656 rows=99000 loops=1)
   Buckets: 262144 (originally 131072)
   Peak Memory Usage: hashtable: 6177kB, tuples: 8192kB

@cfbot: rebased
Attachment
- v9-0001-nodeAgg-separate-context-for-each-hashtable.patch
- v9-0002-explain-to-show-tuplehash-bucket-and-memory-stats.patch
- v9-0003-refactor-show_grouping_set_keys.patch
- v9-0004-Gross-hack-to-put-hash-stats-of-subplans-in-the-r.patch
- v9-0005-implement-hash-stats-for-bitmapHeapScan.patch
- v9-0006-Refactor-for-consistency-symmetry.patch
- v9-0007-TupleHashTable.entrysize-was-unused-except-for-in.patch
- v9-0008-Update-comment-obsolete-since-69c3936a.patch
On Wed, 2020-04-08 at 16:00 -0500, Justin Pryzby wrote:
> 90% of the initial goal of this patch was handled by the instrumentation
> added by "hash spill to disk" (1f39bce02), but this *also* adds:
>
>  - separate accounting for tuples vs hashtable;
>  - number of hash buckets;
>  - handling of other agg nodes, and bitmap scan.
>
> Should I continue pursuing this patch?
> Does it still serve any significant purpose?

Those things would be useful for me when trying to tune the performance and
the cost model.  I think we need to put some of these things under "VERBOSE",
or maybe invent a new explain option to provide this level of detail, though.

Regards,

Jeff Davis
> On 9 Apr 2020, at 00:24, Jeff Davis <pgsql@j-davis.com> wrote:
>
> On Wed, 2020-04-08 at 16:00 -0500, Justin Pryzby wrote:
>> 90% of the initial goal of this patch was handled by the instrumentation
>> added by "hash spill to disk" (1f39bce02), but this *also* adds:
>>
>>  - separate accounting for tuples vs hashtable;
>>  - number of hash buckets;
>>  - handling of other agg nodes, and bitmap scan.
>>
>> Should I continue pursuing this patch?
>> Does it still serve any significant purpose?
>
> Those things would be useful for me when trying to tune the performance and
> the cost model.  I think we need to put some of these things under
> "VERBOSE", or maybe invent a new explain option to provide this level of
> detail, though.

This thread has stalled and the patch has been Waiting on Author since March,
and skimming the thread there seem to be questions raised over the value
proposition.  Is there progress happening behind the scenes, or should we
close this entry for now, to re-open in case there is renewed
activity/interest?

cheers
./daniel
> On 12 Jul 2020, at 21:52, Daniel Gustafsson <daniel@yesql.se> wrote:
>
> This thread has stalled and the patch has been Waiting on Author since
> March, and skimming the thread there seem to be questions raised over the
> value proposition.  Is there progress happening behind the scenes, or
> should we close this entry for now, to re-open in case there is renewed
> activity/interest?

With not too many days of the commitfest left, I'm closing this in 2020-07.
Please feel free to add a new entry if there is renewed interest in this
patch.

cheers
./daniel