Thread: Disk-based hash aggregate's cost model
We have a Postgres 13 open item for Disk-based hash aggregate, which is
the only non-trivial open item. There is a general concern about how
often we get disk-based hash aggregation when work_mem is particularly
low, and recursion seems unavoidable. This is generally thought to be a
problem in the costing.

Tomas Vondra did some testing of the patch which led to discussion of
this on the hash agg GUC megathread, here:

https://www.postgresql.org/message-id/20200724012248.y77rpqc73agrsvb3@development

I am starting this new thread to deal with the open item. Any progress
on this, Jeff? Please correct or expand on my summary of the open item
if I got something wrong.

Thanks
--
Peter Geoghegan
On Thu, 2020-08-27 at 17:28 -0700, Peter Geoghegan wrote:
> We have a Postgres 13 open item for Disk-based hash aggregate, which
> is the only non-trivial open item. There is a general concern about
> how often we get disk-based hash aggregation when work_mem is
> particularly low, and recursion seems unavoidable. This is generally
> thought to be a problem in the costing.

We discussed two approaches to tweaking the cost model:

1. Penalize HashAgg disk costs by a constant amount. It seems to be
chosen a little too often, and a constant penalty would reduce the
number of plan changes.

2. Treat recursive HashAgg spilling skeptically, and heavily penalize
recursive spilling.

The problem with approach #2 is that we have a default hash mem of 4MB,
and real systems have a lot more than that. In this scenario, recursive
spilling can beat Sort by a lot. For instance:

Data:

  create table text10m(t text collate "C.UTF-8", i int, n numeric);
  insert into text10m
    select s.g::text, s.g, s.g::numeric
    from (select (random()*1000000000)::int as g
          from generate_series(1,10000000)) s;
  vacuum (freeze,analyze) text10m;

Query: explain analyze select distinct t from text10m;

  HashAgg:       10.5s
  Sort+Distinct: 46s

I'm inclined toward option #1 for simplicity unless you feel strongly
about option #2. Specifically, I was thinking of a 1.5X penalty for
HashAgg disk costs.

Regards,
	Jeff Davis
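In cost_agg() terms, option #1 amounts to something like the following
sketch (the variable names are assumptions for illustration, not the
actual patch):

  /*
   * Option #1 sketch: scale only the disk component of a spilling
   * HashAgg's cost by a constant penalty factor. pages_read and
   * pages_written stand in for whatever page estimates cost_agg
   * already computes for the spilling case.
   */
  startup_cost += 1.5 * (pages_read + pages_written) * seq_page_cost;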
On Fri, Aug 28, 2020 at 06:32:38PM -0700, Jeff Davis wrote:
>On Thu, 2020-08-27 at 17:28 -0700, Peter Geoghegan wrote:
>> We have a Postgres 13 open item for Disk-based hash aggregate, which
>> is the only non-trivial open item. There is a general concern about
>> how often we get disk-based hash aggregation when work_mem is
>> particularly low, and recursion seems unavoidable. This is generally
>> thought to be a problem in the costing.
>
>We discussed two approaches to tweaking the cost model:
>
>1. Penalize HashAgg disk costs by a constant amount. It seems to be
>chosen a little too often, and a constant penalty would reduce the
>number of plan changes.
>
>2. Treat recursive HashAgg spilling skeptically, and heavily penalize
>recursive spilling.
>
>The problem with approach #2 is that we have a default hash mem of 4MB,
>and real systems have a lot more than that. In this scenario, recursive
>spilling can beat Sort by a lot.
>

I think the main issue is that we're mostly speculating about what's
wrong. I've shared some measurements and symptoms, and we've discussed
what might be causing it, but I don't think we really know for sure.

I really dislike (1) because it seems like a "we don't know what's
wrong, so we'll penalize hashagg" kind of solution. A much more
principled solution would be to tweak the costing accordingly, not just
add some constant. For (2), it really depends on whether recursive
spilling is actually the problem here. In the examples I shared, the
number of partitions/batches was very different, but the query duration
was mostly independent of it (almost constant).

FWIW I still haven't seen any explanation of why the current code
spills more data than the CP_SMALL_TLIST patch (which was reverted).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Aug 30, 2020 at 02:26:20AM +0200, Tomas Vondra wrote:
>On Fri, Aug 28, 2020 at 06:32:38PM -0700, Jeff Davis wrote:
>>On Thu, 2020-08-27 at 17:28 -0700, Peter Geoghegan wrote:
>>>We have a Postgres 13 open item for Disk-based hash aggregate, which
>>>is the only non-trivial open item. There is a general concern about
>>>how often we get disk-based hash aggregation when work_mem is
>>>particularly low, and recursion seems unavoidable. This is generally
>>>thought to be a problem in the costing.
>>
>>We discussed two approaches to tweaking the cost model:
>>
>>1. Penalize HashAgg disk costs by a constant amount. It seems to be
>>chosen a little too often, and a constant penalty would reduce the
>>number of plan changes.
>>
>>2. Treat recursive HashAgg spilling skeptically, and heavily penalize
>>recursive spilling.
>>
>>The problem with approach #2 is that we have a default hash mem of 4MB,
>>and real systems have a lot more than that. In this scenario, recursive
>>spilling can beat Sort by a lot.
>>
>
>I think the main issue is that we're mostly speculating about what's
>wrong. I've shared some measurements and symptoms, and we've discussed
>what might be causing it, but I don't think we really know for sure.
>
>I really dislike (1) because it seems like a "we don't know what's
>wrong, so we'll penalize hashagg" kind of solution. A much more
>principled solution would be to tweak the costing accordingly, not just
>add some constant. For (2), it really depends on whether recursive
>spilling is actually the problem here. In the examples I shared, the
>number of partitions/batches was very different, but the query duration
>was mostly independent of it (almost constant).
>

I've decided to look at the costing a bit more closely today, to see
why the costing is so much lower compared to sort/groupagg. I've used
the same 32GB dataset and query as in [1].

I've repeated the tests for all the work_mem values, and I see the
number of batches is much lower now, probably thanks to the HLL
improvement:

  2MB    Planned: 64     Batches (old): 4160    Batches: 2977
  4MB    Planned: 128    Batches (old): 16512   Batches: 1569
  8MB    Planned: 256    Batches (old): 21488   Batches: 1289
  64MB   Planned: 32     Batches (old): 2720    Batches: 165
  256MB  Planned: 8      Batches (old): 8       Batches: 41

The impact on duration of the queries seems pretty negligible, though.
The plans with work_mem=64MB look like this:

1) hashagg

                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=11267293.86..11267293.86 rows=1 width=36) (actual time=213127.515..213127.517 rows=0 loops=1)
   ->  HashAggregate  (cost=10229061.10..11267293.86 rows=6718533 width=36) (actual time=100862.623..212948.642 rows=6400000 loops=1)
         Group Key: lineitem.l_partkey
         Planned Partitions: 32  Batches: 165  Memory Usage: 67857kB  Disk Usage: 6802432kB
         ->  Seq Scan on lineitem  (cost=0.00..5519288.36 rows=191990736 width=9) (actual time=0.418..20344.631 rows=192000551 loops=1)
 Planning Time: 0.064 ms
 Execution Time: 213441.986 ms
(7 rows)

2) groupagg

                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=36755617.81..36755617.81 rows=1 width=36) (actual time=180029.594..180029.595 rows=0 loops=1)
   ->  GroupAggregate  (cost=35214909.30..36755617.81 rows=6718533 width=36) (actual time=94017.788..179857.683 rows=6400000 loops=1)
         Group Key: lineitem.l_partkey
         ->  Sort  (cost=35214909.30..35694886.14 rows=191990736 width=9) (actual time=94017.750..151138.727 rows=192000551 loops=1)
               Sort Key: lineitem.l_partkey
               Sort Method: external merge  Disk: 3742208kB
               ->  Seq Scan on lineitem  (cost=0.00..5519288.36 rows=191990736 width=9) (actual time=0.008..26831.264 rows=192000551 loops=1)
 Planning Time: 0.063 ms
 Execution Time: 180209.580 ms
(9 rows)

I still don't understand why the hashagg is costed so low compared to
the sort (only about 1/3 of it), because both cases use exactly the
same estimates internally. cost_tuplesort ends up with

  npages = 937455
  nruns = 114.435396
  input_bytes = 7679629440
  log_runs = 1.0

while cost_agg uses

  pages_read = 937455
  pages_written = 937455
  relation_size = 7679629440

yet we end up with a much lower estimate for hashagg. It does seem to
me, however, that this is mostly due to non-I/O costs, considered by
cost_tuplesort and perhaps ignored by cost_agg. In particular, most of
the sort cost comes from this:

  *startup_cost = comparison_cost * tuples * LOG2(tuples);

So I'm wondering if the hashagg is not ignoring similar non-I/O costs
for the spilling case. In particular, the initial section computing
startup_cost seems to ignore that we may need to do some of the stuff
repeatedly - for example we'll repeat hash lookups for spilled tuples,
and so on.

The other thing is that sort seems to be doing only about half the
physical I/O (as measured by iosnoop) compared to hashagg, even though
the estimates of pages / input_bytes are exactly the same. For hashagg
iosnoop shows 5921MB reads and 7185MB writes, while sort only does
2895MB reads and 3655MB writes. That kinda matches the observed sizes
of temp files in the two cases, so the input_bytes for sort seems to be
a bit overestimated.

regards

[1] https://www.postgresql.org/message-id/20200724012248.y77rpqc73agrsvb3@development

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
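As a rough cross-check of that sort term, with the default cost
parameters (cpu_operator_cost = 0.0025, so comparison_cost =
2.0 * cpu_operator_cost = 0.005) the comparison charge alone is:

  LOG2(191990736)           ~= 27.5 comparisons per input tuple
  0.005 * 191990736 * 27.5  ~= 26.4M cost units

That accounts for most of the ~29.7M gap between the Sort node's
startup cost (35.2M) and the seq scan feeding it (5.5M), and cost_agg
charges no comparable per-tuple CPU term for the spilled case - which
would explain much of the roughly 1/3 cost ratio above.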
On Sun, 2020-08-30 at 17:03 +0200, Tomas Vondra wrote:
> So I'm wondering if the hashagg is not ignoring similar non-I/O costs
> for the spilling case. In particular, the initial section computing
> startup_cost seems to ignore that we may need to do some of the stuff
> repeatedly - for example we'll repeat hash lookups for spilled
> tuples, and so on.

To fix that, we'd also need to change the cost of in-memory HashAgg,
right?

> The other thing is that sort seems to be doing only about half the
> physical I/O (as measured by iosnoop) compared to hashagg, even
> though the estimates of pages / input_bytes are exactly the same.
> For hashagg iosnoop shows 5921MB reads and 7185MB writes, while sort
> only does 2895MB reads and 3655MB writes. That kinda matches the
> observed sizes of temp files in the two cases, so the input_bytes
> for sort seems to be a bit overestimated.

Hmm, interesting.

How reasonable is it to be making these kinds of changes to the cost
model right now? I think your analysis is solid, but I'm worried about
making more intrusive changes very late in the cycle.

I had originally tried to limit the cost model changes to the new plans
I am introducing -- that is, HashAgg plans expected to require disk.
That's why I came up with a rather arbitrary penalty.

Regards,
	Jeff Davis
On Mon, Aug 31, 2020 at 11:34:34PM -0700, Jeff Davis wrote:
>On Sun, 2020-08-30 at 17:03 +0200, Tomas Vondra wrote:
>> So I'm wondering if the hashagg is not ignoring similar non-I/O costs
>> for the spilling case. In particular, the initial section computing
>> startup_cost seems to ignore that we may need to do some of the stuff
>> repeatedly - for example we'll repeat hash lookups for spilled
>> tuples, and so on.
>
>To fix that, we'd also need to change the cost of in-memory HashAgg,
>right?
>

Why? I don't think we need to change the costing of in-memory HashAgg.
My assumption was we'd only tweak startup_cost for cases with spilling,
by adding something like (cpu_operator_cost * npartitions * ntuples).

>> The other thing is that sort seems to be doing only about half the
>> physical I/O (as measured by iosnoop) compared to hashagg, even
>> though the estimates of pages / input_bytes are exactly the same.
>> For hashagg iosnoop shows 5921MB reads and 7185MB writes, while sort
>> only does 2895MB reads and 3655MB writes. That kinda matches the
>> observed sizes of temp files in the two cases, so the input_bytes
>> for sort seems to be a bit overestimated.
>
>Hmm, interesting.
>

FWIW I suspect some of this difference may be due to logical vs.
physical I/O. iosnoop only tracks physical I/O sent to the device, but
maybe we do much more logical I/O and it simply does not expire from
page cache for the sort. It might behave differently for larger data
sets, longer queries, ...

>How reasonable is it to be making these kinds of changes to the cost
>model right now? I think your analysis is solid, but I'm worried about
>making more intrusive changes very late in the cycle.
>
>I had originally tried to limit the cost model changes to the new plans
>I am introducing -- that is, HashAgg plans expected to require disk.
>That's why I came up with a rather arbitrary penalty.
>

I don't know. I certainly understand the desire not to change things
this late. OTOH I'm worried that we'll end up receiving a lot of poor
plans post release.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, 2020-09-01 at 11:19 +0200, Tomas Vondra wrote:
> Why? I don't think we need to change the costing of in-memory
> HashAgg. My assumption was we'd only tweak startup_cost for cases
> with spilling, by adding something like (cpu_operator_cost *
> npartitions * ntuples).

The code above (the in-memory case) has a clause:

  startup_cost += (cpu_operator_cost * numGroupCols) * input_tuples;

which seems to account only for the hash calculation, because it's
multiplying by the number of grouping columns.

Your calculation would also use cpu_operator_cost, but just for the
lookup. I'm OK with that, but it's a little inconsistent to only count
it for the tuples that spill to disk.

But why multiply by the number of partitions? Wouldn't it be the depth?
A wide fanout will not increase the number of lookups.

> FWIW I suspect some of this difference may be due to logical vs.
> physical I/O. iosnoop only tracks physical I/O sent to the device,
> but maybe we do much more logical I/O and it simply does not expire
> from page cache for the sort. It might behave differently for larger
> data sets, longer queries, ...

That would suggest something like a penalty for HashAgg for being a
worse IO pattern. Or do you have another suggestion?

> I don't know. I certainly understand the desire not to change things
> this late. OTOH I'm worried that we'll end up receiving a lot of poor
> plans post release.

I was reacting mostly to changing the cost of Sort. Do you think
changes to Sort are required, or did I misunderstand?

Regards,
	Jeff Davis
On Tue, Sep 01, 2020 at 12:58:59PM -0700, Jeff Davis wrote:
>On Tue, 2020-09-01 at 11:19 +0200, Tomas Vondra wrote:
>> Why? I don't think we need to change the costing of in-memory
>> HashAgg. My assumption was we'd only tweak startup_cost for cases
>> with spilling, by adding something like (cpu_operator_cost *
>> npartitions * ntuples).
>
>The code above (the in-memory case) has a clause:
>
>  startup_cost += (cpu_operator_cost * numGroupCols) * input_tuples;
>
>which seems to account only for the hash calculation, because it's
>multiplying by the number of grouping columns.
>
>Your calculation would also use cpu_operator_cost, but just for the
>lookup. I'm OK with that, but it's a little inconsistent to only count
>it for the tuples that spill to disk.
>
>But why multiply by the number of partitions? Wouldn't it be the depth?
>A wide fanout will not increase the number of lookups.
>

Yeah, I think you're right it should be depth, not the number of
partitions. FWIW I don't know if this is enough to "fix" the costing;
it's just something I noticed while looking at the code.

>> FWIW I suspect some of this difference may be due to logical vs.
>> physical I/O. iosnoop only tracks physical I/O sent to the device,
>> but maybe we do much more logical I/O and it simply does not expire
>> from page cache for the sort. It might behave differently for larger
>> data sets, longer queries, ...
>
>That would suggest something like a penalty for HashAgg for being a
>worse IO pattern. Or do you have another suggestion?
>

Possibly, yes. I think it'd be good to measure logical I/O (e.g. by
adding some instrumentation to LogicalTapeSet) to see if this
hypothesis is actually true.

FWIW any thoughts about the difference in temp size compared to
CP_SMALL_TLIST?

>> I don't know. I certainly understand the desire not to change things
>> this late. OTOH I'm worried that we'll end up receiving a lot of
>> poor plans post release.
>
>I was reacting mostly to changing the cost of Sort. Do you think
>changes to Sort are required, or did I misunderstand?
>

Not sure I'm following. I don't think anyone proposed changing the
costing for Sort. Or did I miss something?

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
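In code form, the depth-based charge discussed above might look
something like this sketch (the names are assumptions; nbatches and
num_partitions stand in for the planner's existing estimates):

  /*
   * Sketch only: each spilled tuple is written out and re-read once
   * per recursion level, so charge per level ("depth"), not per
   * partition - a wider fanout adds partitions, not extra lookups.
   */
  depth = ceil(log(nbatches) / log(num_partitions));
  startup_cost += cpu_operator_cost * depth * input_tuples;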
On Tue, Sep 1, 2020 at 2:19 AM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> FWIW I suspect some of this difference may be due to logical vs.
> physical I/O. iosnoop only tracks physical I/O sent to the device,
> but maybe we do much more logical I/O and it simply does not expire
> from page cache for the sort. It might behave differently for larger
> data sets, longer queries, ...

There is also the fact that the LogicalTapeSetBlocks() instrumentation
is known to have problems that we still need to fix:

https://www.postgresql.org/message-id/CAH2-Wzn5PCBLUrrds=hD439LtWP+PD7ekRTd=8LdtqJ+KO5D1Q@mail.gmail.com

I'm not suggesting that this is a significant factor here. But I can't
rule it out just yet either.

> I don't know. I certainly understand the desire not to change things
> this late. OTOH I'm worried that we'll end up receiving a lot of poor
> plans post release.

I think that this needs to get fixed before release.

--
Peter Geoghegan
On Sun, 2020-08-30 at 17:03 +0200, Tomas Vondra wrote:
> So I'm wondering if the hashagg is not ignoring similar non-I/O costs
> for the spilling case. In particular, the initial section computing
> startup_cost seems to ignore that we may need to do some of the stuff
> repeatedly - for example we'll repeat hash lookups for spilled
> tuples, and so on.

I tried to analyze this using a slightly different approach: cost units
per second of runtime. Obviously this will vary based on the machine,
but it's interesting anyway. All of the following fit in system memory.
Schema, data, and queries are at the end of this email.

A low value of cost-units/second-runtime means "more likely to be
chosen incorrectly" and a high value means "more likely to be missed
incorrectly".

 Plan    | work_mem | 10M  | 100M INT4    | 100M | 10M
         |          | INT4 | (10M groups) | INT4 | TEXT
---------+----------+------+--------------+------+------
 HashAgg |   4MB    |   88 |           63 |   82 |   78
 HashAgg |   1TB    |   41 |           37 |   33 |   38
 Sort    |   4MB    |  182 |          188 |  174 |   37
 Sort    |   1TB    |  184 |          231 |  189 |   30

Sort is consistent for a given datatype, but it seems that the
comparison cost is far from proportionate between data types. HashAgg
is consistent between the data types, but the disk costs play a larger
role (in this case, there is no actual disk access to worry about,
because everything fits in system memory).

At least for these simple queries, Sort is punished unfairly for INT4,
but gets an unfair boost for TEXT. It seems that we should make a
change here, but to be conservative for 13, we should limit changes to
the new plans, i.e. the first line (HashAgg that spills).

The attached patch makes two adjustments: a 2X disk penalty for
HashAgg, and I also add:

  spill_cost = depth * input_tuples * 2.0 * cpu_tuple_cost

The new results:

 Plan    | work_mem | 10M  | 100M INT4    | 100M | 10M
         |          | INT4 | (10M groups) | INT4 | TEXT
---------+----------+------+--------------+------+------
 HashAgg |   4MB    |  192 |          131 |  178 |  166

That's much more comparable to Sort for INT4, but it makes the gap
wider for TEXT. Fortunately, at least for my simple queries, it just
barely avoids a plan change to Sort for the TEXT case (which is nearly
5X slower than HashAgg).

I think this approach is reasonable for 13: it only changes the costing
for HashAgg plans that are expected to spill, it's fairly conservative
so we will not get lots of new bad plans, and it still seems to use
HashAgg for cases where it clearly should.

Note: in-memory HashAgg is still unfairly favored over Sort, at least
for these cases.

Regards,
	Jeff Davis

create table t10m(i int);
insert into t10m
  select (random()*1000000000)::int from generate_series(1,10000000);
explain analyze select distinct i from t10m;

create table t100m(i int);
insert into t100m
  select (random()*2000000000)::int from generate_series(1,100000000);
explain analyze select distinct i from t100m;

-- 100m tuples, 10m groups, 10 tuples/group
create table t100m_10(i int);
insert into t100m_10
  select (random()*10000000)::int from generate_series(1,100000000);
explain analyze select distinct i from t100m_10;

create table text10m(t text collate "C.UTF-8", i int, n numeric);
insert into text10m
  select s.g::text, s.g, s.g::numeric
  from (select (random()*1000000000)::int as g
        from generate_series(1,10000000)) s;
explain analyze select distinct t from text10m;
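The two adjustments together would look roughly like this in cost_agg()
(a sketch with assumed variable names; the patch itself may differ in
detail):

  /*
   * Sketch: double the I/O charge for HashAgg spill traffic ...
   */
  startup_cost += 2.0 * (pages_written + pages_read) * seq_page_cost;

  /*
   * ... and charge CPU for writing each input tuple out and reading
   * it back, once per expected recursion level ("depth").
   */
  spill_cost = depth * input_tuples * 2.0 * cpu_tuple_cost;
  startup_cost += spill_cost;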
On Wed, Sep 2, 2020 at 5:18 PM Jeff Davis <pgsql@j-davis.com> wrote:
> create table text10m(t text collate "C.UTF-8", i int, n numeric);
> insert into text10m
>   select s.g::text, s.g, s.g::numeric
>   from (select (random()*1000000000)::int as g
>         from generate_series(1,10000000)) s;
> explain analyze select distinct t from text10m;

Note that you won't get what Postgres considers to be the C collation
unless you specify "collate C" -- "C.UTF-8" is the C collation exposed
by glibc. The difference matters a lot, because only the former can use
abbreviated keys (unless you manually #define TRUST_STRXFRM). And even
without abbreviated keys it's probably still significantly faster for
other reasons.

This doesn't undermine your point, because we don't take the difference
into account in cost_sort() -- even though abbreviated keys will
regularly make text sorts 2x-3x faster. My point is only that it would
be more accurate to say that the costing unfairly boosts sorts on
collated text specifically. Though maybe not when an ICU collation is
used (since abbreviated keys will be enabled generally).

--
Peter Geoghegan
On Wed, 2020-09-02 at 17:35 -0700, Peter Geoghegan wrote:
> On Wed, Sep 2, 2020 at 5:18 PM Jeff Davis <pgsql@j-davis.com> wrote:
> > create table text10m(t text collate "C.UTF-8", i int, n numeric);
> > insert into text10m
> >   select s.g::text, s.g, s.g::numeric
> >   from (select (random()*1000000000)::int as g
> >         from generate_series(1,10000000)) s;
> > explain analyze select distinct t from text10m;
>
> Note that you won't get what Postgres considers to be the C collation
> unless you specify "collate C" -- "C.UTF-8" is the C collation
> exposed by glibc. The difference matters a lot, because only the
> former can use abbreviated keys (unless you manually #define
> TRUST_STRXFRM). And even without abbreviated keys it's probably still
> significantly faster for other reasons.

Thank you. I reran with:

  create table text10m2(t text collate "C", i int, n numeric);
  -- same data, same queries

And the new table is:

 Plan     | work | 10M  | 100M INT4 | 100M | 10M  | 10M
          | mem  | INT4 | 10M grps  | INT4 | TEXT | TEXTC
----------+------+------+-----------+------+------+-------
 HashAgg  | 4MB  |   88 |        63 |   82 |   78 |    80
 HashAgg  | 1TB  |   41 |        37 |   33 |   38 |    43
 Sort     | 4MB  |  182 |       188 |  174 |   37 |   146
 Sort     | 1TB  |  184 |       231 |  189 |   30 |   149
 HashAgg* | 4MB  |  192 |       131 |  178 |  166 |   176

 *: patched

For the 'COLLATE "C"' case, the costs still come out almost the same
between HashAgg and Sort, but the runtimes are much closer. So even if
the patch did flip the plan from HashAgg to Sort, it would go from 9.5s
(HashAgg) to 12s (Sort), which is not so bad.

So the patched version looks good to me at this point. It accounts for
Tomas's observations about IO:

  "The other thing is that sort seems to be doing only about half the
  physical I/O (as measured by iosnoop) compared to hashagg, even
  though the estimates of pages / input_bytes are exactly the same."

by penalizing HashAgg disk costs by 2X. The patch also accounts for his
other observation about missing CPU costs by costing the spilling
itself. Tomas framed the CPU costs as the cost of the extra lookups,
but the extra lookups happen only in the cases where a tuple misses the
hash table and needs to spill. So I think it's reasonable to consider
the extra lookups as part of the spill cost.

The remaining problems are:

* comparison costs for Sort should be adjusted to make them relatively
  consistent between data types
* in-memory HashAgg is unfairly favored in a lot of cases

I don't think either of those problems needs to be (or should be) fixed
in 13, but we can revisit in 14 if there are reports of bad plans.

Regards,
	Jeff Davis
On Tue, 2020-09-01 at 23:19 +0200, Tomas Vondra wrote:
> FWIW any thoughts about the difference in temp size compared to
> CP_SMALL_TLIST?

Are you referring to results from a while ago? In this thread I don't
see what you're referring to.

I tried a simple case on REL_13_STABLE, with and without the
CP_SMALL_TLIST change, and I saw only a tiny difference. Do you have a
current case that shows a larger difference?

The only thing I can think of that might change is the size of the null
bitmap or how fields are aligned.

Regards,
	Jeff Davis
On Thu, Sep 03, 2020 at 05:53:43PM -0700, Jeff Davis wrote:
>On Tue, 2020-09-01 at 23:19 +0200, Tomas Vondra wrote:
>> FWIW any thoughts about the difference in temp size compared to
>> CP_SMALL_TLIST?
>
>Are you referring to results from a while ago? In this thread I don't
>see what you're referring to.
>
>I tried a simple case on REL_13_STABLE, with and without the
>CP_SMALL_TLIST change, and I saw only a tiny difference. Do you have a
>current case that shows a larger difference?
>

I'm referring to the last charts in the message from July 27, comparing
behavior with the CP_SMALL_TLIST fix vs. master (which
reverted/replaced the CP_SMALL_TLIST bit). Those charts show that
CP_SMALL_TLIST resulted in smaller temp files (per EXPLAIN ANALYZE the
difference is ~25%) and also lower query durations (also in the ~25%
range). I can repeat those tests, if needed.

[1] https://www.postgresql.org/message-id/20200724012248.y77rpqc73agrsvb3@development

>The only thing I can think of that might change is the size of the
>null bitmap or how fields are aligned.
>

Maybe. Not sure.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, 2020-09-04 at 14:56 +0200, Tomas Vondra wrote:
> Those charts show that the CP_SMALL_TLIST resulted in smaller temp
> files (per EXPLAIN ANALYZE the difference is ~25%) and also lower
> query durations (also in the ~25% range).

I was able to reproduce the problem, thank you.

Only two attributes are needed, so the CP_SMALL_TLIST projected schema
only needs a single-byte null bitmap. But if we just set the unneeded
attributes to NULL rather than projecting them away, the null bitmap
size is based on all 16 attributes, bumping the bitmap size to two
bytes:

  MAXALIGN(23 + 1) = 24
  MAXALIGN(23 + 2) = 32

I think that explains it. It's not ideal, but projection has a cost as
well, so I don't think we necessarily need to do something here.

If we are motivated to improve this in v14, we could potentially have a
different schema for spilled tuples, and perform real projection at
spill time. But I don't know if that's worth the extra complexity.

Regards,
	Jeff Davis
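In terms of the existing htup_details.h macros, where BITMAPLEN(natts)
is ((natts + 7) / 8) and the 23 bytes are the tuple header before the
bitmap, the per-tuple sizes work out as:

  BITMAPLEN(2)  = 1  =>  MAXALIGN(23 + 1) = 24 bytes/tuple
  BITMAPLEN(16) = 2  =>  MAXALIGN(23 + 2) = 32 bytes/tuple

so each spilled tuple is 8 bytes (25%) smaller with the projected
two-attribute schema - consistent with the ~25% temp file difference
observed above.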
On Fri, Sep 04, 2020 at 11:31:36AM -0700, Jeff Davis wrote:
>On Fri, 2020-09-04 at 14:56 +0200, Tomas Vondra wrote:
>> Those charts show that the CP_SMALL_TLIST resulted in smaller temp
>> files (per EXPLAIN ANALYZE the difference is ~25%) and also lower
>> query durations (also in the ~25% range).
>
>I was able to reproduce the problem, thank you.
>
>Only two attributes are needed, so the CP_SMALL_TLIST projected schema
>only needs a single-byte null bitmap. But if we just set the unneeded
>attributes to NULL rather than projecting them away, the null bitmap
>size is based on all 16 attributes, bumping the bitmap size to two
>bytes:
>
>  MAXALIGN(23 + 1) = 24
>  MAXALIGN(23 + 2) = 32
>
>I think that explains it. It's not ideal, but projection has a cost as
>well, so I don't think we necessarily need to do something here.
>
>If we are motivated to improve this in v14, we could potentially have
>a different schema for spilled tuples, and perform real projection at
>spill time. But I don't know if that's worth the extra complexity.
>

Thanks for the investigation and explanation.

Wouldn't it be enough to just use a slot with a smaller tuple
descriptor? All we'd need to do is create the descriptor in ExecInitAgg
after calling find_hash_columns, use it for rslot/wslot, and then "map"
the attributes in hashagg_spill_tuple (which already almost does that,
so the extra cost should be ~0) and when reading the spilled tuples. So
I'm not quite buying the argument that this would make a measurable
difference ...

That being said, I won't insist on fixing this in v13 - at least we
know what the issue is and we can fix it later. The costing seems like
a more serious open item.

OTOH I don't think this example is particularly extreme, and I wouldn't
be surprised if we saw even worse examples in practice - tables tend to
be quite wide, and aggregation of just a few columns seems likely.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
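A sketch of the attribute mapping described above (the slot calls are
the real executor APIs; the variable names are assumptions for
illustration):

  /*
   * Sketch: before writing a tuple to the spill file, copy only the
   * needed columns into a narrow slot built from the small descriptor
   * created in ExecInitAgg, then materialize that slot - giving the
   * 1-byte null bitmap without projecting every input tuple.
   */
  slot_getsomeattrs(inputslot, maxNeededAttr);
  ExecClearTuple(spillslot);
  for (int i = 0; i < numSpillCols; i++)
  {
      spillslot->tts_values[i] = inputslot->tts_values[spillColIdx[i] - 1];
      spillslot->tts_isnull[i] = inputslot->tts_isnull[spillColIdx[i] - 1];
  }
  ExecStoreVirtualTuple(spillslot);
  tuple = ExecCopySlotMinimalTuple(spillslot);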
On Fri, 2020-09-04 at 21:01 +0200, Tomas Vondra wrote:
> Wouldn't it be enough to just use a slot with a smaller tuple
> descriptor? All we'd need to do is create the descriptor in
> ExecInitAgg after calling find_hash_columns, use it for rslot/wslot,
> and then "map" the attributes in hashagg_spill_tuple (which already
> almost does that, so the extra cost should be ~0) and when reading
> the spilled tuples.

That's a good point, it's probably not much code to make it work.

> So I'm not quite buying the argument that this would make a
> measurable difference ...

I meant that "projection of all input tuples" (i.e. CP_SMALL_TLIST) has
a cost. If we project only at spill time, it should be fine.

Regards,
	Jeff Davis
Hi,

I've tested the costing changes on the simplified TPC-H query, on two
different machines, and it seems like a clear improvement.

This is using the same cost/duration measure, which I think is a pretty
neat way to look at this. Sure, it's imperfect (it depends on which
costs and durations you actually take, etc.), but it makes the
comparisons easier, and for simple queries it does not matter that
much.

The costing clearly depends on parameters like random_page_cost and how
well that matches the hardware, but for the machine with an SSD and the
default random_page_cost the effect looks like this:

  work_mem    sort    master    patched
  ---------------------------------------
  1MB          249        95        215
  2MB          256        89        187
  4MB          233        90        192
  8MB          227        70        124
  16MB         245        67        118
  32MB         261        63        111
  64MB         256        59        104
  256MB        266        55        102

and with random_page_cost reduced to 1.5 it looks like this:

  work_mem    sort    master    patched
  ---------------------------------------
  1MB          221        63        150
  2MB          227        64        133
  4MB          214        65        137
  8MB          214        57         95
  16MB         232        53         90
  32MB         247        50         85
  64MB         249        47         80
  256MB        258        46         77

And on a machine with SATA RAID storage it looks like this:

  work_mem    sort    master    patched
  ---------------------------------------
  1MB          102        41         94
  2MB          101        34         77
  4MB           99        35         78
  8MB           98        35         79
  16MB          98        25         50
  32MB         106        26         51
  64MB         106        26         51
  256MB        105        29         50

So yeah, the patched costing is much closer to sort (from the point of
view of this cost/duration metric), although for higher work_mem values
there's still a clear gap where the hashing seems to be under-costed by
a factor of ~2 or more.

I think this is simply showing that the effect of increasing work_mem
is much more pronounced for sort/groupagg than for hashagg. For example
on the SSD machine the duration changes like this:

  work_mem    hashagg    groupagg
  ---------------------------------
  1MB             217         201
  2MB             178         195
  4MB             176         186
  8MB             160         176
  16MB            168         163
  32MB            180         153
  64MB            189         143
  256MB           204         138

and the SATA RAID storage seems to behave in a similar way (although
the difference is smaller).

So in general I think this costing change is reasonable. It might not
go far enough, but it probably makes it easier to tweak the rest by
changing random_page_cost etc. I'm not sure if we need/should tweak the
costing to reduce the effect of work_mem (on hashagg).

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, 2020-09-06 at 23:21 +0200, Tomas Vondra wrote:
> I've tested the costing changes on the simplified TPC-H query, on two
> different machines, and it seems like a clear improvement.

Thank you. Committed.

> So yeah, the patched costing is much closer to sort (from the point
> of view of this cost/duration metric), although for higher work_mem
> values there's still a clear gap where the hashing seems to be
> under-costed by a factor of ~2 or more.

There seems to be a cliff right after 4MB. Perhaps lookup costs on a
larger hash table?

Regards,
	Jeff Davis
On Mon, Sep 07, 2020 at 01:55:28PM -0700, Jeff Davis wrote:
>On Sun, 2020-09-06 at 23:21 +0200, Tomas Vondra wrote:
>> I've tested the costing changes on the simplified TPC-H query, on
>> two different machines, and it seems like a clear improvement.
>
>Thank you. Committed.
>
>> So yeah, the patched costing is much closer to sort (from the point
>> of view of this cost/duration metric), although for higher work_mem
>> values there's still a clear gap where the hashing seems to be
>> under-costed by a factor of ~2 or more.
>
>There seems to be a cliff right after 4MB. Perhaps lookup costs on a
>larger hash table?
>

I assume you mean higher costs due to the hash table outgrowing some
level of CPU cache (L2/L3), right? Good guess - the CPU has ~6MB of
cache - but no. This seems to be merely due to costing, because the raw
cost/duration numbers look like this:

  work_mem        cost    duration
  ---------------------------------
  1MB         20627403      216861
  2MB         15939722      178237
  4MB         15939722      176296
  8MB         11252041      160479
  16MB        11252041      168304
  32MB        11252041      179567
  64MB        11252041      189410
  256MB       11252041      204206

This is unpatched master; with the costing patch it looks similar
except that the cost is about 2x higher. On the SATA RAID machine, it
looks like this:

  work_mem         cost    duration
  -----------------------------------
  1MB         108358461     1147269
  2MB          77381688     1004895
  4MB          77381688      994853
  8MB          77381688      980071
  16MB         46404915      930511
  32MB         46404915      902167
  64MB         46404915      908757
  256MB        46404915      926862

So roughly the same - the cost drops to less than 50%, but the duration
really does not. This is what I referred to when I said "Not sure if we
need/should tweak the costing to reduce the effect of work_mem (on
hashagg)."

For sort this seems to behave a bit more nicely - the cost and duration
(with increasing work_mem) are correlated quite well, I think.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services