Re: scale parallel_tuple_cost by tuple width - Mailing list pgsql-hackers
| From | Andrew Dunstan |
|---|---|
| Subject | Re: scale parallel_tuple_cost by tuple width |
| Date | |
| Msg-id | 2d7c4e54-6a0b-4b1a-87a6-278c49fa52f0@dunslane.net |
| In response to | Re: scale parallel_tuple_cost by tuple width (David Rowley <dgrowleyml@gmail.com>) |
| List | pgsql-hackers |
On 2026-03-30 Mo 6:51 PM, David Rowley wrote:

> On Tue, 31 Mar 2026 at 03:17, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Andrew Dunstan <andrew@dunslane.net> writes:
>>> While investigating a performance issue, I found that it was extremely
>>> difficult to get a parallel plan in some cases due to the fixed
>>> parallel_tuple_cost. But this cost is not really fixed - it's going to
>>> be larger for larger tuples. So this proposal adjusts the cost used
>>> according to how large we expect the results to be.
>>
>> Unfortunately, I'm afraid that this is going to produce mostly
>> "garbage in, garbage out" estimates, because our opinion of how wide
>> tuples-in-flight are is pretty shaky. (See get_expr_width and
>> particularly get_typavgwidth, and note that we only have good
>> statistics-based numbers for plain Vars not function outputs.)
>>
>> I agree that it could be useful to have some kind of adjustment here,
>> but I fear that linear scaling is putting way too much faith in the
>> quality of the data.
>
> (I suspect you're saying this because of the "Benchmark 2" in the text
> file, which contains aggregates which return a varlena type, of which
> we won't estimate the width very well for...)
>
> Sure, it's certainly true that there are cases where we don't get the
> width estimate right, but that's not stopped us using them before. So
> why is this case so much more critical? We now also have GROUP BY
> before join abilities in the planner, which I suspect must also be
> putting trust into the very same thing. Also, varlena-returning
> Aggrefs aren't going to be in the Gather/GatherMerge targetlist, so why
> avoid making improvements in this area because we're not great at one
> of the things that could be in the targetlist?
>
> For the patch and the analysis: This reminds me of [1], where some
> reverse-engineering of costs from query run-times was done, which
> ended up figuring out what we set APPEND_CPU_COST_MULTIPLIER to.
> To get that for this case would require various tests with different
> tuple widths and ensuring that the costs scale linearly with the
> run-time of the query with the patched version. Of course, the test
> query would have to have perfect width estimates, but that could be
> easy enough to do by populating a text typed GROUP BY column and
> populating that with all the same width of text for each of the tests
> before increasing the width for the next test, using a fixed-width
> aggregate each time, e.g. count(*). The "#define
> PARALLEL_TUPLE_COST_REF_WIDTH 100" does seem to be quite a round
> number. It would be good to know how close this is to reality.
> Ideally, it would be good to see results from an Apple M<something>
> chip and recent x86. In my experience, these produce very different
> performance results, so it might be nice to find a value that is
> somewhere in the middle of what we get from those machines. I suspect
> having the GROUP BY column with text widths from 8 to 1024, increasing
> in powers of two would be enough data points.

I followed your suggested methodology to measure how the Gather IPC cost actually scales with tuple width.

Setup: 10M rows, 100K distinct text values per table, with the text column padded to a fixed width using lpad(). Query: SELECT txt, count(*) FROM ptc_bench_W GROUP BY txt. This produces a Partial HashAggregate in the workers, after which Gather passes ~240K partial-aggregate tuples whose width is dominated by the text column. 2 workers, work_mem=256MB, cassert=off, debugoptimized build, aarch64 Linux. I tested widths from 8 to 1024 bytes (10 data points).

For each width, I ran 5 iterations of both the parallel and the serial execution, and computed the Gather overhead as:

    overhead = T_parallel - (T_serial / 3)

This isolates the IPC cost: the serial time captures the pure scan + aggregate work, and dividing by 3 gives the ideal parallel time (2 workers + leader). The excess is Gather overhead.
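To make the arithmetic concrete, here is a Python sketch of the overhead computation and of the least-squares fit used further down. The function names and the example timings are mine, not taken from the attached script; the only real inputs are the per-width medians reported below.

```python
import statistics

def gather_overhead_us_per_tuple(parallel_ms, serial_ms, n_tuples, n_procs=3):
    """Per-tuple Gather IPC overhead, in microseconds.

    parallel_ms / serial_ms are lists of run times (ms) of the same
    query; n_procs is workers + leader (2 workers + leader = 3).
    """
    t_par = statistics.median(parallel_ms)
    t_ser = statistics.median(serial_ms)
    # Ideal parallel time is the serial work split evenly across all
    # processes; anything beyond that is attributed to Gather IPC.
    overhead_ms = t_par - t_ser / n_procs
    return overhead_ms * 1000.0 / n_tuples

def linear_fit(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b, R^2)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    ss_res = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return a, b, 1.0 - ss_res / ss_tot

# Made-up timings, just to show the shape of the computation:
print(gather_overhead_us_per_tuple([1500, 1520, 1490], [3000, 3010, 2990],
                                   240_000))

# Per-width medians from the results below (us/tuple through Gather):
widths = [8, 16, 32, 64, 128, 256, 384, 512, 768, 1024]
us_per_tuple = [0.30, 0.24, 0.77, 0.72, 1.03, 1.62, 2.90, 3.21, 4.12, 5.56]

a, b, r2 = linear_fit(widths, us_per_tuple)
print(f"cost(w) = {a:.2f} + {b:.4f} * w, R^2 = {r2:.3f}")
# -> cost(w) = 0.42 + 0.0051 * w, R^2 = 0.983
```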
Results (microseconds per tuple through Gather, median of 5 runs):

    Width(B)   us/tuple   Implied ptc (if ptc=0.1 at w=100)
    --------   --------   ----------------------------------
           8       0.30   0.032
          16       0.24   0.025
          32       0.77   0.083
          64       0.72   0.078
         128       1.03   0.111
         256       1.62   0.175
         384       2.90   0.313
         512       3.21   0.346
         768       4.12   0.445
        1024       5.56   0.600

The best-fit models:

    Linear:    cost(w) = 0.42 + 0.0051 * w    R² = 0.983
    Power law: cost(w) = 0.061 * w^0.63       R² = 0.966

Linear fits best.

One notable finding: at the proposed reference width of 100 bytes, the total predicted cost is 0.42 + 0.51 = 0.93 us/tuple, of which 0.42 is fixed. The original patch used PARALLEL_TUPLE_COST_FIXED_FRAC = 0.10, which substantially underestimates the width-independent component. A higher fixed fraction would dampen the width adjustment, which also partly addresses Tom's concern about sensitivity to width estimate errors: with ~45% of the cost being fixed, even a 2x error in width only translates to a ~1.5x error in total cost.

The script used to get the timings is attached.

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com