Re: scale parallel_tuple_cost by tuple width - Mailing list pgsql-hackers

From Andrew Dunstan
Subject Re: scale parallel_tuple_cost by tuple width
Date
Msg-id 2d7c4e54-6a0b-4b1a-87a6-278c49fa52f0@dunslane.net
In response to Re: scale parallel_tuple_cost by tuple width  (David Rowley <dgrowleyml@gmail.com>)
List pgsql-hackers
On 2026-03-30 Mo 6:51 PM, David Rowley wrote:
> On Tue, 31 Mar 2026 at 03:17, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Andrew Dunstan <andrew@dunslane.net> writes:
>>> While investigating a performance issue, I found that it was extremely
>>> difficult to get a parallel plan in some cases due to the fixed
>>> parallel_tuple_cost. But this cost is not really fixed - it's going to
>>> be larger for larger tuples. So this proposal adjusts the cost used
>>> according to how large we expect the results to be.
>> Unfortunately, I'm afraid that this is going to produce mostly
>> "garbage in, garbage out" estimates, because our opinion of how wide
>> tuples-in-flight are is pretty shaky.  (See get_expr_width and
>> particularly get_typavgwidth, and note that we only have good
>> statistics-based numbers for plain Vars not function outputs.)
>> I agree that it could be useful to have some kind of adjustment here,
>> but I fear that linear scaling is putting way too much faith in the
>> quality of the data.
> (I suspect you're saying this because of the "Benchmark 2" in the text
> file, which contains aggregates which return a varlena type, of which
> we won't estimate the width very well for...)
>
> Sure, it's certainly true that there are cases where we don't get the
> width estimate right, but that's not stopped us using them before. So
> why is this case so much more critical?  We now also have GROUP BY
> before join abilities in the planner, which I suspect must also be
> putting trust into the very same thing. Also, varlena-returning
> Aggrefs aren't going to be the Gather/GatherMerge targetlist, so why
> avoid making improvements in this area because we're not great at one
> of the things that could be in the targetlist?
>
> For the patch and the analysis: This reminds me of [1], where some
> reverse-engineering of costs from query run-times was done, which
> ended up figuring out what we set APPEND_CPU_COST_MULTIPLIER to. To
> get that for this case would require various tests with different
> tuple widths and ensuring that the costs scale linearly with the
> run-time of the query with the patched version. Of course, the test
> query would have to have perfect width estimates, but that could be
> easy enough to do by populating a text typed GROUP BY column and
> populating that with all the same width of text for each of the tests
> before increasing the width for the next test, using a fixed-width
> aggregate each time, e.g count(*). The "#define
> PARALLEL_TUPLE_COST_REF_WIDTH 100" does seem to be quite a round
> number. It would be good to know how close this is to reality.
> Ideally, it would be good to see results from an Apple M<something>
> chip and recent x86. In my experience, these produce very different
> performance results, so it might be nice to find a value that is
> somewhere in the middle of what we get from those machines. I suspect
> having the GROUP BY column with text widths from 8 to 1024, increasing
> in powers of two would be enough data points.


I followed your suggested methodology to measure how Gather IPC
cost actually scales with tuple width.

Setup: 10M rows, 100K distinct text values per table, text column
padded to a fixed width with lpad().  Query: SELECT txt, count(*)
FROM ptc_bench_W GROUP BY txt.  This produces Partial HashAggregate
in workers, then Gather passes ~240K partial-aggregate tuples whose
width is dominated by the text column.  2 workers, work_mem=256MB,
cassert=off, debugoptimized build, aarch64 Linux.

I tested widths from 8 to 1024 bytes (10 data points).  For each
width, I ran 5 iterations of both parallel and serial execution,
and computed the Gather overhead as:

   overhead = T_parallel - (T_serial / 3)

This isolates the IPC cost: the serial time captures pure scan +
aggregate work, and dividing by 3 gives the ideal parallel time
(2 workers + leader).  The excess is Gather overhead.
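As a sketch of that computation (the timing values here are made up
for illustration; only the 240K tuple count and the 3-process split
come from the setup above):

```python
def gather_overhead_us_per_tuple(t_parallel, t_serial,
                                 n_tuples=240_000, n_procs=3):
    # Ideal parallel time: the serial work split evenly across
    # 2 workers + leader.
    ideal = t_serial / n_procs
    # Whatever the parallel run takes beyond that is attributed
    # to Gather IPC overhead.
    overhead = t_parallel - ideal
    # Express it per tuple passed through Gather, in microseconds.
    return overhead * 1e6 / n_tuples

# Hypothetical example: t_serial = 3.0 s, t_parallel = 1.24 s
# -> ideal = 1.0 s, overhead = 0.24 s -> 1.0 us/tuple
```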

Results (microseconds per tuple through Gather, median of 5 runs):

   Width(B)  us/tuple   Implied ptc (if ptc=0.1 at w=100)
   --------  --------   ----------------------------------
          8     0.30     0.032
         16     0.24     0.025
         32     0.77     0.083
         64     0.72     0.078
        128     1.03     0.111
        256     1.62     0.175
        384     2.90     0.313
        512     3.21     0.346
        768     4.12     0.445
       1024     5.56     0.600

The best-fit models:

   Linear:    cost(w) = 0.42 + 0.0051 * w     R² = 0.983
   Power law: cost(w) = 0.061 * w^0.63         R² = 0.966

The linear model fits better, so over this range the IPC cost
appears to scale roughly linearly with tuple width.
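The linear fit can be reproduced from the table above with plain
ordinary least squares (pure Python, no external libraries):

```python
# (width, us/tuple) pairs from the results table above.
data = [(8, 0.30), (16, 0.24), (32, 0.77), (64, 0.72), (128, 1.03),
        (256, 1.62), (384, 2.90), (512, 3.21), (768, 4.12),
        (1024, 5.56)]

n = len(data)
mw = sum(w for w, _ in data) / n          # mean width
my = sum(y for _, y in data) / n          # mean cost
slope = (sum(w * y for w, y in data) - n * mw * my) / \
        (sum(w * w for w, _ in data) - n * mw * mw)
intercept = my - slope * mw
ss_res = sum((y - (intercept + slope * w)) ** 2 for w, y in data)
ss_tot = sum((y - my) ** 2 for _, y in data)
r2 = 1 - ss_res / ss_tot
# slope ~ 0.0051, intercept ~ 0.42, r2 ~ 0.983
```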

One notable finding: at the proposed reference width of 100 bytes,
the total predicted cost is 0.42 + 0.51 = 0.93 us/tuple, of
which 0.42 is fixed.
The original patch used PARALLEL_TUPLE_COST_FIXED_FRAC = 0.10,
which substantially underestimates the width-independent component.
A higher fixed fraction would dampen the width adjustment, which
also partly addresses Tom's concern about sensitivity to width
estimate errors: with ~45% of the cost being fixed, even a 2x
error in width only translates to a ~1.5x error in total cost.
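The damping arithmetic works out as follows (using the fitted
coefficients from above):

```python
FIXED = 0.42       # us/tuple, width-independent component (fit intercept)
PER_BYTE = 0.0051  # us/tuple/byte (fit slope)

def cost(width):
    """Predicted per-tuple Gather cost for a given tuple width."""
    return FIXED + PER_BYTE * width

# True width 100 bytes, but the planner estimates 200 (a 2x error):
ratio = cost(200) / cost(100)
# cost(100) = 0.93, cost(200) = 1.44 -> ratio ~ 1.55, not 2.0,
# because the fixed component does not scale with the width error.
```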

The script used to get the timings is attached.

cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com

