Thread: BUG #18247: Integer overflow leads to negative width

BUG #18247: Integer overflow leads to negative width

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      18247
Logged by:          RekGRpth
Email address:      rekgrpth@gmail.com
PostgreSQL version: 16.1
Operating system:   docker alpine
Description:

CREATE TABLE t (
    c01 character(10485760),
    c02 character(10485760),
    c03 character(10485760),
    c04 character(10485760),
    c05 character(10485760),
    c06 character(10485760),
    c07 character(10485760),
    c08 character(10485760),
    c09 character(10485760),
    c10 character(10485760),
    c11 character(10485760),
    c12 character(10485760),
    c13 character(10485760),
    c14 character(10485760),
    c15 character(10485760),
    c16 character(10485760),
    c17 character(10485760),
    c18 character(10485760),
    c19 character(10485760),
    c20 character(10485760),
    c21 character(10485760),
    c22 character(10485760),
    c23 character(10485760),
    c24 character(10485760),
    c25 character(10485760),
    c26 character(10485760),
    c27 character(10485760),
    c28 character(10485760),
    c29 character(10485760),
    c30 character(10485760),
    c31 character(10485760),
    c32 character(10485760),
    c33 character(10485760),
    c34 character(10485760),
    c35 character(10485760),
    c36 character(10485760),
    c37 character(10485760),
    c38 character(10485760),
    c39 character(10485760),
    c40 character(10485760),
    c41 character(10485760),
    c42 character(10485760),
    c43 character(10485760),
    c44 character(10485760),
    c45 character(10485760),
    c46 character(10485760),
    c47 character(10485760),
    c48 character(10485760),
    c49 character(10485760),
    c50 character(10485760),
    c51 character(10485760),
    c52 character(10485760)
);
EXPLAIN SELECT * FROM t;
                         QUERY PLAN                         
------------------------------------------------------------
 Seq Scan on t  (cost=0.00..10.00 rows=1 width=-2113929008)
(1 row)
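
(For reference, the negative number is consistent with plain int32 wraparound
of the width estimate: assuming a UTF-8 database, where a character can take
up to 4 bytes, each character(10485760) column is estimated at
10485760 * 4 + 4 = 41943044 bytes including the varlena header, and 52 such
columns sum to 2181038288, which exceeds INT_MAX and wraps around to
2181038288 - 2^32 = -2113929008, exactly the width shown above.)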


Re: BUG #18247: Integer overflow leads to negative width

From
Richard Guo
Date:

On Thu, Dec 14, 2023 at 5:29 PM PG Bug reporting form <noreply@postgresql.org> wrote:
EXPLAIN SELECT * FROM t;
                         QUERY PLAN                         
------------------------------------------------------------
 Seq Scan on t  (cost=0.00..10.00 rows=1 width=-2113929008)
(1 row)

Interesting.  In an Assert-enabled build this query will cause the
Assert failure in set_rel_width().

    Assert(tuple_width >= 0);

Can we just error out when an overflow occurs?

Thanks
Richard

Re: BUG #18247: Integer overflow leads to negative width

From
Julien Rouhaud
Date:
On Thu, Dec 14, 2023 at 06:34:44PM +0800, Richard Guo wrote:
> On Thu, Dec 14, 2023 at 5:29 PM PG Bug reporting form <
> noreply@postgresql.org> wrote:
>
> > EXPLAIN SELECT * FROM t;
> >                          QUERY PLAN
> > ------------------------------------------------------------
> >  Seq Scan on t  (cost=0.00..10.00 rows=1 width=-2113929008)
> > (1 row)
>
>
> Interesting.  In an Assert-enabled build this query will cause the
> Assert failure in set_rel_width().
>
>     Assert(tuple_width >= 0);
>
> Can we just error out when an overflow occurs?

I'm worried that it could have quite a broad impact, as the same problem could
easily arise in more conventional cases, e.g. after joining some already
large relations.  I'm sure that some people have already hit this in production
without ever looking at the explain plans, and they wouldn't be happy to see
those queries start failing after the next update.



Re: BUG #18247: Integer overflow leads to negative width

From
Tom Lane
Date:
Richard Guo <guofenglinux@gmail.com> writes:
> On Thu, Dec 14, 2023 at 5:29 PM PG Bug reporting form <
> noreply@postgresql.org> wrote:
>> EXPLAIN SELECT * FROM t;
>> QUERY PLAN
>> ------------------------------------------------------------
>> Seq Scan on t  (cost=0.00..10.00 rows=1 width=-2113929008)
>> (1 row)

> Can we just error out when an overflow occurs?

Probably better to clamp tuple width estimates to MaxAllocSize.
Anything larger would not correspond to reality anyhow.

            regards, tom lane



Re: BUG #18247: Integer overflow leads to negative width

From
Richard Guo
Date:

On Thu, Dec 14, 2023 at 10:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Richard Guo <guofenglinux@gmail.com> writes:
> On Thu, Dec 14, 2023 at 5:29 PM PG Bug reporting form <
> noreply@postgresql.org> wrote:
>> EXPLAIN SELECT * FROM t;
>> QUERY PLAN
>> ------------------------------------------------------------
>> Seq Scan on t  (cost=0.00..10.00 rows=1 width=-2113929008)
>> (1 row)

> Can we just error out when an overflow occurs?

Probably better to clamp tuple width estimates to MaxAllocSize.
Anything larger would not correspond to reality anyhow.

Fair point.  How about the attached patch?

Thanks
Richard
Attachment

Re: BUG #18247: Integer overflow leads to negative width

From
RekGRpth
Date:
How bad would it be if, after overflowing, the width value was within
the allowed range?

Fri, Dec 15, 2023 at 07:28, Richard Guo <guofenglinux@gmail.com>:
>
>
> On Thu, Dec 14, 2023 at 10:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> Richard Guo <guofenglinux@gmail.com> writes:
>> > On Thu, Dec 14, 2023 at 5:29 PM PG Bug reporting form <
>> > noreply@postgresql.org> wrote:
>> >> EXPLAIN SELECT * FROM t;
>> >> QUERY PLAN
>> >> ------------------------------------------------------------
>> >> Seq Scan on t  (cost=0.00..10.00 rows=1 width=-2113929008)
>> >> (1 row)
>>
>> > Can we just error out when an overflow occurs?
>>
>> Probably better to clamp tuple width estimates to MaxAllocSize.
>> Anything larger would not correspond to reality anyhow.
>
>
> Fair point.  How about the attached patch?
>
> Thanks
> Richard



Re: BUG #18247: Integer overflow leads to negative width

From
Tom Lane
Date:
RekGRpth <rekgrpth@gmail.com> writes:
> How bad would it be if, after overflowing, the width value was within
> the allowed range?

Really pretty negligible, I should think.  You might get a not-great
choice of plan.  As Richard noted, there are some Asserts you could
hit; but not in a production build.

            regards, tom lane



Re: BUG #18247: Integer overflow leads to negative width

From
Alexander Lakhin
Date:
Hello Richard,

15.12.2023 05:28, Richard Guo wrote:

Fair point.  How about the attached patch?

Your patch looks good to me, but maybe you would find it suitable to fix in
passing one more integer overflow in costsize.c?

Concretely, the query:
CREATE TABLE t(id int PRIMARY KEY, i int);
EXPLAIN (VERBOSE)
UPDATE t SET i = ni FROM (SELECT g id, 1 ni FROM generate_series(1, 2147483648) g) s WHERE t.id = s.id;

when executed with ubsan-enabled build, gives:
costsize.c:1017:12: runtime error: 2.14748e+09 is outside the range of representable values of type 'int'
    #0 0x5603325818e0 in cost_bitmap_heap_scan .../src/backend/optimizer/path/costsize.c:1017:12
    #1 0x5603326cc519 in create_bitmap_heap_path .../src/backend/optimizer/util/pathnode.c:1065:2
...

Without ubsan enabled, the query:
EXPLAIN (VERBOSE, ANALYZE)
UPDATE t SET i = ni FROM (SELECT g id, 1 ni FROM generate_series(1, 2147483648) g) s WHERE t.id = s.id;

executes with output visually similar to that of:
EXPLAIN (VERBOSE, ANALYZE)
UPDATE t SET i = ni FROM (SELECT g id, 1 ni FROM generate_series(1, 2147483647) g) s WHERE t.id = s.id;

but takes quite a bit longer:
 Update on public.t  (cost=60.85..27122613.04 rows=0 width=0) (actual time=225204.159..225204.162 rows=0 loops=1)
vs
 Update on public.t  (cost=60.85..27122613.03 rows=0 width=0) (actual time=153015.851..153015.852 rows=0 loops=1)

Best regards,
Alexander

Re: BUG #18247: Integer overflow leads to negative width

From
Richard Guo
Date:

On Fri, Dec 15, 2023 at 2:00 PM Alexander Lakhin <exclusion@gmail.com> wrote:
Your patch looks good to me, but maybe you would find it suitable to fix in
passing one more integer overflow in costsize.c?

Concretely, the query:
CREATE TABLE t(id int PRIMARY KEY, i int);
EXPLAIN (VERBOSE)
UPDATE t SET i = ni FROM (SELECT g id, 1 ni FROM generate_series(1, 2147483648) g) s WHERE t.id = s.id;

when executed with ubsan-enabled build, gives:
costsize.c:1017:12: runtime error: 2.14748e+09 is outside the range of representable values of type 'int'
    #0 0x5603325818e0 in cost_bitmap_heap_scan .../src/backend/optimizer/path/costsize.c:1017:12
    #1 0x5603326cc519 in create_bitmap_heap_path .../src/backend/optimizer/util/pathnode.c:1065:2

Nice catch.  The overflow occurs when cost_bitmap_heap_scan() calls
compute_bitmap_pages(), and the loop_count parameter is converted from
double to int.  I wonder if we can change the loop_count parameter to be
double for compute_bitmap_pages() to avoid such overflow.
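
For illustration, here is a minimal standalone sketch of the conversion in
question (take_int is just a hypothetical stand-in for a parameter declared
"int", like the old loop_count; this is not planner code):

#include <stdio.h>

static int
take_int(int loop_count)    /* stand-in for the old "int loop_count" */
{
    return loop_count;
}

int
main(void)
{
    double loop_count = 2147483648.0;   /* INT_MAX + 1, as in the repro */

    /*
     * The implicit double -> int conversion at this call site is undefined
     * behavior for out-of-range values; with UBSan's float-cast-overflow
     * check it is flagged just like the costsize.c report above.  Without
     * the sanitizer the printed value is unpredictable (typically INT_MIN
     * on x86-64).
     */
    printf("%d\n", take_int(loop_count));
    return 0;
}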

Thanks
Richard

Re: BUG #18247: Integer overflow leads to negative width

From
Tom Lane
Date:
Richard Guo <guofenglinux@gmail.com> writes:
> On Fri, Dec 15, 2023 at 2:00 PM Alexander Lakhin <exclusion@gmail.com>
>> Your patch looks good to me, but maybe you would find it suitable to fix in
>> passing one more integer overflow in costsize.c?

> Nice catch.  The overflow occurs when cost_bitmap_heap_scan() calls
> compute_bitmap_pages(), and the loop_count parameter is converted from
> double to int.  I wonder if we can change the loop_count parameter to be
> double for compute_bitmap_pages() to avoid such overflow.

Yeah.  Seems like a flat-out error in da08a6598: that logic had been
treating loop_count as double for years, and then when it was split
out into a subroutine, somebody carelessly declared the argument as
int.  (Even sillier, all the callers are trying to pass doubles.)

compute_bitmap_pages is substantially below par as to commenting,
too.

However, I'd be a bit uncomfortable about back-patching; since that
function is globally exposed, it's at least possible that some
extension is calling it and would see an ABI break.  Is it good enough
to fix this in HEAD?  I'd argue yes, given that a loop_count larger
than INT_MAX seems like a pretty improbable case.

            regards, tom lane



Re: BUG #18247: Integer overflow leads to negative width

From
Tom Lane
Date:
Richard Guo <guofenglinux@gmail.com> writes:
> On Thu, Dec 14, 2023 at 10:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Probably better to clamp tuple width estimates to MaxAllocSize.
>> Anything larger would not correspond to reality anyhow.

> Fair point.  How about the attached patch?

We'd need to hit at least build_joinrel_tlist too.  Not sure
offhand whether this is enough to cover upper-relation tlists.

As far as the specifics go, is it enough to clamp once?  I think
we'd either have to clamp after each addition, or change the
running-sum variables to double and clamp just before storing
into the final width field.  The latter seems probably
less error-prone in the long run.
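
To make that concrete with the numbers from the original report (a standalone
sketch, not planner code): clamping an int32 running sum only at the end is
too late, because the sum has already wrapped around, whereas a double running
sum keeps the true magnitude until the final clamp.

#include <stdint.h>
#include <stdio.h>

#define MaxAllocSize 0x3fffffff     /* 1 gigabyte - 1, as in memutils.h */

int
main(void)
{
    int32_t sum32 = 0;      /* int32 running sum, as the code has today */
    double  sumd = 0;       /* double running sum, as proposed */

    /* 52 columns of ~41943044 bytes each, as in the original report */
    for (int i = 0; i < 52; i++)
    {
        sum32 += 41943044;  /* overflows (formally undefined, typically wraps) */
        sumd += 41943044;
    }

    /* Clamping the int32 sum only at the end cannot undo the wraparound ... */
    printf("%d\n", sum32 > MaxAllocSize ? MaxAllocSize : sum32);
    /* ... while the double sum clamps to MaxAllocSize as intended. */
    printf("%d\n", sumd > MaxAllocSize ? MaxAllocSize : (int32_t) sumd);
    return 0;
}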

Also, given that we'll need at least three copies of the clamp
rule, I wonder if it's worth inventing a function comparable
to clamp_row_est().

            regards, tom lane



Re: BUG #18247: Integer overflow leads to negative width

From
RekGRpth
Date:
If the value itself is not important, only how the values compare with each
other, then maybe these values could be normalized to some dimensionless
scale?

Fri, Dec 15, 2023 at 20:43, Tom Lane <tgl@sss.pgh.pa.us>:
>
> Richard Guo <guofenglinux@gmail.com> writes:
> > On Thu, Dec 14, 2023 at 10:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> Probably better to clamp tuple width estimates to MaxAllocSize.
> >> Anything larger would not correspond to reality anyhow.
>
> > Fair point.  How about the attached patch?
>
> We'd need to hit at least build_joinrel_tlist too.  Not sure
> offhand whether this is enough to cover upper-relation tlists.
>
> As far as the specifics go, is it enough to clamp once?  I think
> we'd either have to clamp after each addition, or change the
> running-sum variables to double and clamp just before storing
> into the final width field.  The latter seems probably
> less error-prone in the long run.
>
> Also, given that we'll need at least three copies of the clamp
> rule, I wonder if it's worth inventing a function comparable
> to clamp_row_est().
>
>                         regards, tom lane



Re: BUG #18247: Integer overflow leads to negative width

From
Richard Guo
Date:

On Fri, Dec 15, 2023 at 11:30 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Richard Guo <guofenglinux@gmail.com> writes:
> On Fri, Dec 15, 2023 at 2:00 PM Alexander Lakhin <exclusion@gmail.com>
>> Your patch looks good to me, but maybe you would find it suitable to fix in
>> passing one more integer overflow in costsize.c?

> Nice catch.  The overflow occurs when cost_bitmap_heap_scan() calls
> compute_bitmap_pages(), and the loop_count parameter is converted from
> double to int.  I wonder if we can change the loop_count parameter to be
> double for compute_bitmap_pages() to avoid such overflow.

However, I'd be a bit uncomfortable about back-patching; since that
function is globally exposed, it's at least possible that some
extension is calling it and would see an ABI break.  Is it good enough
to fix this in HEAD?  I'd argue yes, given that a loop_count larger
than INT_MAX seems like a pretty improbable case.

I agree with you that it's good enough to fix this in HEAD.  The lack of
complaints from the field for so many years suggests that it's not common
to have a loop_count larger than INT_MAX.

Thanks
Richard

Re: BUG #18247: Integer overflow leads to negative width

From
Richard Guo
Date:

On Fri, Dec 15, 2023 at 11:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Richard Guo <guofenglinux@gmail.com> writes:
> On Thu, Dec 14, 2023 at 10:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Probably better to clamp tuple width estimates to MaxAllocSize.
>> Anything larger would not correspond to reality anyhow.

> Fair point.  How about the attached patch?

We'd need to hit at least build_joinrel_tlist too.

Right.  And I think we also need to hit add_placeholders_to_joinrel.
 
  Not sure
offhand whether this is enough to cover upper-relation tlists.

I think we need to do the same to create_one_window_path.  For
intermediate WindowAggPaths, we need to add the WindowFuncs to their
output targets, and that would increase the width of the tlists.

Also, it is possible that there could be tuple-width overflow occurring
within function get_rel_data_width().  The return value of this function
is used to calculate density, or returned by get_relation_data_width().
So I wonder if we could change the return type of get_rel_data_width()
and get_relation_data_width() to be double, and handle the overflow in
the callers if needed.  But this may lead to ABI break.
 
As far as the specifics go, is it enough to clamp once?  I think
we'd either have to clamp after each addition, or change the
running-sum variables to double and clamp just before storing
into the final width field.  The latter seems probably
less error-prone in the long run.

Agreed.
 
Also, given that we'll need at least three copies of the clamp
rule, I wonder if it's worth inventing a function comparable
to clamp_row_est().

Yeah, I think we should do that.

Attached is an updated patch for all the changes.  It also changes the
loop_count parameter to be double for compute_bitmap_pages() in passing.
It does not improve the comments for compute_bitmap_pages() though.

Thanks
Richard
Attachment

Re: BUG #18247: Integer overflow leads to negative width

From
Junwang Zhao
Date:
On Mon, Dec 18, 2023 at 4:44 PM Richard Guo <guofenglinux@gmail.com> wrote:
>
>
> On Fri, Dec 15, 2023 at 11:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> Richard Guo <guofenglinux@gmail.com> writes:
>> > On Thu, Dec 14, 2023 at 10:43 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> >> Probably better to clamp tuple width estimates to MaxAllocSize.
>> >> Anything larger would not correspond to reality anyhow.
>>
>> > Fair point.  How about the attached patch?
>>
>> We'd need to hit at least build_joinrel_tlist too.
>
>
> Right.  And I think we also need to hit add_placeholders_to_joinrel.
>
>>
>>   Not sure
>> offhand whether this is enough to cover upper-relation tlists.
>
>
> I think we need to do the same to create_one_window_path.  For
> intermediate WindowAggPaths, we need to add the WindowFuncs to their
> output targets, and that would increase the width of the tlists.
>
> Also, it is possible that there could be tuple-width overflow occurring
> within function get_rel_data_width().  The return value of this function
> is used to calculate density, or returned by get_relation_data_width().
> So I wonder if we could change the return type of get_rel_data_width()
> and get_relation_data_width() to be double, and handle the overflow in
> the callers if needed.  But this may lead to ABI break.
>
>>
>> As far as the specifics go, is it enough to clamp once?  I think
>> we'd either have to clamp after each addition, or change the
>> running-sum variables to double and clamp just before storing
>> into the final width field.  The latter seems probably
>> less error-prone in the long run.
>
>
> Agreed.
>
>>
>> Also, given that we'll need at least three copies of the clamp
>> rule, I wonder if it's worth inventing a function comparable
>> to clamp_row_est().
>
>
> Yeah, I think we should do that.
>
> Attached is an updated patch for all the changes.  It also changes the
> loop_count parameter to be double for compute_bitmap_pages() in passing.
> It does not improve the comments for compute_bitmap_pages() though.
>
I guess using double for the sum is in case of overflow of int64?
pg_class.relnatts
is of type int16, I guess it's not possible to overflow int64.

Using *int64* for the tuple_width is more intuitive; the
clamp_width_est would then be:

+int
+clamp_width_est(int64 tuple_width)
+{
+    /*
+     * Clamp tuple-width estimate to MaxAllocSize in case it exceeds the limit
+     * or overflows.  Anything larger than MaxAllocSize would not align with
+     * reality.
+     */
+    if (tuple_width > MaxAllocSize)
+        tuple_width = MaxAllocSize;
+
+    return (int) tuple_width;
+}

> Thanks
> Richard



--
Regards
Junwang Zhao



Re: BUG #18247: Integer overflow leads to negative width

From
Tom Lane
Date:
Junwang Zhao <zhjwpku@gmail.com> writes:
> I guess using double for the sum is in case of overflow of int64?
> pg_class.relnatts
> is of type int16, I guess it's not possible to overflow int64.

Not sure what I think about that.  The main advantage of double is
that you definitely don't need to worry about overflow.  It's
probably about the same speed either way on modern hardware.
(On 32-bit hardware it's quite likely that double is faster,
since even 32-bit CPUs tend to have hardware floating point.)
I left it as double for the moment, but we could discuss that
some more.

Some review comments:

* clamp_width_est doesn't need to check isinf, since the
comparison will work just fine.  It does need to check isnan,
unless we switch to int64.  (A quick standalone check of this is
included after these comments.)

* We can fold the Assert about width >= 0 into clamp_width_est
too; it fits naturally there and saves code in the callers.

* I did not like changing the output type of get_rel_data_width;
it seems better to me to have it do clamp_width_est internally.
For one reason, several of the call sites are relying on
doing integer division, which the v2 patch might break.
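
A quick standalone check of the isinf/isnan point (MaxAllocSize value copied
from memutils.h; this is just an illustration, not planner code):

#include <math.h>
#include <stdio.h>

#define MaxAllocSize 0x3fffffff     /* 1 gigabyte - 1 */

int
main(void)
{
    /* Infinity already trips the "> MaxAllocSize" test, so no isinf() ... */
    printf("%d\n", INFINITY > (double) MaxAllocSize);   /* prints 1 */

    /* ... but every comparison involving NaN is false, so NaN would slip
     * past the clamp without an explicit isnan() check. */
    printf("%d\n", NAN > (double) MaxAllocSize);        /* prints 0 */
    printf("%d\n", isnan(NAN) != 0);                    /* prints 1 */
    return 0;
}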

The attached v3 fixes those things and makes some cosmetic
changes (mostly comments).

            regards, tom lane

diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index 5227346aeb..630b445546 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -218,6 +218,29 @@ clamp_row_est(double nrows)
     return nrows;
 }

+/*
+ * clamp_width_est
+ *        Force a tuple-width estimate to a sane value.
+ */
+int32
+clamp_width_est(double tuple_width)
+{
+    /*
+     * Anything more than MaxAllocSize is clearly bogus, since we could not
+     * create a tuple that large.  Treat NaN like infinity.
+     */
+    if (tuple_width > MaxAllocSize || isnan(tuple_width))
+        return (int32) MaxAllocSize;
+
+    /*
+     * Unlike clamp_row_est, we just Assert that the value isn't negative,
+     * rather than masking such errors.
+     */
+    Assert(tuple_width >= 0);
+
+    return (int32) tuple_width;
+}
+
 /*
  * clamp_cardinality_to_long
  *        Cast a Cardinality value to a sane long value.
@@ -6101,7 +6124,7 @@ static void
 set_rel_width(PlannerInfo *root, RelOptInfo *rel)
 {
     Oid            reloid = planner_rt_fetch(rel->relid, root)->relid;
-    int32        tuple_width = 0;
+    double        tuple_width = 0;
     bool        have_wholerow_var = false;
     ListCell   *lc;

@@ -6213,7 +6236,7 @@ set_rel_width(PlannerInfo *root, RelOptInfo *rel)
      */
     if (have_wholerow_var)
     {
-        int32        wholerow_width = MAXALIGN(SizeofHeapTupleHeader);
+        double        wholerow_width = MAXALIGN(SizeofHeapTupleHeader);

         if (reloid != InvalidOid)
         {
@@ -6230,7 +6253,7 @@ set_rel_width(PlannerInfo *root, RelOptInfo *rel)
                 wholerow_width += rel->attr_widths[i - rel->min_attr];
         }

-        rel->attr_widths[0 - rel->min_attr] = wholerow_width;
+        rel->attr_widths[0 - rel->min_attr] = clamp_width_est(wholerow_width);

         /*
          * Include the whole-row Var as part of the output tuple.  Yes, that
@@ -6239,8 +6262,7 @@ set_rel_width(PlannerInfo *root, RelOptInfo *rel)
         tuple_width += wholerow_width;
     }

-    Assert(tuple_width >= 0);
-    rel->reltarget->width = tuple_width;
+    rel->reltarget->width = clamp_width_est(tuple_width);
 }

 /*
@@ -6258,7 +6280,7 @@ set_rel_width(PlannerInfo *root, RelOptInfo *rel)
 PathTarget *
 set_pathtarget_cost_width(PlannerInfo *root, PathTarget *target)
 {
-    int32        tuple_width = 0;
+    double        tuple_width = 0;
     ListCell   *lc;

     /* Vars are assumed to have cost zero, but other exprs do not */
@@ -6282,8 +6304,7 @@ set_pathtarget_cost_width(PlannerInfo *root, PathTarget *target)
         }
     }

-    Assert(tuple_width >= 0);
-    target->width = tuple_width;
+    target->width = clamp_width_est(tuple_width);

     return target;
 }
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index a8cea5efe1..80818776e8 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -4610,6 +4610,7 @@ create_one_window_path(PlannerInfo *root,
              * Note: a WindowFunc adds nothing to the target's eval costs; but
              * we do need to account for the increase in tlist width.
              */
+            double        tuple_width = window_target->width;
             ListCell   *lc2;

             window_target = copy_pathtarget(window_target);
@@ -4618,8 +4619,9 @@ create_one_window_path(PlannerInfo *root,
                 WindowFunc *wfunc = lfirst_node(WindowFunc, lc2);

                 add_column_to_pathtarget(window_target, (Expr *) wfunc, 0);
-                window_target->width += get_typavgwidth(wfunc->wintype, -1);
+                tuple_width += get_typavgwidth(wfunc->wintype, -1);
             }
+            window_target->width = clamp_width_est(tuple_width);
         }
         else
         {
diff --git a/src/backend/optimizer/util/placeholder.c b/src/backend/optimizer/util/placeholder.c
index b1723578e6..5d7db4e94f 100644
--- a/src/backend/optimizer/util/placeholder.c
+++ b/src/backend/optimizer/util/placeholder.c
@@ -375,6 +375,7 @@ add_placeholders_to_joinrel(PlannerInfo *root, RelOptInfo *joinrel,
                             SpecialJoinInfo *sjinfo)
 {
     Relids        relids = joinrel->relids;
+    double        tuple_width = joinrel->reltarget->width;
     ListCell   *lc;

     foreach(lc, root->placeholder_list)
@@ -419,7 +420,7 @@ add_placeholders_to_joinrel(PlannerInfo *root, RelOptInfo *joinrel,
                     cost_qual_eval_node(&cost, (Node *) phv->phexpr, root);
                     joinrel->reltarget->cost.startup += cost.startup;
                     joinrel->reltarget->cost.per_tuple += cost.per_tuple;
-                    joinrel->reltarget->width += phinfo->ph_width;
+                    tuple_width += phinfo->ph_width;
                 }
             }

@@ -443,6 +444,8 @@ add_placeholders_to_joinrel(PlannerInfo *root, RelOptInfo *joinrel,
                                 phinfo->ph_lateral);
         }
     }
+
+    joinrel->reltarget->width = clamp_width_est(tuple_width);
 }

 /*
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index 7159c775fb..77478738ec 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -1137,7 +1137,7 @@ estimate_rel_size(Relation rel, int32 *attr_widths,
 int32
 get_rel_data_width(Relation rel, int32 *attr_widths)
 {
-    int32        tuple_width = 0;
+    double        tuple_width = 0;
     int            i;

     for (i = 1; i <= RelationGetNumberOfAttributes(rel); i++)
@@ -1167,7 +1167,7 @@ get_rel_data_width(Relation rel, int32 *attr_widths)
         tuple_width += item_width;
     }

-    return tuple_width;
+    return clamp_width_est(tuple_width);
 }

 /*
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index 5d83f60eb9..1b35b08c75 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -22,6 +22,7 @@
 #include "optimizer/clauses.h"
 #include "optimizer/cost.h"
 #include "optimizer/inherit.h"
+#include "optimizer/optimizer.h"
 #include "optimizer/pathnode.h"
 #include "optimizer/paths.h"
 #include "optimizer/placeholder.h"
@@ -1092,6 +1093,7 @@ build_joinrel_tlist(PlannerInfo *root, RelOptInfo *joinrel,
                     bool can_null)
 {
     Relids        relids = joinrel->relids;
+    double        tuple_width = joinrel->reltarget->width;
     ListCell   *vars;
     ListCell   *lc;

@@ -1144,7 +1146,7 @@ build_joinrel_tlist(PlannerInfo *root, RelOptInfo *joinrel,
                 joinrel->reltarget->exprs = lappend(joinrel->reltarget->exprs,
                                                     phv);
                 /* Bubbling up the precomputed result has cost zero */
-                joinrel->reltarget->width += phinfo->ph_width;
+                tuple_width += phinfo->ph_width;
             }
             continue;
         }
@@ -1165,7 +1167,7 @@ build_joinrel_tlist(PlannerInfo *root, RelOptInfo *joinrel,
                 list_nth(root->row_identity_vars, var->varattno - 1);

             /* Update reltarget width estimate from RowIdentityVarInfo */
-            joinrel->reltarget->width += ridinfo->rowidwidth;
+            tuple_width += ridinfo->rowidwidth;
         }
         else
         {
@@ -1181,7 +1183,7 @@ build_joinrel_tlist(PlannerInfo *root, RelOptInfo *joinrel,
                 continue;        /* nope, skip it */

             /* Update reltarget width estimate from baserel's attr_widths */
-            joinrel->reltarget->width += baserel->attr_widths[ndx];
+            tuple_width += baserel->attr_widths[ndx];
         }

         /*
@@ -1221,6 +1223,8 @@ build_joinrel_tlist(PlannerInfo *root, RelOptInfo *joinrel,

         /* Vars have cost zero, so no need to adjust reltarget->cost */
     }
+
+    joinrel->reltarget->width = clamp_width_est(tuple_width);
 }

 /*
diff --git a/src/include/optimizer/optimizer.h b/src/include/optimizer/optimizer.h
index 6e8b81c51d..68c1443183 100644
--- a/src/include/optimizer/optimizer.h
+++ b/src/include/optimizer/optimizer.h
@@ -90,6 +90,7 @@ extern PGDLLIMPORT double recursive_worktable_factor;
 extern PGDLLIMPORT int effective_cache_size;

 extern double clamp_row_est(double nrows);
+extern int32 clamp_width_est(double tuple_width);
 extern long clamp_cardinality_to_long(Cardinality x);

 /* in path/indxpath.c: */

Re: BUG #18247: Integer overflow leads to negative width

From
Richard Guo
Date:

On Tue, Dec 19, 2023 at 5:00 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Some review comments:

* clamp_width_est doesn't need to check isinf, since the
comparison will work just fine.  It does need to check isnan,
unless we switch to int64.

* We can fold the Assert about width >= 0 into clamp_width_est
too; it fits naturally there and saves code in the callers.

* I did not like changing the output type of get_rel_data_width;
it seems better to me to have it do clamp_width_est internally.
For one reason, several of the call sites are relying on
doing integer division, which the v2 patch might break.

The attached v3 fixes those things and makes some cosmetic
changes (mostly comments).

I agree with all the comments made.  I have also examined other places
where the width field is assigned, and I think we've covered all cases.
So the v3 patch is in good shape to me.

Thanks
Richard

Re: BUG #18247: Integer overflow leads to negative width

From
Tom Lane
Date:
Richard Guo <guofenglinux@gmail.com> writes:
> I agree with all the comments made.  I have also examined other places
> where the width field is assigned, and I think we've covered all cases.
> So the v3 patch is in good shape to me.

Thanks for looking!  Do you have an opinion about the int64-vs-double
question?

            regards, tom lane



Re: BUG #18247: Integer overflow leads to negative width

From
Richard Guo
Date:

On Tue, Dec 19, 2023 at 12:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Richard Guo <guofenglinux@gmail.com> writes:
> I agree with all the comments made.  I have also examined other places
> where the width field is assigned, and I think we've covered all cases.
> So the v3 patch is in good shape to me.

Thanks for looking!  Do you have an opinion about the int64-vs-double
question?

To be honest, I don't have a preference on which one is better.  I think
double is good enough for now as we don't need to worry about overflow
with it.

Thanks
Richard

Re: BUG #18247: Integer overflow leads to negative width

From
Tom Lane
Date:
Richard Guo <guofenglinux@gmail.com> writes:
> On Tue, Dec 19, 2023 at 12:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Thanks for looking!  Do you have an opinion about the int64-vs-double
>> question?

> To be honest, I don't have a preference on which one is better.  I think
> double is good enough for now as we don't need to worry about overflow
> with it.

After sleeping on it, I'm coming around to the idea that int64 will
be better.  The argument that convinces me is that using int64
provides a datatype-based clue that we are working with a width
and not a row count, cost, or selectivity number.  I don't feel
a need to go as far as invent a typedef alias like Cardinality;
but plain "double" in the planner tends to be a rowcount estimate,
which is not what we want people to think of.

I'll make that change and push it.
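
Concretely, the int64 flavor should end up as just the v3 function minus the
double-specific NaN handling, with the running-sum variables switched from
double to int64 to match; roughly (a sketch, the committed wording may
differ):

int32
clamp_width_est(int64 tuple_width)
{
    /*
     * Anything more than MaxAllocSize is clearly bogus, since we could not
     * create a tuple that large.
     */
    if (tuple_width > MaxAllocSize)
        return (int32) MaxAllocSize;

    /*
     * Unlike clamp_row_est, we just Assert that the value isn't negative,
     * rather than masking such errors.
     */
    Assert(tuple_width >= 0);

    return (int32) tuple_width;
}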

BTW, I think it's sufficient to fix this in HEAD.  The troublesome
example seems quite artificial to me, and we've not heard field
reports suggesting that people have had real problems here.

            regards, tom lane



Re: BUG #18247: Integer overflow leads to negative width

From
Richard Guo
Date:

On Tue, Dec 19, 2023 at 11:22 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Richard Guo <guofenglinux@gmail.com> writes:
> On Tue, Dec 19, 2023 at 12:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Thanks for looking!  Do you have an opinion about the int64-vs-double
>> question?

> To be honest, I don't have a preference on which one is better.  I think
> double is good enough for now as we don't need to worry about overflow
> with it.

After sleeping on it, I'm coming around to the idea that int64 will
be better.  The argument that convinces me is that using int64
provides a datatype-based clue that we are working with a width
and not a row count, cost, or selectivity number.  I don't feel
a need to go as far as invent a typedef alias like Cardinality;
but plain "double" in the planner tends to be a rowcount estimate,
which is not what we want people to think of.

Fair point.
 
I'll make that change and push it.

Thanks for the change and pushing!
 
BTW, I think it's sufficient to fix this in HEAD.  The troublesome
example seems quite artificial to me, and we've not heard field
reports suggesting that people have had real problems here.

Agreed.

Thanks
Richard