Re: array_agg(anyarray) silently produces corrupt results with parallel workers when inputs mix NULL and non-NULL array elements - Mailing list pgsql-bugs

From Tom Lane
Subject Re: array_agg(anyarray) silently produces corrupt results with parallel workers when inputs mix NULL and non-NULL array elements
Date
Msg-id 4102596.1775320901@sss.pgh.pa.us
Whole thread Raw
In response to array_agg(anyarray) silently produces corrupt results with parallel workers when inputs mix NULL and non-NULL array elements  (Dmytro Astapov <dastapov@gmail.com>)
Responses Re: array_agg(anyarray) silently produces corrupt results with parallel workers when inputs mix NULL and non-NULL array elements
List pgsql-bugs
Dmytro Astapov <dastapov@gmail.com> writes:
> array_agg(ARRAY[...]) produces silently corrupted 2-D arrays when the query
> uses parallel partial aggregation and the input arrays contain a mix of
> NULL and non-NULL elements. NULL values appear at the wrong positions in
> the output, and non-NULL values disappear or shift.

Right you are.

> I think that the following two changes to array_agg_array_combine() in
> array_userfuncs.c fix the issue:

> 1. Change the condition guarding the null bitmap block from
>    "if (state2->nullbitmap)" to
>    "if (state1->nullbitmap || state2->nullbitmap)".

> 2. Change the bitmap reallocation size from
>    "state1->aitems + state2->aitems" to
>    "Max(state1->aitems + state2->aitems, newnitems)"
>    to ensure the bitmap is always large enough.

This seems basically right, but I think we could simplify the code
some more: AFAICS the required bitmap size is newnitems, full stop.
The initial-setup path is confused about that too, allocating
newnitems+1 which is pointless.

It also troubled me that there's no checks for integer overflow
when calculating the new sizes.  I believe that the pg_nextpower2_32
bits are okay even with large inputs (we'll end in palloc rejecting
the request size if there's an overflow), but if reqsize or newnitems
overflows it could be bad.

So I end with the attached revised patch, where I also made one
or two cosmetic adjustments like putting the type-comparison checks
next to the dimension comparisons.  Look good to you?

            regards, tom lane

diff --git a/src/backend/utils/adt/array_userfuncs.c b/src/backend/utils/adt/array_userfuncs.c
index 50c81a088df..9d7f67cc188 100644
--- a/src/backend/utils/adt/array_userfuncs.c
+++ b/src/backend/utils/adt/array_userfuncs.c
@@ -24,6 +24,7 @@
 #include "utils/builtins.h"
 #include "utils/datum.h"
 #include "utils/lsyscache.h"
+#include "utils/memutils.h"
 #include "utils/tuplesort.h"
 #include "utils/typcache.h"

@@ -1033,10 +1034,11 @@ array_agg_array_combine(PG_FUNCTION_ARGS)
     }

     /* We only need to combine the two states if state2 has any items */
-    else if (state2->nitems > 0)
+    if (state2->nitems > 0)
     {
         MemoryContext oldContext;
-        int            reqsize = state1->nbytes + state2->nbytes;
+        int            reqsize;
+        int            newnitems;
         int            i;

         /*
@@ -1059,6 +1061,17 @@ array_agg_array_combine(PG_FUNCTION_ARGS)
                          errmsg("cannot accumulate arrays of different dimensionality")));
         }

+        /* Types should match already. */
+        Assert(state1->array_type == state2->array_type);
+        Assert(state1->element_type == state2->element_type);
+
+        /* Calculate new sizes, guarding against overflow. */
+        if (pg_add_s32_overflow(state1->nbytes, state2->nbytes, &reqsize) ||
+            pg_add_s32_overflow(state1->nitems, state2->nitems, &newnitems))
+            ereport(ERROR,
+                    (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+                     errmsg("array size exceeds the maximum allowed (%zu)",
+                            MaxArraySize)));

         oldContext = MemoryContextSwitchTo(state1->mcontext);

@@ -1073,17 +1086,16 @@ array_agg_array_combine(PG_FUNCTION_ARGS)
             state1->data = (char *) repalloc(state1->data, state1->abytes);
         }

-        if (state2->nullbitmap)
+        /* Combine the null bitmaps, if present. */
+        if (state1->nullbitmap || state2->nullbitmap)
         {
-            int            newnitems = state1->nitems + state2->nitems;
-
             if (state1->nullbitmap == NULL)
             {
                 /*
                  * First input with nulls; we must retrospectively handle any
                  * previous inputs by marking all their items non-null.
                  */
-                state1->aitems = pg_nextpower2_32(Max(256, newnitems + 1));
+                state1->aitems = pg_nextpower2_32(Max(256, newnitems));
                 state1->nullbitmap = (uint8 *) palloc((state1->aitems + 7) / 8);
                 array_bitmap_copy(state1->nullbitmap, 0,
                                   NULL, 0,
@@ -1091,17 +1103,17 @@ array_agg_array_combine(PG_FUNCTION_ARGS)
             }
             else if (newnitems > state1->aitems)
             {
-                int            newaitems = state1->aitems + state2->aitems;
-
-                state1->aitems = pg_nextpower2_32(newaitems);
+                state1->aitems = pg_nextpower2_32(newnitems);
                 state1->nullbitmap = (uint8 *)
                     repalloc(state1->nullbitmap, (state1->aitems + 7) / 8);
             }
+            /* This will do the right thing if state2->nullbitmap is NULL: */
             array_bitmap_copy(state1->nullbitmap, state1->nitems,
                               state2->nullbitmap, 0,
                               state2->nitems);
         }

+        /* Finally, combine the data and adjust sizes. */
         memcpy(state1->data + state1->nbytes, state2->data, state2->nbytes);
         state1->nbytes += state2->nbytes;
         state1->nitems += state2->nitems;
@@ -1109,9 +1121,6 @@ array_agg_array_combine(PG_FUNCTION_ARGS)
         state1->dims[0] += state2->dims[0];
         /* remaining dims already match, per test above */

-        Assert(state1->array_type == state2->array_type);
-        Assert(state1->element_type == state2->element_type);
-
         MemoryContextSwitchTo(oldContext);
     }


pgsql-bugs by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: BUG #19449: Massive performance degradation for complex query on Postgres 16+ (few seconds -> multiple hours)
Next
From: Lukas Fittl
Date:
Subject: Re: pg_plan_advice fails when NestLoop outer side is Sort over FunctionScan