
From: Dmytro Astapov
Subject: array_agg(anyarray) silently produces corrupt results with parallel workers when inputs mix NULL and non-NULL array elements
Msg-id: CAFQUnFj2pQ1HbGp69+w2fKqARSfGhAi9UOb+JjyExp7kx3gsqA@mail.gmail.com
List: pgsql-bugs
PostgreSQL version: 17.2 (also verified on 17.9 and 18.3)
Operating system:   Linux x86_64 (Red Hat 8)

Short description
-----------------

array_agg(ARRAY[...]) produces silently corrupted 2-D arrays when the query uses parallel partial aggregation and the input arrays contain a mix of NULL and non-NULL elements. NULL values appear at the wrong positions in the output, and non-NULL values disappear or shift.

The corruption is non-deterministic: two identical queries against the same data can return different results.

Disabling parallelism with SET max_parallel_workers_per_gather = 0 eliminates the problem.

How to reproduce
----------------

The bug requires the planner to choose a Partial GroupAggregate (or Partial HashAggregate) plan with parallel workers. This typically needs a table large enough for the planner to prefer parallel aggregation (I used ~10M rows in my tests).

-- Create and populate a test table with ~10M rows.
-- Three NULL patterns in m1/m2/m3 columns:
--   ~68% rows: m1=val, m2=val, m3='' (no NULLs)
--   ~19% rows: m1=NULL, m2=NULL, m3=NULL (all NULL)
--   ~13% rows: m1=val, m2='', m3=val (no NULLs)

CREATE TABLE test_data (
    group_id    int NOT NULL,
    branch_name text NOT NULL,
    branch_type text NOT NULL,
    m1          text,
    m2          text,
    m3          text
);

INSERT INTO test_data
SELECT
    (i / 4),
    'BRANCH_' || i,
    'Type' || (i % 3),
    CASE WHEN i%100<68 THEN 'val_'||i WHEN i%100<87 THEN NULL ELSE 'val_'||i END,
    CASE WHEN i%100<68 THEN 'v2_'||i  WHEN i%100<87 THEN NULL ELSE '' END,
    CASE WHEN i%100<68 THEN ''        WHEN i%100<87 THEN NULL ELSE 'v3_'||i END
FROM generate_series(1, 2000000) AS g(i);

-- Replicate to ~10M rows so the planner picks a parallel plan.
INSERT INTO test_data
SELECT group_id + 500001, branch_name || '_r' || r, branch_type, m1, m2, m3
FROM test_data, generate_series(1, 4) r;

ANALYZE test_data;

-- Verify the plan uses Partial GroupAggregate with parallel workers:
SET max_parallel_workers_per_gather = 4;
EXPLAIN (COSTS OFF)
SELECT group_id % 100000 AS gid, array_agg(ARRAY[m1, m2, m3]) AS m1m2m3s
FROM test_data GROUP BY group_id % 100000;

-- Expected plan:
--   Finalize GroupAggregate
--     ->  Gather Merge
--           Workers Planned: 4
--           ->  Partial GroupAggregate
--                 ->  Sort
--                       ->  Parallel Seq Scan on test_data

-- Create ground truth (no parallelism):
SET max_parallel_workers_per_gather = 0;
CREATE TABLE gt AS
SELECT group_id % 100000 AS gid,
       array_agg(ARRAY[m1, m2, m3]) AS m1m2m3s
FROM test_data GROUP BY group_id % 100000;

-- Create parallel result:
SET max_parallel_workers_per_gather = 4;
CREATE TABLE pr AS
SELECT group_id % 100000 AS gid,
       array_agg(ARRAY[m1, m2, m3]) AS m1m2m3s
FROM test_data GROUP BY group_id % 100000;

-- Order-independent comparison (eliminates row-ordering differences,
-- detects only value corruption):
SELECT count(*) AS corrupted_groups
FROM (
    SELECT gid, array_agg(COALESCE(v, '!!NULL!!') ORDER BY v) AS sv
    FROM gt, unnest(m1m2m3s) v GROUP BY gid
) a
JOIN (
    SELECT gid, array_agg(COALESCE(v, '!!NULL!!') ORDER BY v) AS sv
    FROM pr, unnest(m1m2m3s) v GROUP BY gid
) b ON a.gid = b.gid
WHERE a.sv != b.sv;

On every run I have tested, corrupted_groups is in the range of 4000-6000 out of 100000 total groups (~5%). Results differ between runs, confirming non-determinism.

I verified this on three versions: REL_17_2, REL_17_9, REL_18_3. All three versions were built with: ./configure --enable-debug --enable-cassert


Root cause
----------

The bug appears to be in array_agg_array_combine() in src/backend/utils/adt/array_userfuncs.c.

The combine function is used during parallel aggregation of array_agg(anyarray).
It was introduced in commit 16fd03e9565 ("Allow parallel aggregate on string_agg and array_agg", 2023-01-23), first shipped in PG 16.

When two partial aggregation states are combined, array_agg_array_combine must merge their null bitmaps. The current code only enters the bitmap-handling block when state2 (the incoming partial state) has a nullbitmap:

    if (state2->nullbitmap)
    {
        ...
    }

I think this misses the case where state1 (the running state) already has a nullbitmap but state2 does not. In that scenario, state2's data bytes are appended to state1's data buffer and state1->nitems is incremented, but the nullbitmap is NOT extended to cover state2's items. The bitmap bits for state2's items are never written and hold whatever uninitialized memory happens to be there, which randomly marks some of those elements as NULL. That in turn shifts the interpretation of the data buffer and corrupts the output.

For comparison, the non-parallel accumArrayResultArr() in arrayfuncs.c has this condition:

    if (astate->nullbitmap || ARR_HASNULL(arg))

which enters the bitmap block whenever EITHER the existing state has a bitmap OR the new input has NULLs.

This bug triggers when parallel workers split a group's rows such that one worker sees only NULL-containing arrays (building a state with a nullbitmap) and another sees only non-NULL arrays (no nullbitmap), and the combine function is called with the NULL-containing state as state1 and the non-NULL state as state2. Since row distribution across workers is non-deterministic, the corruption is too.

Fixing this condition would require a second, related change in the same function. When extending the bitmap, the code computes:

    int newaitems = state1->aitems + state2->aitems;

With the corrected condition (state1->nullbitmap || state2->nullbitmap), state2->aitems can now be 0 (no bitmap was allocated for state2). This makes newaitems equal to state1->aitems, which may be less than newnitems (state1->nitems + state2->nitems).

The subsequent pg_nextpower2_32(newaitems) then allocates a bitmap that is too small, and array_bitmap_copy() writes past the end of it.

This can be verified with an --enable-cassert build: running the reproduction steps produces "problem in alloc set ExprContext: req size > alloc size" warnings.

Fix
---

I think that the following two changes to array_agg_array_combine() in array_userfuncs.c fix the issue:

1. Change the condition guarding the null bitmap block from
   "if (state2->nullbitmap)" to
   "if (state1->nullbitmap || state2->nullbitmap)".

2. Change the bitmap reallocation size from
   "state1->aitems + state2->aitems" to
   "Max(state1->aitems + state2->aitems, newnitems)"
   to ensure the bitmap is always large enough.

Patch (applies cleanly to REL_16_STABLE through REL_18_STABLE and
master); I am also including the same patch as an attachment:

--- a/src/backend/utils/adt/array_userfuncs.c
+++ b/src/backend/utils/adt/array_userfuncs.c
@@ -997,7 +997,7 @@
  state1->data = (char *) repalloc(state1->data, state1->abytes);
  }
 
- if (state2->nullbitmap)
+ if (state1->nullbitmap || state2->nullbitmap)
  {
  int newnitems = state1->nitems + state2->nitems;
 
@@ -1015,7 +1015,8 @@
  }
  else if (newnitems > state1->aitems)
  {
- int newaitems = state1->aitems + state2->aitems;
+ int newaitems = Max(state1->aitems + state2->aitems,
+   newnitems);
 
  state1->aitems = pg_nextpower2_32(newaitems);
  state1->nullbitmap = (bits8 *)

I have verified that this patch eliminates the corruption on all three
versions tested (17.2, 17.9, 18.3): the corrupted_groups count drops
from ~5000 to 0 in every run.


Best regards, Dmytro
