Thread: BUG #18360: Invalid memory access occurs when using geqo

BUG #18360: Invalid memory access occurs when using geqo

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      18360
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 16.2
Operating system:   Ubuntu 22.04
Description:

The following script:
SET enable_partitionwise_join = on;
SET geqo_threshold = 2;

CREATE TABLE t (i int) PARTITION BY LIST (i);
CREATE TABLE tp1 PARTITION OF t FOR VALUES IN (1);
SELECT t1.* FROM t t1 WHERE EXISTS (SELECT 1 FROM t t2 WHERE t1.i = t2.i);

leads to a server crash with the following stack trace:
Core was generated by `postgres: law regression [local] SELECT
                        '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000055a4a06b0ecd in create_unique_plan (root=0x55a4a1ae1ba8,
best_path=0x55a4a1ae8ca0, flags=2) at createplan.c:1747
1747                   Expr       *uniqexpr = lfirst(l);
(gdb) bt
#0  0x000055a4a06b0ecd in create_unique_plan (root=0x55a4a1ae1ba8,
best_path=0x55a4a1ae8ca0, flags=2) at createplan.c:1747
#1  0x000055a4a06aedb2 in create_plan_recurse (root=0x55a4a1ae1ba8,
best_path=0x55a4a1ae8ca0, flags=2) at createplan.c:477
#2  0x000055a4a06b7309 in create_hashjoin_plan (root=0x55a4a1ae1ba8,
best_path=0x55a4a1af8e40) at createplan.c:4734
#3  0x000055a4a06afa93 in create_join_plan (root=0x55a4a1ae1ba8,
best_path=0x55a4a1af8e40) at createplan.c:1072
#4  0x000055a4a06aebdc in create_plan_recurse (root=0x55a4a1ae1ba8,
best_path=0x55a4a1af8e40, flags=1) at createplan.c:416
#5  0x000055a4a06aff84 in create_append_plan (root=0x55a4a1ae1ba8,
best_path=0x55a4a1af53d8, flags=1) at createplan.c:1288
#6  0x000055a4a06aebfb in create_plan_recurse (root=0x55a4a1ae1ba8,
best_path=0x55a4a1af53d8, flags=1) at createplan.c:420
#7  0x000055a4a06aeaa8 in create_plan (root=0x55a4a1ae1ba8,
best_path=0x55a4a1af53d8) at createplan.c:347
#8  0x000055a4a06c34dd in standard_planner (parse=0x55a4a19f3f50,
query_string=0x55a4a19f2a08 "SELECT t1.* FROM t t1 WHERE EXISTS (SELECT 1
FROM t t2 WHERE t1.i = t2.i);", cursorOptions=2048, boundParams=0x0) at
planner.c:420
#9  0x000055a4a06c31b2 in planner (parse=0x55a4a19f3f50,
query_string=0x55a4a19f2a08 "SELECT t1.* FROM t t1 WHERE EXISTS (SELECT 1
FROM t t2 WHERE t1.i = t2.i);", cursorOptions=2048, boundParams=0x0) at
planner.c:281
#10 0x000055a4a0812e93 in pg_plan_query (querytree=0x55a4a19f3f50,
query_string=0x55a4a19f2a08 "SELECT t1.* FROM t t1 WHERE EXISTS (SELECT 1
FROM t t2 WHERE t1.i = t2.i);", cursorOptions=2048, boundParams=0x0) at
postgres.c:904
#11 0x000055a4a0812fed in pg_plan_queries (querytrees=0x55a4a1acb308,
query_string=0x55a4a19f2a08 "SELECT t1.* FROM t t1 WHERE EXISTS (SELECT 1
FROM t t2 WHERE t1.i = t2.i);", cursorOptions=2048, boundParams=0x0) at
postgres.c:996
#12 0x000055a4a0813402 in exec_simple_query (query_string=0x55a4a19f2a08
"SELECT t1.* FROM t t1 WHERE EXISTS (SELECT 1 FROM t t2 WHERE t1.i =
t2.i);") at postgres.c:1193
#13 0x000055a4a0818600 in PostgresMain (dbname=0x55a4a1a2db78 "regression",
username=0x55a4a19ef068 "law") at postgres.c:4637
#14 0x000055a4a073933b in BackendRun (port=0x55a4a1a25dc0) at
postmaster.c:4464
#15 0x000055a4a0738bc7 in BackendStartup (port=0x55a4a1a25dc0) at
postmaster.c:4192

(gdb) p best_path->uniq_exprs
$1 = (List *) 0x55a4a1aedca0
(gdb) p *best_path->uniq_exprs
$2 = {type = 2139062143, length = 2139062143, max_length = 2139062143,
elements = 0x7f7f7f7f7f7f7f7f, 
  initial_elements = 0x55a4a1aedcb8}

Valgrind detects an invalid read, with the following diagnostics:
2024-02-23 15:13:54.387 MSK|law|regression|65d88bd7.2d9b63|STATEMENT:
SELECT t1.* FROM t t1 WHERE EXISTS (SELECT 1 FROM t t2 WHERE t1.i = t2.i);
==00:00:00:47.886 2988899== Invalid read of size 4
==00:00:00:47.887 2988899==    at 0x4EF847: create_unique_plan
(createplan.c:1745)
==00:00:00:47.887 2988899==    by 0x4EF5D8: create_plan_recurse
(createplan.c:477)
==00:00:00:47.887 2988899==    by 0x4F07F6: create_hashjoin_plan
(createplan.c:4734)
...
==00:00:00:47.887 2988899==  Address 0x10b16524 is 5,540 bytes inside a
recently re-allocated block of size 8,192 alloc'd
==00:00:00:47.887 2988899==    at 0x4848899: malloc (in
/usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==00:00:00:47.887 2988899==    by 0x762646: AllocSetContextCreateInternal
(aset.c:438)
==00:00:00:47.887 2988899==    by 0x4C7DD1: geqo_eval (geqo_eval.c:75)
==00:00:00:47.887 2988899==    by 0x4C829D: random_init_pool
(geqo_pool.c:109)
==00:00:00:47.887 2988899==    by 0x4C7FB4: geqo (geqo_main.c:114)
==00:00:00:47.887 2988899==    by 0x4CD14D: make_rel_from_joinlist
(allpaths.c:3383)
==00:00:00:47.887 2988899==    by 0x4CD226: make_one_rel (allpaths.c:229)
==00:00:00:47.887 2988899==    by 0x4F7878: query_planner (planmain.c:278)
==00:00:00:47.887 2988899==    by 0x4FF6C3: grouping_planner
(planner.c:1495)
==00:00:00:47.887 2988899==    by 0x501084: subquery_planner
(planner.c:1064)
==00:00:00:47.887 2988899==    by 0x501680: standard_planner
(planner.c:413)
==00:00:00:47.887 2988899==    by 0x501C4A: planner (planner.c:281)

It looks like uniq_exprs points to memory located in a short-lived context
created in geqo_eval().


Re: BUG #18360: Invalid memory access occurs when using geqo

From
Tom Lane
Date:
PG Bug reporting form <noreply@postgresql.org> writes:
> The following script:
> SET enable_partitionwise_join = on;
> SET geqo_threshold = 2;

> CREATE TABLE t (i int) PARTITION BY LIST (i);
> CREATE TABLE tp1 PARTITION OF t FOR VALUES IN (1);
> SELECT t1.* FROM t t1 WHERE EXISTS (SELECT 1 FROM t t2 WHERE t1.i = t2.i);

> leads to a server crash with the following stack trace:

Yup, reproduces here.

> It looks like uniq_exprs points to memory located in a short-lived context
> created in geqo_eval().

Indeed.  Curiously, the in_operators list, which I thought was
parallel to that, seems fine.  Anyway, something's being careless
about which context it creates that data structure in.  Shouldn't
be too hard to fix.

I wonder whether we need a debugging mode that frees path detritus
under the same rules as GEQO does, even for single-relation queries.
This sort of problem can escape notice for a long time.

            regards, tom lane



Re: BUG #18360: Invalid memory access occurs when using geqo

From
Tom Lane
Date:
I wrote:
> Indeed.  Curiously, the in_operators list, which I thought was
> parallel to that, seems fine.  Anyway, something's being careless
> about which context it creates that data structure in.  Shouldn't
> be too hard to fix.

The problem seems to arise from build_child_join_sjinfo(), which
makes a translated version of the semi_rhs_exprs that is later
propagated into a UniquePath for a base relation.  This breaks
GEQO's intention that base-relation structs will be long-lived
while only join-relation data is short-lived.  (in_operators
is not modified so the original long-lived list is used for that,
explaining why it's not trashed at the same time.)

The simplest fix is as attached: just do a quick copyObject
in create_unique_path.  That's rather ugly, but create_unique_path
is already taking explicit responsibility for the context that the
Path is built in, so it doesn't quite exceed my threshold of pain.
(The alternative of making build_child_join_sjinfo force its output
to be long-lived doesn't look good: that gets invoked quite a few
times during a GEQO cycle.)  I chose to make it copy the in_operators
list as well, even though that's not minimally necessary to fix
the bug --- it would look odd if we didn't, and the copy is pretty
cheap since it's only a short integer list.

            regards, tom lane

diff --git a/src/backend/optimizer/util/pathnode.c b/src/backend/optimizer/util/pathnode.c
index b3902b0032..6f79b2e3fe 100644
--- a/src/backend/optimizer/util/pathnode.c
+++ b/src/backend/optimizer/util/pathnode.c
@@ -1707,8 +1707,13 @@ create_unique_path(PlannerInfo *root, RelOptInfo *rel, Path *subpath,
     pathnode->path.pathkeys = NIL;

     pathnode->subpath = subpath;
-    pathnode->in_operators = sjinfo->semi_operators;
-    pathnode->uniq_exprs = sjinfo->semi_rhs_exprs;
+
+    /*
+     * Under GEQO, the sjinfo might be short-lived, so we'd better make copies
+     * of data structures we extract from it.
+     */
+    pathnode->in_operators = copyObject(sjinfo->semi_operators);
+    pathnode->uniq_exprs = copyObject(sjinfo->semi_rhs_exprs);

     /*
      * If the input is a relation and it has a unique index that proves the