Thread: BUG #16784: Server crash in ExecReScanAgg()

BUG #16784: Server crash in ExecReScanAgg()

From: PG Bug reporting form
The following bug has been logged on the website:

Bug reference:      16784
Logged by:          Alexander Lakhin
Email address:      exclusion@gmail.com
PostgreSQL version: 13.1
Operating system:   Ubuntu 20.04
Description:

The following query (borrowed from regression tests):
CREATE TABLE onek (
    unique1     int4,
    unique2     int4,
    two         int4,
    four        int4,
    ten         int4,
    twenty      int4,
    hundred     int4,
    thousand    int4,
    twothousand int4,
    fivethous   int4,
    tenthous    int4,
    odd         int4,
    even        int4,
    stringu1    name,
    stringu2    name,
    string4     name
);
COPY onek FROM '.../src/test/regress/data/onek.data';

SET work_mem='64kB';

select * from (values (1),(2)) v(a) left join lateral (select v.a, four,
ten, count(*) from onek group by cube(four,ten)) s on true order by
v.a,four,ten;

leads to a server crash with the following stacktrace:

Core was generated by `postgres: law regression [local] SELECT'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  __memset_avx2_unaligned_erms () at
../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:259
259     ../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S: No such
file or directory.
(gdb) bt
#0  __memset_avx2_unaligned_erms () at
../sysdeps/x86_64/multiarch/memset-vec-unaligned-erms.S:259
#1  0x000055814662ab37 in memset (__len=<optimized out>, __ch=0,
__dest=<optimized out>)
    at /usr/include/x86_64-linux-gnu/bits/string_fortified.h:71
#2  ExecReScanAgg (node=node@entry=0x558148640c18) at nodeAgg.c:4743
#3  0x000055814660a14d in ExecReScan (node=node@entry=0x558148640c18) at
execAmi.c:265
#4  0x000055814663f0d8 in ExecNestLoop (pstate=0x5581486404b8) at
nodeNestloop.c:152
#5  0x0000558146641448 in ExecProcNode (node=0x5581486404b8) at
../../../src/include/executor/executor.h:248
#6  ExecSort (pstate=0x5581486402a8) at nodeSort.c:108
#7  0x0000558146616042 in ExecProcNode (node=0x5581486402a8) at
../../../src/include/executor/executor.h:248
#8  ExecutePlan (execute_once=<optimized out>, dest=0x558148718af0,
direction=<optimized out>, numberTuples=0, 
    sendTuples=<optimized out>, operation=CMD_SELECT,
use_parallel_mode=<optimized out>, planstate=0x5581486402a8, 
    estate=0x558148640018) at execMain.c:1646
#9  standard_ExecutorRun (queryDesc=0x558148704af8, direction=<optimized
out>, count=0, execute_once=<optimized out>)
    at execMain.c:364
#10 0x0000558146772abc in PortalRunSelect (portal=0x55814867d898,
forward=<optimized out>, count=0, 
    dest=<optimized out>) at pquery.c:912
#11 0x0000558146773dfe in PortalRun (portal=portal@entry=0x55814867d898,
count=count@entry=9223372036854775807, 
    isTopLevel=isTopLevel@entry=true, run_once=run_once@entry=true,
dest=dest@entry=0x558148718af0, 
    altdest=altdest@entry=0x558148718af0, qc=0x7ffd679d6dd0) at
pquery.c:756
#12 0x000055814676f949 in exec_simple_query (
    query_string=0x55814861a538 "select * from (values (1),(2)) v(a) left
join lateral (select v.a, four, ten, count(*) from onek group by
cube(four,ten)) s on true order by v.a,four,ten;") at postgres.c:1239
#13 0x0000558146771a47 in PostgresMain (argc=<optimized out>,
argv=argv@entry=0x558148645928, dbname=<optimized out>, 
    username=<optimized out>) at postgres.c:4315
#14 0x00005581466fc148 in BackendRun (port=0x55814863e0a0) at
postmaster.c:4536
#15 BackendStartup (port=0x55814863e0a0) at postmaster.c:4220
#16 ServerLoop () at postmaster.c:1739
#17 0x00005581466fd08a in PostmasterMain (argc=<optimized out>,
argv=0x558148614970) at postmaster.c:1412
#18 0x0000558146491982 in main (argc=3, argv=0x558148614970) at main.c:210

With `ANALYZE onek;` after COPY the query executes without an error. That's
why the same query in groupingsets.sql doesn't fail.
The first bad commit is 2fd6a44a.


Re: BUG #16784: Server crash in ExecReScanAgg()

From: Tom Lane
PG Bug reporting form <noreply@postgresql.org> writes:
> The following query (borrowed from regression tests):
> ...
> leads to a server crash with the following stacktrace:

Duplicated here.

> The first bad commit is 2fd6a44a.

I suspect the culprit is 1f39bce02 (but it's still Jeff's fault ;-)).
What I see happening is that the second call to ExecReScanAgg
crashes here:

            MemSet(node->pergroups[setno], 0,
                   sizeof(AggStatePerGroupData) * node->numaggs);

because node->pergroups[0] is now NULL, which is the fault of
this bit in agg_refill_hash_table:

    /* there could be residual pergroup pointers; clear them */
    for (int setoff = 0;
         setoff < aggstate->maxsets + aggstate->num_hashes;
         setoff++)
        aggstate->all_pergroups[setoff] = NULL;

I suspect this is clearing the wrong subset of the all_pergroups
pointers, but the code is so underdocumented that I'm not very
sure.

            regards, tom lane
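
[Illustration] The failure mode described above can be modelled in a few lines of standalone C. This is only a toy sketch, not the nodeAgg.c code: the Toy* types and toy_* functions are hypothetical, and the layout (sorted-set slots first, hashed-set slots after them in one pointer array) is an assumption inferred from the loop bound maxsets + num_hashes quoted above. It shows that clearing every slot leaves the rescan step with a NULL pointer, while clearing only the hashed slots does not.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a per-group transition state. */
typedef struct { long trans_value; } ToyPerGroup;

/* Hypothetical stand-in for the aggregate node state. */
typedef struct {
    int maxsets;                 /* number of sorted grouping sets */
    int num_hashes;              /* number of hashed grouping sets */
    int numaggs;                 /* aggregates per group */
    ToyPerGroup **storage;       /* owns the memory, so clearing slots leaks nothing */
    ToyPerGroup **all_pergroups; /* assumed layout: sorted slots, then hashed slots */
} ToyAggState;

static ToyAggState *toy_create(int maxsets, int num_hashes, int numaggs)
{
    int nslots = maxsets + num_hashes;
    ToyAggState *s = malloc(sizeof(*s));
    if (s == NULL) abort();
    s->maxsets = maxsets;
    s->num_hashes = num_hashes;
    s->numaggs = numaggs;
    s->storage = calloc(nslots, sizeof(ToyPerGroup *));
    s->all_pergroups = calloc(nslots, sizeof(ToyPerGroup *));
    if (s->storage == NULL || s->all_pergroups == NULL) abort();
    for (int i = 0; i < nslots; i++) {
        s->storage[i] = calloc(numaggs, sizeof(ToyPerGroup));
        if (s->storage[i] == NULL) abort();
        s->all_pergroups[i] = s->storage[i];
    }
    return s;
}

static void toy_destroy(ToyAggState *s)
{
    for (int i = 0; i < s->maxsets + s->num_hashes; i++)
        free(s->storage[i]);
    free(s->storage);
    free(s->all_pergroups);
    free(s);
}

/* Same shape as the quoted loop: clears sorted and hashed slots alike. */
static void toy_clear_all(ToyAggState *s)
{
    for (int setoff = 0; setoff < s->maxsets + s->num_hashes; setoff++)
        s->all_pergroups[setoff] = NULL;
}

/* Narrower variant: leaves the sorted-set slots alone. */
static void toy_clear_hashed_only(ToyAggState *s)
{
    for (int setoff = s->maxsets; setoff < s->maxsets + s->num_hashes; setoff++)
        s->all_pergroups[setoff] = NULL;
}

/*
 * Rescan step: zero the sorted-set states. Returns -1 where a real
 * memset() on a NULL slot would have crashed, 0 otherwise.
 */
static int toy_rescan(ToyAggState *s)
{
    for (int setno = 0; setno < s->maxsets; setno++) {
        if (s->all_pergroups[setno] == NULL)
            return -1;
        memset(s->all_pergroups[setno], 0, sizeof(ToyPerGroup) * s->numaggs);
    }
    return 0;
}
```

With one sorted set and two hashed sets, calling toy_clear_all() and then toy_rescan() reports the NULL slot, whereas toy_clear_hashed_only() followed by toy_rescan() succeeds.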



Re: BUG #16784: Server crash in ExecReScanAgg()

From: Jeff Davis
Thank you for the report, Alexander!

On Mon, 2020-12-21 at 14:26 -0500, Tom Lane wrote:
>     /* there could be residual pergroup pointers; clear them */
>     for (int setoff = 0;
>          setoff < aggstate->maxsets + aggstate->num_hashes;
>          setoff++)
>         aggstate->all_pergroups[setoff] = NULL;
> 
> I suspect this is clearing the wrong subset of the all_pergroups
> pointers, but the code is so underdocumented that I'm not very
> sure.

That's correct, but there was a (bad) reason it was done that way that
I had to fix first. A null pergroup was used as a signal not to advance
a group that has spilled, but that's only a good solution for the
hashed grouping sets, not the sorted grouping sets (which is what
caused this bug).

Instead, I solved it by simply not compiling the expressions for the
sorted grouping sets, so that agg_refill_hash_table() can leave those
pergroups alone.

Regards,
    Jeff Davis
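
[Illustration] The "null pergroup as a signal" convention mentioned above can be sketched as follows. This is a simplified model under stated assumptions, not the executor's compiled transition expressions: SpillGroup and advance_row() are hypothetical names, and the only aggregate modelled is count(*). A NULL slot means "this group has spilled, do not advance it for the current row".

```c
#include <stddef.h>

/* Hypothetical per-group state holding a running count(*). */
typedef struct { long count; } SpillGroup;

/*
 * Advance count(*) for one input row across nsets grouping sets.
 * A NULL slot is treated as the "group has spilled" signal and is
 * skipped. Returns how many sets were actually advanced.
 */
static int advance_row(SpillGroup **pergroups, int nsets)
{
    int advanced = 0;

    for (int setno = 0; setno < nsets; setno++) {
        if (pergroups[setno] == NULL)
            continue;           /* spilled: leave this set alone */
        pergroups[setno]->count++;
        advanced++;
    }
    return advanced;
}
```

The convention works only if every slot that may be NULL is checked before use. In this model the sorted-set slots are expected to stay valid across a rescan, which is why nulling them as well was a problem.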