Thread: BUG #17619: AllocSizeIsValid violation in parallel hash join

BUG #17619: AllocSizeIsValid violation in parallel hash join

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      17619
Logged by:          Dmitry Astapov
Email address:      dastapov@gmail.com
PostgreSQL version: 13.8
Operating system:   Linux (CentOS 7)
Description:

This is what the issue looks like from the user's perspective (it happens with
both "explain analyze" and plain "select"):

# explain analyze select n, id, rhs from lhs left join (select id, date1,
date2, source,payload from rhs) as rhs using (id) where n > bigint
E'3240754568';
ERROR:  XX000: invalid memory alloc request size 1702125924
LOCATION:  MemoryContextAlloc, mcxt.c:804

The query plan (as shown by explain) is rather simple:
# explain select n, id, rhs from lhs left join (select id, date1, date2,
source,payload from rhs) as rhs using (id) where n > bigint E'3240754568';
                                       QUERY PLAN
─────────────────────────────────────────────────────────────────────────────────────────
 Gather  (cost=186066.70..198955.08 rows=13 width=52)
   Workers Planned: 2
   ->  Parallel Hash Left Join  (cost=185066.70..197953.78 rows=5 width=52)
         Hash Cond: (lhs.id = rhs.id)
         ->  Parallel Seq Scan on lhs  (cost=0.00..12887.04 rows=5 width=20)
               Filter: (n > '3240754568'::bigint)
         ->  Parallel Hash  (cost=180233.53..180233.53 rows=386653 width=44)
               ->  Parallel Seq Scan on rhs  (cost=0.00..180233.53 rows=386653 width=44)


I managed to add "print pid and sleep" into that function and attach gdb,
and this is the backtrace I got:

(gdb) bt
#0  0x00007fbb83da59e0 in __nanosleep_nocancel () from /lib64/libc.so.6
#1  0x00007fbb83da5894 in sleep () from /lib64/libc.so.6
#2  0x0000000000926584 in MemoryContextAlloc (context=0x134d190,
size=size@entry=1702125924) at mcxt.c:806
#3  0x000000000092cd9d in sts_read_tuple (meta_data=0x7ffc88d1264c,
accessor=0x1360e00) at sharedtuplestore.c:450
#4  sts_parallel_scan_next (accessor=accessor@entry=0x1360e00,
meta_data=meta_data@entry=0x7ffc88d1264c) at sharedtuplestore.c:530
#5  0x0000000000682911 in ExecParallelHashJoinNewBatch (hjstate=<optimized
out>, hjstate=<optimized out>) at nodeHashjoin.c:1153
#6  ExecHashJoinImpl (parallel=true, pstate=0x13112a0) at
nodeHashjoin.c:560
#7  ExecParallelHashJoin (pstate=0x13112a0) at nodeHashjoin.c:607
#8  0x0000000000665f88 in ExecProcNodeInstr (node=0x13112a0) at
execProcnode.c:466
#9  0x000000000067b8a4 in ExecProcNode (node=0x13112a0) at
../../../src/include/executor/executor.h:248
#10 gather_getnext (gatherstate=0x1311098) at nodeGather.c:295
#11 ExecGather (pstate=0x1311098) at nodeGather.c:227
#12 0x0000000000665f88 in ExecProcNodeInstr (node=0x1311098) at
execProcnode.c:466
#13 0x000000000065e542 in ExecProcNode (node=0x1311098) at
../../../src/include/executor/executor.h:248
#14 ExecutePlan (execute_once=<optimized out>, dest=0xa79fe0 <donothingDR>,
direction=<optimized out>, numberTuples=0, sendTuples=true,
operation=CMD_SELECT, use_parallel_mode=<optimized out>,
planstate=0x1311098, estate=0x1310e10) at execMain.c:1632
#15 standard_ExecutorRun (queryDesc=0x130f980, direction=<optimized out>,
count=0, execute_once=<optimized out>) at execMain.c:350
#16 0x00000000005f281f in ExplainOnePlan
(plannedstmt=plannedstmt@entry=0x13056d8, into=into@entry=0x0,
es=es@entry=0x12f63b0, 
    queryString=queryString@entry=0x122daa0 "explain analyze select n, id,
rhs from lhs left join (select id, date1, date2, source,payload from rhs) as
rhs using (id) where n > bigint E'3240754568';", params=params@entry=0x0,
queryEnv=queryEnv@entry=0x0, 
    planduration=planduration@entry=0x7ffc88d12820, bufusage=0x0) at
explain.c:571
#17 0x00000000005f2b8d in ExplainOneQuery (query=<optimized out>,
cursorOptions=<optimized out>, into=0x0, es=0x12f63b0, 
    queryString=0x122daa0 "explain analyze select n, id, rhs from lhs left
join (select id, date1, date2, source,payload from rhs) as rhs using (id)
where n > bigint E'3240754568';", params=0x0, queryEnv=0x0) at
explain.c:404
#18 0x00000000005f31f3 in ExplainQuery (pstate=pstate@entry=0x12517c0,
stmt=stmt@entry=0x122f848, params=params@entry=0x0,
dest=dest@entry=0x12f6318) at explain.c:275
#19 0x00000000007e2785 in standard_ProcessUtility (pstmt=0x12d6168,
queryString=0x122daa0 "explain analyze select n, id, rhs from lhs left join
(select id, date1, date2, source,payload from rhs) as rhs using (id) where n
> bigint E'3240754568';", 
    context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
dest=0x12f6318, qc=0x7ffc88d12ac0) at utility.c:846
#20 0x00000000007e082e in PortalRunUtility (portal=portal@entry=0x128fb10,
pstmt=0x12d6168, isTopLevel=<optimized out>,
setHoldSnapshot=setHoldSnapshot@entry=true, dest=dest@entry=0x12f6318,
qc=qc@entry=0x7ffc88d12ac0) at pquery.c:1153
#21 0x00000000007e0ce5 in FillPortalStore (portal=portal@entry=0x128fb10,
isTopLevel=isTopLevel@entry=true) at pquery.c:1026
#22 0x00000000007e105c in PortalRun (portal=portal@entry=0x128fb10,
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=true,
run_once=run_once@entry=true, dest=dest@entry=0x12ef350,
altdest=altdest@entry=0x12ef350, qc=qc@entry=0x7ffc88d12ca0)
    at pquery.c:760
#23 0x00000000007dc774 in exec_simple_query (query_string=0x122daa0 "explain
analyze select n, id, rhs from lhs left join (select id, date1, date2,
source,payload from rhs) as rhs using (id) where n > bigint E'3240754568';")
at postgres.c:1238
#24 0x00000000007dda1a in PostgresMain (argc=<optimized out>,
argv=argv@entry=0x1257bd8, dbname=0x1257af0 "dastapov", username=<optimized
out>) at postgres.c:4347
#25 0x000000000048409f in BackendRun (port=<optimized out>, port=<optimized
out>) at postmaster.c:4550
#26 BackendStartup (port=0x124fad0) at postmaster.c:4234
#27 ServerLoop () at postmaster.c:1739
#28 0x0000000000758e05 in PostmasterMain (argc=argc@entry=9,
argv=argv@entry=0x1228420) at postmaster.c:1412
#29 0x000000000048546e in main (argc=9, argv=0x1228420) at main.c:210

If I go to frame 3 (sts_read_tuple), it is in this part of
sts_read_tuple:

    if (size > accessor->read_buffer_size)
    {
        size_t        new_read_buffer_size;

        if (accessor->read_buffer != NULL)
            pfree(accessor->read_buffer);
        new_read_buffer_size = Max(size, accessor->read_buffer_size * 2);
        accessor->read_buffer =
            MemoryContextAlloc(accessor->context, new_read_buffer_size);   <<<<----- HERE is where the memory allocation happens
        accessor->read_buffer_size = new_read_buffer_size;
    }

At this point:
(gdb) frame 3
#3  0x000000000092cd9d in sts_read_tuple (meta_data=0x7ffc88d1264c,
accessor=0x1360e00) at sharedtuplestore.c:450
450                MemoryContextAlloc(accessor->context, new_read_buffer_size);
(gdb) p accessor->read_buffer_size
$1 = 5641470
(gdb) p size
$2 = 1702125924
(gdb) p accessor->read_file->name
$5 = 0x1336af8 "i12of16.p1"
(gdb) p accessor->read_file->numFiles
$6 = 1
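
For reference, the failing check is the allocation-size sanity check from
src/include/utils/memutils.h: MemoryContextAlloc rejects any single request
above MaxAllocSize (just under 1 GB), which is why the bogus 1702125924-byte
request errors out instead of actually being attempted:

    #define MaxAllocSize    ((Size) 0x3fffffff) /* 1 gigabyte - 1 */

    #define AllocSizeIsValid(size)  ((Size) (size) <= MaxAllocSize)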


Looking in base/pgsql_tmp, I see pgsql_tmp4511.0.sharedfileset there, which
is 5.1 GB in size and contains:
$ ls -ltr
-rw------- 1 user group 94M Sep 21 16:05 i7of8.p2.0
-rw------- 1 user group 90M Sep 21 16:05 i7of8.p1.0
-rw------- 1 user group 88M Sep 21 16:05 i7of8.p0.0
-rw------- 1 user group 96M Sep 21 16:05 i6of8.p2.0
-rw------- 1 user group 74M Sep 21 16:05 i6of8.p1.0
-rw------- 1 user group 75M Sep 21 16:05 i6of8.p0.0
-rw------- 1 user group 84M Sep 21 16:05 i5of8.p2.0
-rw------- 1 user group 84M Sep 21 16:05 i5of8.p1.0
-rw------- 1 user group 77M Sep 21 16:05 i5of8.p0.0
-rw------- 1 user group 66M Sep 21 16:05 i4of8.p2.0
-rw------- 1 user group 78M Sep 21 16:05 i4of8.p1.0
-rw------- 1 user group 78M Sep 21 16:05 i4of8.p0.0
-rw------- 1 user group 69M Sep 21 16:05 i3of8.p2.0
-rw------- 1 user group 86M Sep 21 16:05 i3of8.p1.0
-rw------- 1 user group 83M Sep 21 16:05 i3of8.p0.0
-rw------- 1 user group 80M Sep 21 16:05 i2of8.p2.0
-rw------- 1 user group 87M Sep 21 16:05 i2of8.p1.0
-rw------- 1 user group 88M Sep 21 16:05 i2of8.p0.0
-rw------- 1 user group 97M Sep 21 16:05 i1of8.p2.0
-rw------- 1 user group 77M Sep 21 16:05 i1of8.p1.0
-rw------- 1 user group 90M Sep 21 16:05 i1of8.p0.0
-rw------- 1 user group 86M Sep 21 16:05 i9of16.p2.0
-rw------- 1 user group 78M Sep 21 16:05 i9of16.p1.0
-rw------- 1 user group 82M Sep 21 16:05 i9of16.p0.0
-rw------- 1 user group 78M Sep 21 16:05 i8of16.p2.0
-rw------- 1 user group 70M Sep 21 16:05 i8of16.p1.0
-rw------- 1 user group 73M Sep 21 16:05 i8of16.p0.0
-rw------- 1 user group 78M Sep 21 16:05 i7of16.p2.0
-rw------- 1 user group 72M Sep 21 16:05 i7of16.p1.0
-rw------- 1 user group 73M Sep 21 16:05 i7of16.p0.0
-rw------- 1 user group 71M Sep 21 16:05 i6of16.p2.0
-rw------- 1 user group 75M Sep 21 16:05 i6of16.p1.0
-rw------- 1 user group 74M Sep 21 16:05 i6of16.p0.0
-rw------- 1 user group 77M Sep 21 16:05 i5of16.p2.0
-rw------- 1 user group 74M Sep 21 16:05 i5of16.p1.0
-rw------- 1 user group 74M Sep 21 16:05 i5of16.p0.0
-rw------- 1 user group 73M Sep 21 16:05 i4of16.p2.0
-rw------- 1 user group 74M Sep 21 16:05 i4of16.p1.0
-rw------- 1 user group 67M Sep 21 16:05 i4of16.p0.0
-rw------- 1 user group 74M Sep 21 16:05 i3of16.p2.0
-rw------- 1 user group 78M Sep 21 16:05 i3of16.p1.0
-rw------- 1 user group 72M Sep 21 16:05 i3of16.p0.0
-rw------- 1 user group 79M Sep 21 16:05 i2of16.p2.0
-rw------- 1 user group 73M Sep 21 16:05 i2of16.p1.0
-rw------- 1 user group 74M Sep 21 16:05 i2of16.p0.0
-rw------- 1 user group 70M Sep 21 16:05 i1of16.p2.0
-rw------- 1 user group 78M Sep 21 16:05 i1of16.p1.0
-rw------- 1 user group 80M Sep 21 16:05 i1of16.p0.0
-rw------- 1 user group 86M Sep 21 16:05 i15of16.p2.0
-rw------- 1 user group 85M Sep 21 16:05 i15of16.p1.0
-rw------- 1 user group 86M Sep 21 16:05 i15of16.p0.0
-rw------- 1 user group 78M Sep 21 16:05 i14of16.p2.0
-rw------- 1 user group 80M Sep 21 16:05 i14of16.p1.0
-rw------- 1 user group 71M Sep 21 16:05 i14of16.p0.0
-rw------- 1 user group 75M Sep 21 16:05 i13of16.p2.0
-rw------- 1 user group 71M Sep 21 16:05 i13of16.p1.0
-rw------- 1 user group 78M Sep 21 16:05 i13of16.p0.0
-rw------- 1 user group 69M Sep 21 16:05 i12of16.p2.0
-rw------- 1 user group 78M Sep 21 16:05 i12of16.p1.0
-rw------- 1 user group 80M Sep 21 16:05 i12of16.p0.0
-rw------- 1 user group 73M Sep 21 16:05 i11of16.p2.0
-rw------- 1 user group 72M Sep 21 16:05 i11of16.p1.0
-rw------- 1 user group 75M Sep 21 16:05 i11of16.p0.0
-rw------- 1 user group 79M Sep 21 16:05 i10of16.p2.0
-rw------- 1 user group 77M Sep 21 16:05 i10of16.p1.0
-rw------- 1 user group 80M Sep 21 16:05 i10of16.p0.0

So it looks like we are reading i12of16.p1.0, which is only 78M in size. I
have retained a copy of the files in this location for further analysis, if
needed.

Unfortunately, I cannot share my dataset with you, and my attempts to come
up with a synthetic dataset that reproduces this issue have failed so far. I
am very confident, though, that the issue is not caused by database corruption,
as I can take my dataset as two csv files, \copy them into a freshly
initdb-ed database and immediately reproduce the issue.


The tables in question (lhs and rhs) have 953201 and 927968 rows respectively
and are this large:
\dt+
                              List of relations
 Schema │ Name │ Type  │  Owner   │ Persistence │ Access method │  Size   │ Description
────────┼──────┼───────┼──────────┼─────────────┼───────────────┼─────────┼─────────────
 public │ lhs  │ table │ dastapov │ permanent   │ heap          │ 49 MB   │
 public │ rhs  │ table │ dastapov │ permanent   │ heap          │ 1747 MB │


\d+ lhs
                                    Table "public.lhs"
 Column │  Type  │ Collation │ Nullable │ Default │ Storage  │ Stats target
│ Description 
────────┼────────┼───────────┼──────────┼─────────┼──────────┼──────────────┼─────────────
 id     │ text   │           │          │         │ extended │
│ 
 n      │ bigint │           │          │         │ plain    │
│ 
Access method: heap


\d+ rhs
                                   Table "public.rhs"
 Column  │ Type │ Collation │ Nullable │ Default │ Storage  │ Stats target │
Description 
─────────┼──────┼───────────┼──────────┼─────────┼──────────┼──────────────┼─────────────
 id      │ text │           │          │         │ extended │              │

 date1   │ date │           │          │         │ plain    │              │

 date2   │ date │           │          │         │ plain    │              │

 source  │ text │           │          │         │ extended │              │

 payload │ text │           │          │         │ extended │              │

Access method: heap

# select tablename, attname, avg_width, n_distinct from pg_stats where
tablename in ('lhs', 'rhs') order by 1;
 tablename │ attname │ avg_width │ n_distinct  
───────────┼─────────┼───────────┼─────────────
 lhs       │ id      │        12 │ -0.95371073
 lhs       │ n       │         8 │          -1
 rhs       │ payload │      1383 │          -1
 rhs       │ id      │        12 │          -1
 rhs       │ date1   │         4 │         609
 rhs       │ date2   │         4 │          11
 rhs       │ source  │        17 │        4275


A potentially interesting piece of the puzzle is that there are some long
outliers in rhs.payload and rhs.source, while the rest of the columns have
values of exactly avg_width bytes:

# select log_len, count(*) from (select log(length(payload))::int as log_len
from rhs) foo group by 1 order by 2 desc;
 log_len │ count  
─────────┼────────
       3 │ 840852
       4 │  77776
       5 │   8003
       6 │   1317
       7 │     20
(5 rows)

# select log_len, count(*) from (select log(length(source))::int as log_len
from rhs) foo group by 1 order by 2 desc;
 log_len │ count  
─────────┼────────
       1 │ 907692
       3 │  11793
       2 │   4109
       4 │   2920
       5 │   1454
(5 rows)

As I said, I tried to create tables with similar data using generate_series,
but failed to reproduce the issue. However, I am happy to provide additional
information / do additional debugging as necessary, given that I am unable
to share the offending dataset.

pg_settings:
# select name, setting from pg_settings where setting <> boot_val;
            name            │            setting            
────────────────────────────┼───────────────────────────────
 application_name           │ psql
 archive_command            │ (disabled)
 client_encoding            │ UTF8
 data_directory_mode        │ 0700
 default_text_search_config │ pg_catalog.english
 lc_collate                 │ en_US.utf8
 lc_ctype                   │ en_US.utf8
 lc_messages                │ en_US.utf8
 lc_monetary                │ en_US.utf8
 lc_numeric                 │ en_US.utf8
 lc_time                    │ en_US.utf8
 log_file_mode              │ 0600
 log_filename               │ postgresql-%a.log
 log_rotation_size          │ 0
 log_timezone               │ Europe/London
 log_truncate_on_rotation   │ on
 logging_collector          │ on
 max_locks_per_transaction  │ 128
 max_stack_depth            │ 2048
 server_encoding            │ UTF8
 shared_buffers             │ 4194304
 synchronous_commit         │ off
 temp_buffers               │ 32768
 TimeZone                   │ Europe/London
 unix_socket_directories    │ /home/dastapov/local/pg13-bug
 unix_socket_permissions    │ 0777
 wal_buffers                │ 2048
 work_mem                   │ 262144
(28 rows)

Let me know how I can assist further.


Re: BUG #17619: AllocSizeIsValid violation in parallel hash join

From
Thomas Munro
Date:
On Thu, Sep 22, 2022 at 7:46 PM PG Bug reporting form
<noreply@postgresql.org> wrote:
> (gdb) p size
> $2 = 1702125924

Thanks for the detailed report.  Hmm.  That size, on a little-endian
system, is equivalent to the byte sequence "date\0\0\0\0", which looks
pretty suspiciously like the inside of a tuple, and not its size.  We
must have got out of sync somehow.
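
For the curious, here is a minimal standalone check of that interpretation
on a little-endian machine:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int
main(void)
{
    uint64_t    size;

    /* The eight bytes 'd','a','t','e',0,0,0,0 reinterpreted as a size_t. */
    memcpy(&size, "date\0\0\0", 8);
    printf("%llu\n", (unsigned long long) size);    /* 1702125924 on little-endian */
    return 0;
}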

> Potentially interesting piece of the puzzle is that there are some long
> outliers in rhs.payload and rhs.source, but the rest of the columns have
> values that are exactly of avg_width bytes:
>
> # select log_len, count(*) from (select log(length(payload))::int as log_len
> from rhs) foo group by 1 order by 2 desc;
>  log_len │ count
> ─────────┼────────
>        3 │ 840852
>        4 │  77776
>        5 │   8003
>        6 │   1317
>        7 │     20
> (5 rows)

So there are some strings up to order 10^7 in length in there.  The
file format consists of chunks, with a special case for tuples that
don't fit in one chunk.  Perhaps there is a bug in that logic.  It is
exercised in our regression tests, but perhaps not enough.  I'll try
to repro this from your clues.
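
For reference, the chunk format in question looks roughly like this (abridged
from src/backend/utils/sort/sharedtuplestore.c; with the default 8 kB BLCKSZ
each chunk is 32 kB, i.e. an 8-byte header followed by 32760 bytes of tuple
data):

/* Abridged sketch of the on-disk chunk layout. */
#define STS_CHUNK_PAGES 4
#define STS_CHUNK_HEADER_SIZE offsetof(SharedTuplestoreChunk, data)
#define STS_CHUNK_DATA_SIZE (STS_CHUNK_PAGES * BLCKSZ - STS_CHUNK_HEADER_SIZE)

typedef struct SharedTuplestoreChunk
{
    int         overflow;   /* number of continuation chunks for an
                             * oversized tuple, or 0 for a regular chunk */
    int         ntuples;    /* number of tuples in this chunk */
    char        data[FLEXIBLE_ARRAY_MEMBER];
} SharedTuplestoreChunk;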



Re: BUG #17619: AllocSizeIsValid violation in parallel hash join

From
Dmitry Astapov
Date:


On Thu, Sep 22, 2022 at 9:44 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Thu, Sep 22, 2022 at 7:46 PM PG Bug reporting form
> <noreply@postgresql.org> wrote:
> > (gdb) p size
> > $2 = 1702125924
>
> Thanks for the detailed report.  Hmm.  That size, on a little-endian
> system, is equivalent to the byte sequence "date\0\0\0\0", which looks
> pretty suspiciously like the inside of a tuple, and not its size.  We
> must have got out of sync somehow.

Great catch! I added some debugging output to BufFileRead and sts_read_tuple, and it looks to me like the sequence of events immediately before the problem is this:

  1. sts_read_tuple is called multiple times to read from a particular file, and a bunch of tuples are read this way.
  2. sts_read_tuple is called once more. In my experiments I always had read_ntuples_available - 2 == read_ntuples at this point, but maybe this is just a coincidence.
     a. Inside this sts_read_tuple call, BufFileRead is called to read the metadata, and succeeds (and this part of the file looks -- to my naive eye -- exactly like any other tuple start).
     b. A second BufFileRead call happens to get the size, and reads the correct size (I can see what's in the file, I could find the corresponding values in the database tables, and it all lines up).
     c. A third BufFileRead happens to read the actual tuple. So far so good, and sts_read_tuple finishes.
  3. At this point, the position in the file (buf_file->curOffset + pos) points right after the last tuple read, but what's there does not look like the start of the next tuple to me. It seems as though another, longer tuple (with quite a long rhs.payload value) was written there first, and the tuple that was just read was somehow written over it, so we have "<tuple read in step 2 above><current read position here><what looks like leftovers from another tuple><a bunch of zeroes><new tuple start>".
  4. So another sts_read_tuple call happens, it reads 4 bytes of this garbage as metadata and the next 4 bytes as length, and this is when everything derails.

I also double-checked the logs of the (test) database where this issue was first spotted (and where a lot of queries similar to the one I am using to trigger it are run) and found that, along with "invalid memory alloc request size", we also got a couple of occurrences of "Unexpected chunk in shared tuplestore temporary file. Expected overflow chunk", which looks like another error that could be explained by the BufFileRead calls inside sts_read_tuple reading garbage.

I'm happy to build a patched version of 13.8 (or another revision if you want me to) to collect more debug information for you if this could help.

--
D. Astapov

Re: BUG #17619: AllocSizeIsValid violation in parallel hash join

From
Dmitry Astapov
Date:
Thomas, I think I know what is going on, and it happens in sts_puttuple when it needs to write exactly 32760 bytes. All line references below are taken at the tag REL_13_8.

Let's say that we enter sts_puttuple and on line 321 we compute a size of 32760 bytes. Let's also say that the buffer is currently partially filled, so a write of this size would not fit.

We proceed to flush the chunk on line 338, which resets the write pointer to 8 bytes past the start of the SharedTuplestoreChunk, leaving exactly 32760 bytes free.

We take the if on line 342 (the one with the comment "It still might not be enough space"), then proceed to write the whole 32760 bytes and reach line 373 with size == 0.

It looks like this code path was written with the assumption that size > 0 here and that sts_flush_chunk would be called immediately afterwards inside the loop on line 375. But because size == 0, the loop does not execute, sts_flush_chunk is not called, and since we haven't updated the write pointer so far, it stays at the beginning of the chunk. At this point we exit sts_puttuple, and the next call of sts_puttuple will overwrite the tuple we just memcpy'ed into the chunk. If this new tuple is shorter, the remaining part of the chunk will be garbage from the overwritten tuple.

I think the >= on line 342 should be just >. I tried this change locally, and it fixed the issue for me.

Do you agree with my analysis?
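
To illustrate the arithmetic, here is a tiny standalone model of that write
path (just a sketch of the buffer bookkeeping, not the PostgreSQL source),
using the same numbers: 32760 bytes of chunk data space and 4 bytes of
meta-data per tuple, as in the parallel hash join case:

#include <stdio.h>
#include <stddef.h>

#define CHUNK_DATA_SIZE 32760
#define META_DATA_SIZE  4

static char  chunk[CHUNK_DATA_SIZE];
static char *write_pointer = chunk;
static char *write_end = chunk + CHUNK_DATA_SIZE;

static void
flush_chunk(void)
{
    /* Pretend to write the chunk to the file and start a fresh one. */
    write_pointer = chunk;
}

static void
put_tuple(size_t tuple_len)
{
    size_t      size = META_DATA_SIZE + tuple_len;  /* == 32760 exactly */

    if (write_pointer + size >= write_end)          /* line 322's ">=" */
    {
        flush_chunk();

        /* "It still might not be enough space" -- line 342's ">=" */
        if (write_pointer + size >= write_end)
        {
            /* Overflow path: copy the part that fits into this chunk... */
            size_t      written = (write_end - write_pointer) - META_DATA_SIZE;

            size -= META_DATA_SIZE + written;       /* size is now 0 */

            /*
             * ...then the real code loops "while (size > 0)" writing
             * overflow chunks.  For an exact fit, size is already 0, so the
             * loop never runs, nothing is flushed, and write_pointer is left
             * at the start of the chunk: the next tuple overwrites this one.
             */
            return;
        }
    }

    write_pointer += size;      /* normal path: tuple stored, pointer advanced */
}

int
main(void)
{
    put_tuple(CHUNK_DATA_SIZE - META_DATA_SIZE);    /* a 32756-byte tuple */
    printf("offset after exact-fit tuple: %ld (should be %d)\n",
           (long) (write_pointer - chunk), CHUNK_DATA_SIZE);
    /* Prints 0 with ">="; prints 32760 once both tests are changed to ">". */
    return 0;
}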

--
D. Astapov

Re: BUG #17619: AllocSizeIsValid violation in parallel hash join

From
Dmitry Astapov
Date:
And I think the if on line 322 should be ">" and not ">=" as well: with ">=", a tuple that would exactly fit the remaining space in the current chunk is needlessly treated as not fitting and sent down the same path.

--
D. Astapov

Re: BUG #17619: AllocSizeIsValid violation in parallel hash join

From
Thomas Munro
Date:
On Fri, Sep 23, 2022 at 4:38 AM Dmitry Astapov <dastapov@gmail.com> wrote:
>> I think >= on line 342 should be just > . I tried this change locally, and it fixed the issue for me.
>>
>> Do you agree with my analysis?

Yeah.  It was hard to repro the problem from SQL, so I wrote a C
module to show the issue.  CREATE EXTENSION test_sts; SELECT
test_sts(10, 32756); shows it, and is fixed by your proposed change.
I will look at the analysis and fix some more next week, and see if
the test can be improved and is worth keeping.

While testing with that module I found another bug: the
per-participant npages counter was not explicitly initialised to zero
in sts_initialize().  That wasn't exactly a problem when the code was
written because new DSM memory is always zeroed and this always
happens in new DSM memory, but it shows up in this test module because
it uses palloc() memory instead.  It *is* a problem since v14, if you
use min_dynamic_shared_memory for a pool of recyclable shared memory,
because then it is not zeroed.
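
For reference, the fix is essentially a one-line initialisation; roughly
(field names as in sharedtuplestore.c; a sketch, not the committed patch):

/* In sts_initialize(), alongside the other per-participant fields: */
for (i = 0; i < participants; ++i)
{
    LWLockInitialize(&sts->participants[i].lock,
                     LWTRANCHE_SHARED_TUPLESTORE);
    sts->participants[i].read_page = 0;
    sts->participants[i].npages = 0;    /* previously left to "already zeroed" DSM */
    sts->participants[i].writing = false;
}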

Attachment

Re: BUG #17619: AllocSizeIsValid violation in parallel hash join

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> While testing with that module I found another bug: the
> per-participant npages counter was not explicitly initialised to zero
> in sts_initialize().  That wasn't exactly a problem when the code was
> written because new DSM memory is always zeroed and this always
> happens in new DSM memory, but it shows up in this test module because
> it uses palloc() memory instead.  It *is* a problem since v14, if you
> use min_dynamic_shared_memory for a pool of recyclable shared memory,
> because then it is not zeroed.

That's a fairly scary observation.  What other places are silently
expecting such memory to be zeroed?  Do we need to fix things so
that min_dynamic_shared_memory doesn't break this API?

            regards, tom lane



Re: BUG #17619: AllocSizeIsValid violation in parallel hash join

From
Thomas Munro
Date:
On Sat, Sep 24, 2022 at 11:19 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > While testing with that module I found another bug: the
> > per-participant npages counter was not explicitly initialised to zero
> > in sts_initialize().  That wasn't exactly a problem when the code was
> > written because new DSM memory is always zeroed and this always
> > happens in new DSM memory, but it shows up in this test module because
> > it uses palloc() memory instead.  It *is* a problem since v14, if you
> > use min_dynamic_shared_memory for a pool of recyclable shared memory,
> > because then it is not zeroed.
>
> That's a fairly scary observation.  What other places are silently
> expecting such memory to be zeroed?  Do we need to fix things so
> that min_dynamic_shared_memory doesn't break this API?

It wasn't assuming the memory was zeroed generally -- it initialises
various other members -- it just failed to do that for one member.
I'll survey other DSM users this week to see that they aren't relying
on zeroes like that.  For the parallel query DSM segment,
access/transam/parallel.c even has a case that replaces dsm_create()
with MemoryContextAlloc() sans MCXT_ALLOC_ZERO flag, and I am not
aware of any documented or undocumented expectation that it's
zero-initialised.  A quick check like this in dsm_create() doesn't
break any tests:

+#ifdef CLOBBER_FREED_MEMORY
+       memset(seg->mapped_address, 0xff, seg->mapped_size);
+#endif



Re: BUG #17619: AllocSizeIsValid violation in parallel hash join

From
Peter Geoghegan
Date:
On Fri, Sep 23, 2022 at 4:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> That's a fairly scary observation.  What other places are silently
> expecting such memory to be zeroed?  Do we need to fix things so
> that min_dynamic_shared_memory doesn't break this API?

Reminds me of the recent parallel VACUUM bug, fixed by commit
662ba729. That involved uninitialized DSM, too.

As I said at the time: we should be using Valgrind for this.

-- 
Peter Geoghegan



Re: BUG #17619: AllocSizeIsValid violation in parallel hash join

From
Peter Geoghegan
Date:
On Fri, Sep 23, 2022 at 5:12 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Reminds me of the recent parallel VACUUM bug, fixed by commit
> 662ba729. That involved uninitialized DSM, too.
>
> As I said at the time: we should be using Valgrind for this.

Uh, why is it okay that we don't zero-initialize memory used for
things like PARALLEL_KEY_BUFFER_USAGE and PARALLEL_KEY_WAL_USAGE?

(No prizes for guessing why I ask.)
-- 
Peter Geoghegan



Re: BUG #17619: AllocSizeIsValid violation in parallel hash join

From
Peter Geoghegan
Date:
On Fri, Sep 23, 2022 at 6:31 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Uh, why is it okay that we don't zero-initialize memory used for
> things like PARALLEL_KEY_BUFFER_USAGE and PARALLEL_KEY_WAL_USAGE?

Ping? I'm pretty sure that this needs to be fixed.

-- 
Peter Geoghegan



Re: BUG #17619: AllocSizeIsValid violation in parallel hash join

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> On Fri, Sep 23, 2022 at 6:31 PM Peter Geoghegan <pg@bowt.ie> wrote:
>> Uh, why is it okay that we don't zero-initialize memory used for
>> things like PARALLEL_KEY_BUFFER_USAGE and PARALLEL_KEY_WAL_USAGE?

> Ping? I'm pretty sure that this needs to be fixed.

That scares me too, but how come things aren't falling over
routinely?  Can we even make a test case where it breaks?

I think I'd personally prefer to treat such memory more like we
treat palloc'd memory, ie there's *not* a guarantee of zero
initialization and indeed testing builds intentionally clobber it.

            regards, tom lane



Re: BUG #17619: AllocSizeIsValid violation in parallel hash join

From
Peter Geoghegan
Date:
On Tue, Sep 27, 2022 at 9:24 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> That scares me too, but how come things aren't falling over
> routinely?  Can we even make a test case where it breaks?

I'm not completely sure, but I think that the explanation might just
be that the memory is likely to be "zero initialized" in practice.
Even when it isn't we're still only talking about instrumentation
counters that start out with garbage values -- so nothing truly
critical.

My main concern is the big picture, and not so much these specific
lapses. Less benign crash bugs like the one fixed by commit 662ba729
are much less likely with proper testing.

> I think I'd personally prefer to treat such memory more like we
> treat palloc'd memory, ie there's *not* a guarantee of zero
> initialization and indeed testing builds intentionally clobber it.

Isn't that already how it works? The problem is that it's not
particularly clear that that's how it works right now. And that the
dynamic shared memory stuff isn't tested via the same techniques that
we use for palloc.

We could more or less apply the same techniques and expect the same
good results, but we don't do that right now.

-- 
Peter Geoghegan



Re: BUG #17619: AllocSizeIsValid violation in parallel hash join

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> On Tue, Sep 27, 2022 at 9:24 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I think I'd personally prefer to treat such memory more like we
>> treat palloc'd memory, ie there's *not* a guarantee of zero
>> initialization and indeed testing builds intentionally clobber it.

> Isn't that already how it works? The problem is that it's not
> particularly clear that that's how it works right now. And that the
> dynamic shared memory stuff isn't tested via the same techniques that
> we use for palloc.

Right, the missing piece is the intentional clobber.  Thomas indicated
he'd made such a test locally, but I think it needs full support with
the same options that mcxt.c has (CLOBBER_FREED_MEMORY and so on
--- although unmapping the memory is equally good for that, if we
always do it).

            regards, tom lane



Re: BUG #17619: AllocSizeIsValid violation in parallel hash join

From
Peter Geoghegan
Date:
On Tue, Sep 27, 2022 at 9:44 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Right, the missing piece is the intentional clobber.  Thomas indicated
> he'd made such a test locally, but I think it needs full support with
> the same options that mcxt.c has (CLOBBER_FREED_MEMORY and so on
> --- although unmapping the memory is equally good for that, if we
> always do it).

I'd also like to use Valgrind here. That's how I noticed this issue,
in fact. I wrote a very rough patch (too rough to even post as an
FYI).

I don't think that CLOBBER_FREED_MEMORY would necessarily have
detected the problems with
PARALLEL_KEY_BUFFER_USAGE/PARALLEL_KEY_WAL_USAGE (or the analogous
problems in nbtsort.c and vacuumparallel.c). The big advantage of
custom Valgrind instrumentation that marks newly allocated memory
undefined (not inaccessible) is the provenance stuff. Valgrind will
track the provenance of the uninitialized memory reliably, even as the
values are copied around. Any control flow that relies on the value
will make Valgrind complain.
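
A rough sketch of the shape that marking could take (not the patch itself;
it assumes PostgreSQL's wrappers in src/include/utils/memdebug.h, which are
no-ops unless built with USE_VALGRIND):

#include "utils/memdebug.h"     /* VALGRIND_MAKE_MEM_UNDEFINED() et al. */

/*
 * After mapping a freshly created DSM segment, tell Valgrind its contents
 * are undefined rather than merely accessible.  Memcheck then tracks those
 * undefined bytes as they are copied around and complains as soon as
 * control flow depends on one of them.
 */
VALGRIND_MAKE_MEM_UNDEFINED(seg->mapped_address, seg->mapped_size);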

-- 
Peter Geoghegan



Re: BUG #17619: AllocSizeIsValid violation in parallel hash join

From
Peter Geoghegan
Date:
On Tue, Sep 27, 2022 at 9:40 AM Peter Geoghegan <pg@bowt.ie> wrote:
> I'm not completely sure, but I think that the explanation might just
> be that the memory is likely to be "zero initialized" in practice.
> Even when it isn't we're still only talking about instrumentation
> counters that start out with garbage values -- so nothing truly
> critical.

Actually, it looks like we'll always have each worker zero-initialize
their space for buffer stats and WAL stats in InstrEndParallelQuery(),
which is also called by nbtsort.c and by parallel VACUUM. So as it
turns out there is no issue with uninitialized memory here.

I don't think that that changes all that much. It still seems like a
good idea to do all required zero initialization in the leader, up
front. We're talking about very small structures here, so there is no
reason to not have a uniform, centralized approach in
ParallelQueryMain(). We actually zero-initialize the dead_items array
for PARALLEL_VACUUM_KEY_DEAD_ITEMS, which is probably unnecessary and
has non-trivial overhead, but we don't zero-initialize the relatively
tiny WAL and buffer instrumentation counters nearby.

-- 
Peter Geoghegan



Re: BUG #17619: AllocSizeIsValid violation in parallel hash join

From
Peter Geoghegan
Date:
On Tue, Sep 27, 2022 at 9:44 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Right, the missing piece is the intentional clobber.

That does seem like the best place to start. The attached patch adds
clobbering that works exactly as you'd expect. This approach is
obviously correct. It also doesn't require any reasoning about
Valgrind's treatment of memory mappings for shared memory, which is
quite complicated given the inconsistent rules about who initializes
what memory (if it's leader or workers).

I find that the tests pass with this patch -- so it probably won't
catch the bug that Thomas mentioned via running the tests (at least
not reliably). However, if I revert parallel VACUUM bugfix commit
662ba729 and then run the tests, they fail very reliably, in several
places. That seems like a big improvement.

I believe that Thomas was going to do something like this anyway. I'm
happy to leave it up to him, but I can pursue this separately if that
makes sense.
-- 
Peter Geoghegan

Attachment

Re: BUG #17619: AllocSizeIsValid violation in parallel hash join

From
Thomas Munro
Date:
On Wed, Sep 28, 2022 at 7:33 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, Sep 27, 2022 at 9:44 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Right, the missing piece is the intentional clobber.
>
> That does seem like the best place to start. The attached patch adds
> clobbering that works exactly as you'd expect. This approach is
> obviously correct. It also doesn't require any reasoning about
> Valgrind's treatment of memory mappings for shared memory, which is
> quite complicated given the inconsistent rules about who initializes
> what memory (if it's leader or workers).
>
> I find that the tests pass with this patch -- so it probably won't
> catch the bug that Thomas mentioned via running the tests (at least
> not reliably). However, if I revert parallel VACUUM bugfix commit
> 662ba729 and then run the tests, they fail very reliably, in several
> places. That seems like a big improvement.

The reason it doesn't catch that bug on master is that the npages
shmem variable is only used to prevent further reading once a scan
hits the end of a shared tuplestore chunk and needs to decide whether
to read a new one; if a chunk is only partially filled, we end the
scan sooner because there's a number-of-items counter in the chunk
header.  I noticed because the test module I wrote to study Dmitry's
report fills chunks exactly to the end, so I assume the clobber patch
+ that test module patch would reveal the problem.

I was assuming it didn't break the case you mentioned because that's
just stats counters (maybe those finish up wrong but that's probably
not a failure), but now it sounds like you've seen another reason.

> I believe that Thomas was going to do something like this anyway. I'm
> happy to leave it up to him, but I can pursue this separately if that
> makes sense.

Why not clobber "lower down" in dsm_create(), as I showed?  You don't
have to use the table-of-contents mechanism to use DSM memory.



Re: BUG #17619: AllocSizeIsValid violation in parallel hash join

From
Peter Geoghegan
Date:
On Tue, Sep 27, 2022 at 12:15 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > I believe that Thomas was going to do something like this anyway. I'm
> > happy to leave it up to him, but I can pursue this separately if that
> > makes sense.
>
> Why not clobber "lower down" in dsm_create(), as I showed?  You don't
> have to use the table-of-contents mechanism to use DSM memory.

I have no strong feelings either way. That approach might well be better.

It might even be useful to do both together. The redundancy probably
wouldn't hurt, and might even help in the future (it might not stay
redundant forever). We don't necessarily need to worry too much about
added cycles for something like this. Just as long as it's not
*completely* gratuitous.

-- 
Peter Geoghegan



Re: BUG #17619: AllocSizeIsValid violation in parallel hash join

From
Masahiko Sawada
Date:
On Wed, Sep 28, 2022 at 6:17 AM Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Tue, Sep 27, 2022 at 12:15 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > > I believe that Thomas was going to do something like this anyway. I'm
> > > happy to leave it up to him, but I can pursue this separately if that
> > > makes sense.
> >
> > Why not clobber "lower down" in dsm_create(), as I showed?  You don't
> > have to use the table-of-contents mechanism to use DSM memory.
>
> I have no strong feelings either way. That approach might well be better.
>
> It might even be useful to do both together. The redundancy probably
> wouldn't hurt, and might even help in the future (it might not stay
> redundant forever). We don't necessarily need to worry too much about
> added cycles for something like this. Just as long as it's not
> *completely* gratuitous.

+1

I think we can also clobber the memory in dsm_detach() if the memory
comes from the pool configured by min_dynamic_shared_memory.

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: BUG #17619: AllocSizeIsValid violation in parallel hash join

From
Masahiko Sawada
Date:
On Fri, Sep 30, 2022 at 3:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Sep 28, 2022 at 6:17 AM Peter Geoghegan <pg@bowt.ie> wrote:
> >
> > On Tue, Sep 27, 2022 at 12:15 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > > > I believe that Thomas was going to do something like this anyway. I'm
> > > > happy to leave it up to him, but I can pursue this separately if that
> > > > makes sense.
> > >
> > > Why not clobber "lower down" in dsm_create(), as I showed?  You don't
> > > have to use the table-of-contents mechanism to use DSM memory.
> >
> > I have no strong feelings either way. That approach might well be better.
> >
> > It might even be useful to do both together. The redundancy probably
> > wouldn't hurt, and might even help in the future (it might not stay
> > redundant forever). We don't necessarily need to worry too much about
> > added cycles for something like this. Just as long as it's not
> > *completely* gratuitous.
>
> +1
>
> I think we can also clobber the memory in dsm_detach() if the memory
> comes from the pool configured by min_dynamic_shared_memory.

I updated the patch so that it clobbers memory in both dsm_create()
and dsm_detach().

Regards,

-- 
Masahiko Sawada
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment