Thread: Re: BUG #16925: ERROR: invalid DSA memory alloc request size 1073741824 CONTEXT: parallel worker

On Thu, Mar 18, 2021 at 2:06 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sat, Mar 13, 2021 at 11:59 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > > > 2021-03-12 19:45:37.352 CET [316243] ERROR:  XX000: invalid DSA memory alloc
> > > > request size 1073741824
>
> After an off-list exchange with Karen and a colleague, who ran this with
> the ERROR changed to a PANIC and examined the smoldering core, the
> problem turns out to be a failure to keep the hashtable bucket array
> <= MaxAllocSize in one code path.  Although commit 86a2218e fixed
> another version of that problem a while ago, it can still be
> exceeded... by one byte... when we expand from one batch to many.
> Will propose a fix.

Here's a standalone reproducer with the right parameters to reach this
error, and a simple fix.  (Definitely room for more improvements in
this area of code... but that'll have to be a project for later.)

===8<===
shared_buffers=2GB
fsync=off
max_wal_size=10GB
min_dynamic_shared_memory=2GB
===8<===
create table bigger_than_it_looks as
  select generate_series(1, 256000000) as id;
alter table bigger_than_it_looks set (autovacuum_enabled = 'false');
alter table bigger_than_it_looks set (parallel_workers = 1);
analyze bigger_than_it_looks;
update pg_class set reltuples = 5000000 where relname = 'bigger_than_it_looks';
===8<===
postgres=# set work_mem = '4.5GB';
SET
postgres=# explain analyze select count(*) from bigger_than_it_looks
t1 join bigger_than_it_looks t2 using (id);
ERROR:  invalid DSA memory alloc request size 1073741824
CONTEXT:  parallel worker
===8<===

On 18/3/2021 16:21, Thomas Munro wrote:
> ===8<===
> shared_buffers=2GB
> fsync=off
> max_wal_size=10GB
> min_dynamic_shared_memory=2GB
> ===8<===
> create table bigger_than_it_looks as
>    select generate_series(1, 256000000) as id;
> alter table bigger_than_it_looks set (autovacuum_enabled = 'false');
> alter table bigger_than_it_looks set (parallel_workers = 1);
> analyze bigger_than_it_looks;
> update pg_class set reltuples = 5000000 where relname = 'bigger_than_it_looks';
> ===8<===
> postgres=# set work_mem = '4.5GB';
> SET
> postgres=# explain analyze select count(*) from bigger_than_it_looks
> t1 join bigger_than_it_looks t2 using (id);
> ERROR:  invalid DSA memory alloc request size 1073741824
> CONTEXT:  parallel worker
> ===8<===

This bug still annoyingly interrupts the queries of some clients. Maybe
we can complete this work?
It is stable and reproducible on all PG versions. The case:
work_mem = '2GB'

test table:
-----------
CREATE TABLE bigger_than_it_looks AS
   SELECT generate_series(1, 512E6) AS id;
ALTER TABLE bigger_than_it_looks SET (autovacuum_enabled = 'false');
ALTER TABLE bigger_than_it_looks SET (parallel_workers = 1);
ANALYZE bigger_than_it_looks;
UPDATE pg_class SET reltuples = 5000000
WHERE relname = 'bigger_than_it_looks';

The number of parallel workers affects how much memory the hash table
is allowed to use, and so correlates with the work_mem value needed to
reproduce the bug (keep in mind also that the hash_mem_multiplier
default has changed recently).

Query:
SELECT sum(a.id)
FROM bigger_than_it_looks a
JOIN bigger_than_it_looks b ON a.id = b.id
LEFT JOIN bigger_than_it_looks c ON b.id = c.id;

Any query that uses a Parallel Hash Join can hit this. The case here
is as follows.
The first batch contains a lot of tuples (at the moment of the
increase, about 67 million). We calculate the number of buckets needed,
approximately 134 million (134217728). With sizeof(dsa_pointer_atomic)
being 8 in my case, the bucket array comes to 1073741824 bytes, which
overflows the maximum DSA allocation (1073741823 bytes).
See the new patch in the attachment.

-- 
regards,
Andrei Lepikhov
Postgres Professional

On Wed, Dec 6, 2023 at 5:46 PM Andrei Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> On 18/3/2021 16:21, Thomas Munro wrote:
> > ===8<===
> > shared_buffers=2GB
> > fsync=off
> > max_wal_size=10GB
> > min_dynamic_shared_memory=2GB
> > ===8<===
> > create table bigger_than_it_looks as
> >    select generate_series(1, 256000000) as id;
> > alter table bigger_than_it_looks set (autovacuum_enabled = 'false');
> > alter table bigger_than_it_looks set (parallel_workers = 1);
> > analyze bigger_than_it_looks;
> > update pg_class set reltuples = 5000000 where relname = 'bigger_than_it_looks';
> > ===8<===
> > postgres=# set work_mem = '4.5GB';
> > SET
> > postgres=# explain analyze select count(*) from bigger_than_it_looks
> > t1 join bigger_than_it_looks t2 using (id);
> > ERROR:  invalid DSA memory alloc request size 1073741824
> > CONTEXT:  parallel worker
> > ===8<===
>
> This bug still annoyingly interrupts the queries of some clients. Maybe
> complete this work?

Ugh, sorry.  We had a report, a repro and a candidate patch a couple
of years ago, but I somehow completely forgot about it.  I have now
added a CF entry (#4689).



Hi! Thank you for your work on the subject.

On 11.12.2023 02:30, Alena Rybakina wrote:
>
> On 06.12.2023 07:46, Andrei Lepikhov wrote:
>> On 18/3/2021 16:21, Thomas Munro wrote:
>>> ===8<===
>>> shared_buffers=2GB
>>> fsync=off
>>> max_wal_size=10GB
>>> min_dynamic_shared_memory=2GB
>>> ===8<===
>>> create table bigger_than_it_looks as
>>>    select generate_series(1, 256000000) as id;
>>> alter table bigger_than_it_looks set (autovacuum_enabled = 'false');
>>> alter table bigger_than_it_looks set (parallel_workers = 1);
>>> analyze bigger_than_it_looks;
>>> update pg_class set reltuples = 5000000 where relname = 
>>> 'bigger_than_it_looks';
>>> ===8<===
>>> postgres=# set work_mem = '4.5GB';
>>> SET
>>> postgres=# explain analyze select count(*) from bigger_than_it_looks
>>> t1 join bigger_than_it_looks t2 using (id);
>>> ERROR:  invalid DSA memory alloc request size 1073741824
>>> CONTEXT:  parallel worker
>>> ===8<===
>>
>> This bug still annoyingly interrupts the queries of some clients. 
>> Maybe complete this work?
>> It is stable and reproduces on all PG versions. The case:
>> work_mem = '2GB'
>>
>> test table:
>> -----------
>> CREATE TABLE bigger_than_it_looks AS
>>   SELECT generate_series(1, 512E6) AS id;
>> ALTER TABLE bigger_than_it_looks SET (autovacuum_enabled = 'false');
>> ALTER TABLE bigger_than_it_looks SET (parallel_workers = 1);
>> ANALYZE bigger_than_it_looks;
>> UPDATE pg_class SET reltuples = 5000000
>> WHERE relname = 'bigger_than_it_looks';
>>
>> The parallel workers number impacts size of the allowed memory under 
>> the hash table and in that sense correlates with the work_mem value, 
>> needed for the bug reproduction (keep in mind also that 
>> hash_mem_multiplier has been changed recently).
>>
>> Query:
>> SELECT sum(a.id)
>> FROM bigger_than_it_looks a
>> JOIN bigger_than_it_looks b ON a.id =b.id
>> LEFT JOIN bigger_than_it_looks c ON b.id = c.id;
>>
>> Any query that needs Parallel Hash Join can be found here. The case 
>> here is as follows.
>> The first batch contains a lot of tuples (on increment, it has about 
>> 67mln tuples.). We calculate the number of buckets needed, 
>> approximately 134 mln (134217728). Remember, the size of 
>> dsa_pointer_atomic is 8 in my case, and it ends up with an overflow 
>> of the max number of DSA, which can be allocated (1073741823 bytes).
>> See the new patch in the attachment.
I've looked through your code and haven't seen any errors yet, but I 
think we could rewrite these lines of code as follows:

-                    dbuckets = ceil(dtuples / NTUP_PER_BUCKET);
-                    dbuckets = Min(dbuckets, max_buckets);
-                    new_nbuckets = (int) dbuckets;
-                    new_nbuckets = Max(new_nbuckets, 1024);
+                    dbuckets = Min(ceil(dtuples / NTUP_PER_BUCKET), max_buckets);
+                    new_nbuckets = Max((int) dbuckets, 1024);

I have attached a diff file with the proposed changes to this email.

-- 
Regards,
Alena Rybakina
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

On 10/12/2023 09:41, Thomas Munro wrote:
> On Wed, Dec 6, 2023 at 5:46 PM Andrei Lepikhov
> <a.lepikhov@postgrespro.ru> wrote:
>> This bug still annoyingly interrupts the queries of some clients. Maybe
>> complete this work?
> 
> Ugh, sorry.  We had a report, a repro and a candidate patch a couple
> of years ago, but I somehow completely forgot about it.  I have now
> added a CF entry (#4689).

Thanks. I think the Parallel Hash Join code is worth exploring for
extreme cases anyway, but in this case we have quite a clear bug and
must fix it.

-- 
regards,
Andrei Lepikhov
Postgres Professional




On Mon, Dec 11, 2023 at 4:26 AM Andrei Lepikhov
<a.lepikhov@postgrespro.ru> wrote:
> On 10/12/2023 09:41, Thomas Munro wrote:
> > On Wed, Dec 6, 2023 at 5:46 PM Andrei Lepikhov
> > <a.lepikhov@postgrespro.ru> wrote:
> >> This bug still annoyingly interrupts the queries of some clients. Maybe
> >> complete this work?
> >
> > Ugh, sorry.  We had a report, a repro and a candidate patch a couple
> > of years ago, but I somehow completely forgot about it.  I have now
> > added a CF entry (#4689).
>
> Thanks. I think the Parallel Hash Join code is worth discovering extreme
> cases anyway, but in that case, we have quite a clear bug and must fix it.

Thank you.  I've pushed and backpatched this to PG 12 with some editing from me.

------
Regards,
Alexander Korotkov