Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker - Mailing list pgsql-bugs

From Craig Milhiser
Subject Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
Date
Msg-id CA+wnhO3t7xqCiDmGduxgg1Mv5M5R-GeQd0kzVcJKkUFkj91Wbw@mail.gmail.com
Whole thread Raw
In response to Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker  (Andrei Lepikhov <lepihov@gmail.com>)
Responses Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker
List pgsql-bugs
Thomas Munro wrote
> I wonder if there something could be wrong with Parallel Hash Right Join...Definitely keen to see a reproducer
> with synthetic data if you can come up with one

Andrei Lepikhov wrote
> I can imagine only a data skew or a logical bug in this part of the code.
> But without direct perf and gdb touch, it is hard to resolve the issue
> by just gazing into the code.

Both of you are correct. 

I have reproduced the problem with synthetic data. The script is below. Thank you for your patience with me.

There are comments in the script. Please let me know of any questions or if you cannot reproduce it. If you want me to file a report via the form, let me know.

Using Postgres v17 with out of the box configuration. 

drop table test_users;
create table test_users (account_id bigint not null, users_id bigint not null constraint test_users_pkey primary key, first_name varchar(105), last_name varchar(105), user_image_id bigint);

-- The account we are interested, data numbers are negative to eliminate duplicates and help with debugging
insert into test_users (account_id, users_id, first_name, last_name, user_image_id)
    SELECT -1, -1 * i, md5(random()::text), md5(random()::text), case when random() < 0.95 then null else -1 * i end
    FROM generate_series(1, 925_000) AS t(i)
;

-- Make enough other records to get the skew to force a Parallel Hash Right Join and the query breaks
-- Change the "< 0.50" to "< 0.95" to get a skew for a Parallel Hash Left Join and the query works
-- 0.50 makes a right join and breaks; 0.95 makes a left join and works
-- Changes how many users are in user_image which, relative to the number of users and accounts, is the key skew that I found
-- Data numbers are positive
insert into test_users(account_id, users_id, first_name, last_name, user_image_id)
    SELECT random(10, 50_000)::bigint, i, md5(random()::text), md5(random()::text), case when random() < 0.50 then null else i end
    FROM generate_series(1, 50_000_000) AS t(i)
;

create index user_img_fk_idx on test_users using btree (user_image_id);

drop table test_user_image;
create table test_user_image(user_image_id bigint not null constraint test_user_image_pkey primary key);
insert into test_user_image(user_image_id) select user_image_id from test_users where user_image_id is not null;

ALTER TABLE test_users ADD CONSTRAINT users_user_image_fk FOREIGN KEY (user_image_id) REFERENCES test_user_image(user_image_id);

commit;
analyze test_users;
analyze test_user_image;

-- at 0 workers the query will work
set max_parallel_workers_per_gather = 0;

SELECT U.USERS_ID
    ,  U.FIRST_NAME
    ,  U.LAST_NAME
FROM test_USERS U
    LEFT JOIN test_USER_IMAGE UI
        ON U.USER_IMAGE_ID = UI.USER_IMAGE_ID
WHERE U.ACCOUNT_ID = -1
GROUP BY U.USERS_ID
       , UI.USER_IMAGE_ID
ORDER BY U.LAST_NAME ASC
       , U.FIRST_NAME ASC
       , U.USERS_ID ASC
LIMIT 20
OFFSET 0
;

set max_parallel_workers_per_gather = 2;
-- Explain the above query. For it to break, a Parallel Hash Right Join is executed. 
-- When a Parallel Hash Left Join is executed, the query works. Switch between left and right by changing the skew as noted above. 
-- when run with stock Postgres 17, the Invalid DSA memory allocation occurs, which started this thread
-- when run with the patch to fix the Invalid DSA memory allocation, the OOM occurs.
-- I reproduced the Invalid DSA memory allocation with AWS Aurora v16.2. Naturally I cannot try patches there. The above was recreated with stock Postgres v17 on a plain ec2 instance.

Thank you for your time

Craig



pgsql-bugs by date:

Previous
From: Andrei Lepikhov
Date:
Subject: Re: Question of Parallel Hash Join on TPC-H Benchmark
Next
From: Thomas Munro
Date:
Subject: Re: Reference to - BUG #18349: ERROR: invalid DSA memory alloc request size 1811939328, CONTEXT: parallel worker