Re: parallel data loading for pgbench -i - Mailing list pgsql-hackers

From lakshmi
Subject Re: parallel data loading for pgbench -i
Msg-id CAEvyyTircZ-tHgap=J6Aog0CBgXp4Dqx6dHYyK1iqgfoT+8D_A@mail.gmail.com
In response to Re: parallel data loading for pgbench -i  (Mircea Cadariu <cadariu.mircea@gmail.com>)
Responses Re: parallel data loading for pgbench -i
List pgsql-hackers

Hi Mircea, Hayato,

Thanks for the updated v2 patches.

I applied 0001 and 0002 on 19devel and ran some tests. The results look consistent.

For scale 100, parallel loading speeds up data generation, but in the non-partitioned case, the VACUUM phase becomes noticeably slower. In contrast, the partitioned + parallel case performs best overall with much lower vacuum cost.

For scale 500, I see the same pattern: non-partitioned parallel runs are dominated by VACUUM time, while the partitioned setup shows a clear overall speedup.

I also verified correctness, and row counts match expected values.

So overall, the benefit of parallel loading is much clearer in the partitioned case.

I’ll try to look further into the VACUUM behavior.

Thanks again for the work on this.

Best regards,
Lakshmi


On Fri, Mar 13, 2026 at 11:59 PM Mircea Cadariu <cadariu.mircea@gmail.com> wrote:
Hi Lakshmi, Hayato,


Thanks a lot for your input!

I'm not sure why the VACUUM phase takes longer compared to the serial
run. We can potentially get a clue with a profiler. I know there is an
ongoing effort to introduce parallel heap vacuum [1] which I expect will
help with this.

I have applied your code comments in the attached v2 patch. Answers to
your questions are below.

> Also, why is -j accepted in case of non-partitions?
For non-partitioned tables, each worker loads a separate range of rows
via its own connection in parallel.

> Copying seems to be divided into chunks per COPY_BATCH_SIZE. Is it really
> essential to parallelize the initialization? I feel it may optimize even
> serialized case thus can be discussed independently.
You're right that the COPY batching is an independent optimization. I
wanted to see how fast I could get this patch, so I looked for
bottlenecks in the new code with a profiler, and this was one of them. I
agree it makes sense to apply it to the serial case separately.

> Per my understanding, each thread creates its tables, and all of them are
> attached to the parent table. Is it right? I think it needs more code
> changes, and I am not sure it is critical to make initialization faster.
Yes, that's correct. Each worker creates its assigned partitions as
standalone tables, loads data into them, and then the main thread
attaches them all to the parent after loading completes. This avoids
AccessExclusiveLock contention on the parent table during parallel
loading and allows each worker to use COPY FREEZE on its standalone table.

> So I suggest using the incremental approach. The first patch only
> parallelizes
> the data load, and the second patch implements the CREATE TABLE and
> ALTER TABLE
> ATTACH PARTITION. You can benchmark three patterns, master, 0001, and
> 0001 + 0002, then compare the results. IIUC, this is the common
> approach to
> reduce the patch size and make them more reviewable.

Thanks for the recommendation, I extracted 0001 and 0002 as per your
suggestion. I will see if I can split it more, as indeed it helps with
the review.

Results are similar to the previous runs.

master

pgbench -i -s 100 -j 10
done in 20.95 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 14.51 s, vacuum 0.27 s, primary keys 6.16 s).

pgbench -i -s 100 -j 10 --partitions=10
done in 29.73 s (drop tables 0.00 s, create tables 0.02 s, client-side
generate 16.33 s, vacuum 8.72 s, primary keys 4.67 s).


0001
pgbench -i -s 100 -j 10
done in 18.75 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 6.51 s, vacuum 5.73 s, primary keys 6.50 s).

pgbench -i -s 100 -j 10 --partitions=10
done in 29.33 s (drop tables 0.00 s, create tables 0.02 s, client-side
generate 16.48 s, vacuum 7.59 s, primary keys 5.24 s).

0002
pgbench -i -s 100 -j 10
done in 18.12 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 6.64 s, vacuum 5.81 s, primary keys 5.65 s).

pgbench -i -s 100 -j 10 --partitions=10
done in 14.38 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 7.97 s, vacuum 1.55 s, primary keys 4.85 s).


Looking forward to your feedback.

[1]:
https://www.postgresql.org/message-id/CAD21AoAEfCNv-GgaDheDJ%2Bs-p_Lv1H24AiJeNoPGCmZNSwL1YA%40mail.gmail.com

--
Thanks,
Mircea Cadariu
