Re: Parallel copy - Mailing list pgsql-hackers

From Greg Nancarrow
Subject Re: Parallel copy
Date
Msg-id CAJcOf-fea8MrCV+2M=w8jMjp3L7UwJWtEyFsRXV5-gMHxr89CQ@mail.gmail.com
In response to Re: Parallel copy  (vignesh C <vignesh21@gmail.com>)
List pgsql-hackers
> On Wed, Sep 2, 2020 at 3:40 PM vignesh C <vignesh21@gmail.com> wrote:
> I have attached the scripts that I used for the test results I
> mentioned in my previous mail. create.sql file has the table that I
> used, insert_data_gen.txt has the insert data generation scripts. I
> varied the count in insert_data_gen to generate csv files of 1GB, 2GB
> & 5GB & varied the data to generate 1 char, 10 char & 100 char for
> each column for various testing. You can rename insert_data_gen.txt to
> insert_data_gen.sh & generate the csv file.


Hi Vignesh,

I used your script and table definition, multiplying the number of
records to produce 5GB and 9.5GB CSV files.
I got the following results (the load factor is the normal copy
duration divided by the parallel copy duration):


(1) Postgres default settings, 5GB CSV (530000 rows):

Copy Type            Duration (s)   Load factor
===============================================
Normal Copy          132.197         -

Parallel Copy
(#workers)
1                    98.428          1.34
2                    52.753          2.51
3                    37.630          3.51
4                    33.554          3.94
5                    33.636          3.93
6                    33.821          3.91
7                    34.270          3.86
8                    34.465          3.84
9                    34.315          3.85
10                   33.543          3.94
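
For reference, each test was a single COPY FROM of the generated csv
file; the commands were roughly of the following form (table and file
names here are placeholders, and PARALLEL is the option added by the
patch in this thread):

-- normal copy
COPY test_tbl FROM '/path/to/data_5gb.csv' WITH (FORMAT csv);

-- parallel copy using 4 workers
COPY test_tbl FROM '/path/to/data_5gb.csv' WITH (FORMAT csv, PARALLEL 4);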


(2) Postgres increased resources, 5GB CSV (530000 rows):

shared_buffers = 20% of RAM (total RAM = 376GB) = 76GB
work_mem = 10% of RAM = 38GB
maintenance_work_mem = 10% of RAM = 38GB
max_worker_processes = 16
max_parallel_workers = 16
checkpoint_timeout = 30min
max_wal_size = 2GB
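
(For reference, one way to apply these settings is via ALTER SYSTEM as
below; note that shared_buffers and max_worker_processes only take
effect after a server restart, while the others apply after a reload
or in new sessions.)

ALTER SYSTEM SET shared_buffers = '76GB';
ALTER SYSTEM SET work_mem = '38GB';
ALTER SYSTEM SET maintenance_work_mem = '38GB';
ALTER SYSTEM SET max_worker_processes = 16;
ALTER SYSTEM SET max_parallel_workers = 16;
ALTER SYSTEM SET checkpoint_timeout = '30min';
ALTER SYSTEM SET max_wal_size = '2GB';
SELECT pg_reload_conf();  -- shared_buffers/max_worker_processes still need a restart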


Copy Type            Duration (s)   Load factor
===============================================
Normal Copy          131.835         -

Parallel Copy
(#workers)
1                    98.301          1.34
2                    53.261          2.48
3                    37.868          3.48
4                    34.224          3.85
5                    33.831          3.90
6                    34.229          3.85
7                    34.512          3.82
8                    34.303          3.84
9                    34.690          3.80
10                   34.479          3.82



(3) Postgres default settings, 9.5GB CSV (1000000 rows):

Copy Type            Duration (s)   Load factor
===============================================
Normal Copy          248.503         -

Parallel Copy
(#workers)
1                    185.724         1.34
2                    99.832          2.49
3                    70.560          3.52
4                    63.328          3.92
5                    63.182          3.93
6                    64.108          3.88
7                    64.131          3.87
8                    64.350          3.86
9                    64.293          3.87
10                   63.818          3.89


(4) Postgres increased resources, 9.5GB CSV (1000000 rows):

shared_buffers = 20% of RAM (total RAM = 376GB) = 76GB
work_mem = 10% of RAM = 38GB
maintenance_work_mem = 10% of RAM = 38GB
max_worker_processes = 16
max_parallel_workers = 16
checkpoint_timeout = 30min
max_wal_size = 2GB


Copy Type            Duration (s)   Load factor
===============================================
Normal Copy          248.647        -

Parallel Copy
(#workers)
1                    182.236        1.36
2                    92.814         2.68
3                    67.347         3.69
4                    63.839         3.89
5                    62.672         3.97
6                    63.873         3.89
7                    64.930         3.83
8                    63.885         3.89
9                    62.397         3.98
10                   64.477         3.86



So, as you found, with this particular table definition and data, 1
parallel worker always performs better than normal copy.
The different result I obtained in my earlier tests seems to have been
caused by the following factors:
- a different table definition (I used a variety of column types)
- the amount of data per row (I used less data per row, so more rows
for the same size data file)

As I previously observed, if the target table has no indexes,
increasing resources beyond the default settings makes little
difference to the performance.

Regards,
Greg Nancarrow
Fujitsu Australia


