Re: Parallel copy - Mailing list pgsql-hackers

From Bharath Rupireddy
Subject Re: Parallel copy
Date
Msg-id CALj2ACWwSEELvf9H2EsuDS8=mfYmaODCJhGJLfCuBXxu6076Zg@mail.gmail.com
In response to Re: Parallel copy  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Parallel copy  (Amit Kapila <amit.kapila16@gmail.com>)
Re: Parallel copy  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
List pgsql-hackers
On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> From the testing perspective,
> 1. Test by having something force_parallel_mode = regress which means
> that all existing Copy tests in the regression will be executed via
> new worker code. You can have this as a test-only patch for now and
> make sure all existing tests passed with this.
>

I don't think all the existing COPY test cases (except the new test cases added in the parallel copy patch set) would run inside a parallel worker even with force_parallel_mode on. This is because parallelism is picked up for parallel copy only when the PARALLEL option is specified, unlike parallelism for SELECT queries.
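To make the distinction concrete, here is a minimal sketch of the two invocations. The PARALLEL option shape follows this thread's proposed patches, not committed syntax, and the table/file names are illustrative:

```sql
-- Parallel copy is chosen only when the PARALLEL option is given
-- (option name and placement are per the patch set in this thread):
COPY lineitem FROM '/tmp/lineitem.csv' WITH (FORMAT csv, PARALLEL 4);

-- Without the PARALLEL option, COPY runs serially even when
-- force_parallel_mode = regress is set:
COPY lineitem FROM '/tmp/lineitem.csv' WITH (FORMAT csv);
```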

Anyway, I ran with force_parallel_mode set to on and to regress. All COPY-related tests and make check/make check-world passed.

>
> 2. Do we have tests for toast tables? I think if you implement the
> previous point some existing tests might cover it but I feel we should
> have at least one or two tests for the same.
>

Toast table use case 1: 10000 tuples, 9.6GB data, 3 indexes (2 on integer columns, 1 on a text column that is not the toast column), csv file, each row > 1320KB:
(222.767, 0, 1X), (134.171, 1, 1.66X), (93.749, 2, 2.38X), (93.672, 4, 2.38X), (94.827, 8, 2.35X), (93.766, 16, 2.37X), (98.153, 20, 2.27X), (122.721, 30, 1.81X)

Toast table use case 2: 100000 tuples, 96GB data, 3 indexes (2 on integer columns, 1 on a text column that is not the toast column), csv file, each row > 1320KB:
(2255.032, 0, 1X), (1358.628, 1, 1.66X), (901.170, 2, 2.5X), (912.743, 4, 2.47X), (988.718, 8, 2.28X), (938.000, 16, 2.4X), (997.556, 20, 2.26X), (1000.586, 30, 2.25X)

Toast table use case 3: 10000 tuples, 9.6GB data, no indexes, binary file, each row > 1320KB:
(136.983, 0, 1X), (136.418, 1, 1X), (81.896, 2, 1.66X), (62.929, 4, 2.16X), (52.311, 8, 2.6X), (40.032, 16, 3.49X), (44.097, 20, 3.09X), (62.310, 30, 2.18X)

In the toast table cases, we could achieve up to 2.5X for csv files and up to 3.5X for binary files. We are analyzing this point and will post an update on our findings soon.

While testing the toast table case with a binary file, I discovered an issue with the earlier v6-0006-Parallel-Copy-For-Binary-Format-Files.patch from [1]. I have fixed it and attached the updated v6-0006 patch here. Note that I'm also attaching patches 1 to 5 from version 6 for completeness; they are unchanged from what Vignesh sent earlier in [1].

>
> 3. Have we checked the code coverage of the newly added code with
> existing tests?
>

So far, we have manually ensured that most of the code paths are covered (see the list of test cases below). But we are also planning to measure code coverage with a tool in the coming days.

Apart from the above tests, I also captured performance measurements on the latest v6 patch set.

Use case 1: 10 million rows, 5.2GB data, 2 indexes on integer columns, 1 index on a text column, csv file
(1168.484, 0, 1X), (1116.442, 1, 1.05X), (641.272, 2, 1.82X), (338.963, 4, 3.45X), (202.914, 8, 5.76X), (139.884, 16, 8.35X), (128.955, 20, 9.06X), (131.898, 30, 8.86X)

Use case 2: 10 million rows, 5.2GB data, 2 indexes on integer columns, 1 index on a text column, binary file
(1097.83, 0, 1X), (1095.735, 1, 1.002X), (625.610, 2, 1.75X), (319.833, 4, 3.43X), (186.908, 8, 5.87X), (132.115, 16, 8.31X), (128.854, 20, 8.52X), (134.965, 30, 8.13X)

Use case 3: 10 million rows, 5.2GB data, 3 indexes on integer columns, csv file
(218.227, 0, 1X), (182.815, 1, 1.19X), (135.500, 2, 1.61X), (113.954, 4, 1.91X), (106.243, 8, 2.05X), (101.222, 16, 2.15X), (100.378, 20, 2.17X), (100.351, 30, 2.17X)

All the above tests were performed on the latest v6 patch set (attached here in this thread) with the custom postgresql.conf settings in [2]. The results are triplets of the form (exec time in seconds, number of workers, gain over serial execution).
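For clarity, the gain figure in each triplet is just the serial (0-worker) execution time divided by the parallel execution time; e.g. for use case 1 with 20 workers:

```sql
-- 1168.484 s serial vs 128.955 s with 20 workers:
SELECT round(1168.484 / 128.955, 2) AS gain;  -- 9.06, i.e. the 9.06X above
```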

Overall, we have the test cases below to cover the code and for performance measurements. We plan to run these tests whenever a new set of patches is posted.

1. csv
2. binary
3. force parallel mode = regress
4. toast data csv and binary
5. foreign key check, before row, after row, before statement, after statement, instead of triggers
6. partition case
7. foreign partitions and partitions having trigger cases
8. where clause having parallel unsafe and safe expression, default parallel unsafe and safe expression
9. temp, global, local, unlogged, inherited tables cases, foreign tables
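As an illustration of test case 8, a minimal sketch of exercising parallel-safety of a WHERE clause expression (the function, table, and file names are hypothetical; the PARALLEL option is the patch set's proposed syntax, and one would expect the unsafe case to fall back to serial copy):

```sql
-- A parallel-unsafe function in the WHERE clause should prevent
-- workers from being used for the copy:
CREATE FUNCTION unsafe_check(i int) RETURNS bool
    LANGUAGE plpgsql PARALLEL UNSAFE
    AS $$ BEGIN RETURN i > 0; END $$;

COPY target_tbl FROM '/tmp/data.csv'
    WITH (FORMAT csv, PARALLEL 4)
    WHERE unsafe_check(a);
```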

[1] https://www.postgresql.org/message-id/CALDaNm29DJKy0-vozs8eeBRf2u3rbvPdZHCocrd0VjoWHS7h5A%40mail.gmail.com
[2]
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Attachment
