Re: Parallel copy - Mailing list pgsql-hackers

From vignesh C
Subject Re: Parallel copy
Date
Msg-id CALDaNm3NNEmFTzJM1hiOsKrxrRecWp=sLzcG4fqHZwf+WDP1_A@mail.gmail.com
In response to Re: Parallel copy  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Wed, Feb 26, 2020 at 4:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Feb 25, 2020 at 9:30 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> >
> > On Sun, Feb 23, 2020 at 05:09:51PM -0800, Andres Freund wrote:
> > >Hi,
> > >
> > >> The one piece of information I'm missing here is at least a very rough
> > >> quantification of the individual steps of CSV processing - for example
> > >> if parsing takes only 10% of the time, it's pretty pointless to start by
> > >> parallelising this part and we should focus on the rest. If it's 50% it
> > >> might be a different story. Has anyone done any measurements?
> > >
> > >Not recently, but I'm pretty sure that I've observed CSV parsing to be
> > >way more than 10%.
> > >
> >
> > Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
> > so I still think we need to do some measurements first.
> >
>
> Agreed.
>
> > I'm willing to
> > do that, but (a) I doubt I'll have time for that until after 2020-03,
> > and (b) it'd be good to agree on some set of typical CSV files.
> >
>
> Right, I don't know what is the best way to define that.  I can think
> of the below tests.
>
> 1. A table with 10 columns (with datatypes as integers, date, text).
> It has one index (unique/primary). Load with 1 million rows (basically
> the data should be probably 5-10 GB).
> 2. A table with 10 columns (with datatypes as integers, date, text).
> It has three indexes, one index can be (unique/primary). Load with 1
> million rows (basically the data should be probably 5-10 GB).
> 3. A table with 10 columns (with datatypes as integers, date, text).
> It has three indexes, one index can be (unique/primary). It has before
> and after triggers. Load with 1 million rows (basically the data
> should be probably 5-10 GB).
> 4. A table with 10 columns (with datatypes as integers, date, text).
> It has five or six indexes, one index can be (unique/primary). Load
> with 1 million rows (basically the data should be probably 5-10 GB).
>

I have tried to capture the execution time taken for 3 scenarios which I felt could give a fair idea:
Test1 (Table with 3 indexes and 1 trigger)
Test2 (Table with 2 indexes)
Test3 (Table without indexes/triggers)

I have captured the following details:
File Read Time - time taken to read the file in the CopyGetData function.
Read Line Time - time taken to read a line in the NextCopyFrom function (read time & tokenize time excluded).
Tokenize Time - time taken to tokenize the contents in the NextCopyFromRawFields function.
Data Execution Time - the remaining execution time out of the total time.
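
For reference, below is a minimal sketch of the kind of instrumentation that can accumulate such per-phase timings inside the backend using the instr_time.h macros. The function and variable names are hypothetical and this is not the exact instrumentation I used:

#include "postgres.h"
#include "portability/instr_time.h"

/* hypothetical accumulator for one phase, e.g. the "read line" phase */
static instr_time read_line_total;

/* call once before the COPY starts */
static void
example_timers_init(void)
{
    INSTR_TIME_SET_ZERO(read_line_total);
}

/* wrap the code of the phase being measured with a start/end pair */
static void
example_measure_read_line(void)
{
    instr_time  start;
    instr_time  end;

    INSTR_TIME_SET_CURRENT(start);

    /* ... the line-reading code called from NextCopyFrom runs here ... */

    INSTR_TIME_SET_CURRENT(end);
    INSTR_TIME_ACCUM_DIFF(read_line_total, end, start);
}

/* report once the COPY finishes; "data execution" is total minus the phases */
static void
example_timers_report(void)
{
    elog(LOG, "read line total: %.3f s",
         INSTR_TIME_GET_DOUBLE(read_line_total));
}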

The execution breakdown for the tests is given below:
Test  | Total Time | File Read Time | Read Line / Buffer Read Time | Tokenize Time | Data Execution Time
Test1 |   1693.369 |          0.256 |                       34.173 |         5.578 |            1653.362
Test2 |    736.096 |          0.288 |                       39.762 |         6.525 |             689.521
Test3 |    112.060 |          0.266 |                       39.189 |         6.433 |              66.172
(all times in seconds)

Steps for the scenarios:
Test1(Table with 3 indexes and 1 trigger):
CREATE TABLE census2 (year int,age int,ethnic int,sex int,area text,count text);
CREATE TABLE census3(year int,age int,ethnic int,sex int,area text,count text);

CREATE INDEX idx1_census2 on census2(year);
CREATE INDEX idx2_census2 on census2(age);
CREATE INDEX idx3_census2 on census2(ethnic);

CREATE or REPLACE FUNCTION census2_afterinsert()
RETURNS TRIGGER
AS $$
BEGIN
  INSERT INTO census3 SELECT * FROM census2 LIMIT 1;
  RETURN NEW;
END;
$$
LANGUAGE plpgsql;

CREATE TRIGGER census2_trigger AFTER INSERT  ON census2 FOR EACH ROW EXECUTE PROCEDURE census2_afterinsert();
COPY census2 FROM 'Data8277.csv' WITH (FORMAT 'csv', HEADER true);

Test2 (Table with 2 indexes):
CREATE TABLE census1 (year int,age int,ethnic int,sex int,area text,count text);
CREATE INDEX idx1_census1 on census1(year);
CREATE INDEX idx2_census1 on census1(age);
COPY census1 FROM 'Data8277.csv' WITH (FORMAT 'csv', HEADER true);

Test3 (Table without indexes/triggers):
CREATE TABLE census (year int,age int,ethnic int,sex int,area text,count text);
COPY census FROM 'Data8277.csv' WITH (FORMAT 'csv', HEADER true);

Note: The Data8277.csv used was the same data that Ants Aasma had used.

From the above results we can infer that the read-line phase will have to be done sequentially. It takes about 2.01%, 5.40% and 34.97% of the total time in the three tests respectively. I feel we will be able to parallelise the remaining phases of the copy. The performance improvement will vary based on the scenario (indexes/triggers); it will be proportionate to the number of indexes and triggers. The read-line phase can also be parallelised for text format (non-CSV). I feel parallelising copy could give a significant improvement in quite a few scenarios.
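
As a rough back-of-the-envelope check (my own sketch; the worker count of 8 is only an assumption), Amdahl's law gives an idea of the ceiling that the sequential read-line phase puts on the achievable speedup:

#include <stdio.h>

int
main(void)
{
    /* "read line" fraction of the total time per test, from the table above */
    double  seq_fraction[] = {0.0201, 0.0540, 0.3497};
    int     workers = 8;    /* assumed number of parallel workers */

    for (int i = 0; i < 3; i++)
    {
        double  f = seq_fraction[i];

        printf("Test%d: ~%.2fx with %d workers, ~%.2fx upper bound\n",
               i + 1,
               1.0 / (f + (1.0 - f) / workers),     /* Amdahl's law */
               workers,
               1.0 / f);                            /* unlimited workers */
    }
    return 0;
}

Even with unlimited workers, for Test3 the sequential read-line phase alone caps the speedup at roughly 2.9x, while Test1 and Test2 have much more headroom.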

Further, I'm planning to see how the execution behaves for toast tables. I'm also planning to run the tests on a RAM disk, where I will place the data on the RAM disk so that we can further eliminate the I/O cost.

Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
