Re: Parallel copy - Mailing list pgsql-hackers

From: vignesh C
Subject: Re: Parallel copy
Msg-id: CALDaNm0Y_PZvB7jO_gFd2C1nmN0_kSFFb+gA+DqrekPidN-pnQ@mail.gmail.com
In response to: Re: Parallel copy (Greg Nancarrow <gregn4422@gmail.com>)
List: pgsql-hackers
On Tue, Sep 1, 2020 at 3:39 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> Hi Vignesh,
>
> > Can you share with me the script you used to generate the data & the
> > DDL of the table, so that it will help me check the scenario where
> > you faced the problem?
>
> Unfortunately I can't directly share it (considered company IP),
> though having said that it's only doing something that is relatively
> simple and unremarkable, so I'd expect it to be much like what you are
> currently doing. I can describe it in general.
>
> The table being used contains 100 columns (as I pointed out earlier),
> with the first column of "bigserial" type, and the others of different
> types like "character varying(255)", "numeric", "date" and "time
> without time zone". There are about 60 of the "character
> varying(255)" columns overall, with the other types interspersed.
>
> When testing with indexes, 4 b-tree indexes were used, each covering
> the first column plus a distinct set of 9 other columns.
>
> A CSV record (row) template file was created with test data
> (corresponding to the table), and that was simply copied and
> appended over and over, each time with a record-number prefix, to
> create the test data file.
> The following shell script basically does it (but very slowly); I was
> using a small C program to do the same thing, a lot faster.
> In my case, N=2550000 produced about a 5GB CSV file.
>
>     # bash brace expansion ({1..N}) won't expand a variable, so use seq
>     file_out=data.csv; N=2550000
>     for i in $(seq 1 "$N"); do echo -n "$i," >> "$file_out"; cat sample_record.csv >> "$file_out"; done
>
> One other thing I should mention is that between each test run, I
> cleared the OS page cache, as described here:
> https://linuxhint.com/clear_cache_linux/
> That way, each COPY FROM is not taking advantage of any OS-cached data
> from a previous COPY FROM.

I will try with a similar test and check if I can reproduce.
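
For reference, this is the shape of the schema and load I take from
your description (a rough sketch only; the column names are
placeholders, not your actual DDL):

CREATE TABLE copy_test (
    id bigserial,
    c1 character varying(255),
    c2 numeric,
    c3 date,
    c4 time without time zone
    -- ... continuing to 100 columns, with ~60 character varying(255)
    -- columns and the other types interspersed, as you described ...
);

-- Four b-tree indexes, each on the first column plus a distinct set
-- of 9 other columns (abbreviated here):
CREATE INDEX copy_test_idx1 ON copy_test (id, c1, c2, c3, c4);
-- ... similarly copy_test_idx2 .. copy_test_idx4 ...

-- The statement being timed:
COPY copy_test FROM '/path/to/data.csv' WITH (FORMAT csv);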

> If your data is somehow significantly different and you want to (and
> can) share your script, then I can try it in my environment.

I have attached the scripts I used for the test results mentioned in
my previous mail. create.sql has the table that I used, and
insert_data_gen.txt has the insert-data generation script. I varied
the count in insert_data_gen to generate CSV files of 1GB, 2GB & 5GB,
and varied the data to generate 1-char, 10-char & 100-char values for
each column for the various tests. You can rename insert_data_gen.txt
to insert_data_gen.sh & generate the CSV files.
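
The generation itself is nothing special; just to give an idea of the
shape of the output (this is only an illustration, not the attached
script), something like the following produces the same kind of CSV
with 10-character values in every column:

COPY (
    SELECT g,
           repeat('a', 10),
           repeat('a', 10)  -- one expression per remaining column
    FROM generate_series(1, 1000000) AS g
) TO '/tmp/data_10char.csv' WITH (FORMAT csv);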

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
