
[HACKERS] Parallel COPY FROM execution - Mailing list pgsql-hackers

From	Alex K
Subject	[HACKERS] Parallel COPY FROM execution
Date	June 30, 2017 15:23:02
Msg-id	CADfU8Wz28_Xaj3fsHT3tTWiXST5sdQ3Myte9Jjx4hZgwvbwhzw@mail.gmail.com
Responses	Re: [HACKERS] Parallel COPY FROM execution
List	pgsql-hackers


Greetings pgsql-hackers,

I am a GSoC student this year. My initial proposal was discussed
in the following thread:
https://www.postgresql.org/message-id/flat/7179F2FD-49CE-4093-AE14-1B26C5DFB0DA%40gmail.com

The patch adding error handling to COPY FROM seems to be nearly finished,
so I have started thinking about parallelism in COPY FROM, which is the
next point in my proposal.

To find out whether COPY contains expensive calls that could be executed
in parallel, I did some profiling. First, please find attached a flame
graph of the most expensive copy.c calls during 'COPY FROM file'
(copy_from.svg). It shows that the inevitably serial operations, such as
CopyReadLine (<15%) and heap_multi_insert (~15%), take less than 50% of the
time in total, while the remaining operations, such as heap_form_tuple and
the multiple checks inside NextCopyFrom, can probably be executed well in
parallel.

Second, I compared the execution time of 'COPY FROM a single large
file (~300 MB, 50000000 lines)' with 'COPY FROM four equal parts of the
original file, executed in four parallel processes'. Though it is a
very rough test, it gives an overall estimate:

Serial:
real 0m56.571s
user 0m0.005s
sys 0m0.006s

Parallel (x4):
real 0m22.542s
user 0m0.015s
sys 0m0.018s

Thus, four processes give a ~2.5x speedup, i.e. a ~60% performance boost
for each doubling of the number of parallel processes, which is consistent
with the initial estimate.
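
The arithmetic behind this estimate can be checked with a short sketch
(Python is used only for illustration). It takes the two 'real' timings
above and derives the overall speedup and the gain per doubling. The
Amdahl's-law step is an added assumption, not part of the measurement: it
models the run as a fixed serial part plus a perfectly parallel part.

```python
import math

# 'real' timings from the rough test above, in seconds
serial_s = 56.571
parallel_s = 22.542
workers = 4

def speedup(serial, parallel):
    """Overall speedup of the parallel run over the serial run."""
    return serial / parallel

def gain_per_doubling(total_speedup, n_workers):
    """Average relative gain for each doubling of the worker count."""
    doublings = math.log2(n_workers)
    return total_speedup ** (1.0 / doublings) - 1.0

def amdahl_serial_fraction(total_speedup, n_workers):
    """Serial fraction s implied by Amdahl's law: 1/S = s + (1 - s)/n.

    Assumption: the workload splits cleanly into a serial and a
    perfectly parallel part, which a real COPY will only approximate.
    """
    return (1.0 / total_speedup - 1.0 / n_workers) / (1.0 - 1.0 / n_workers)

s = speedup(serial_s, parallel_s)
g = gain_per_doubling(s, workers)
f = amdahl_serial_fraction(s, workers)

print("speedup:            %.2fx" % s)
print("gain per doubling:  %.0f%%" % (g * 100))
print("implied serial part: %.0f%%" % (f * 100))
```

With these timings the speedup is about 2.5x, or roughly 58% per doubling.
Under the Amdahl assumption the implied serial part is about 20%, which
does not contradict the serial shares seen in the flame graph.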

After several discussions I have two possible solutions in mind:

1) Simple solution

Let us focus only on 'COPY FROM file'. Then it is relatively easy to
implement: just give each worker the same file and its own offset.

++ Simple; more reliable solution; it will probably give us the largest
     possible performance boost

-- Limited number of use cases. Though 'COPY FROM file' is a frequent case,
    even when one uses it via psql \copy, the file is actually read on the
    client side and streamed to the backend through stdin, which this
    solution would not cover
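
A minimal sketch of the offset idea, written in Python for illustration
only (the real implementation would live in copy.c). The names
chunk_ranges and read_chunk are hypothetical. It assumes that every
newline byte ends a row, which holds for the text format but not for CSV
with newlines inside quoted fields. Each boundary is moved forward to the
next line start, so no worker begins in the middle of a row.

```python
import os
import tempfile

def chunk_ranges(path, n_workers):
    """Split a file into up to n_workers (start, end) byte ranges.

    Every range starts at the beginning of a line. Assumes '\n' always
    terminates a row (hypothetical helper, not part of PostgreSQL).
    """
    size = os.path.getsize(path)
    if size == 0:
        return []
    boundaries = [0]
    with open(path, "rb") as f:
        for i in range(1, n_workers):
            target = size * i // n_workers
            if target <= boundaries[-1]:
                continue
            # Read to the end of the line that contains byte target - 1;
            # the file position is then the start of the next line.
            f.seek(target - 1)
            f.readline()
            pos = f.tell()
            if boundaries[-1] < pos < size:
                boundaries.append(pos)
    boundaries.append(size)
    return list(zip(boundaries[:-1], boundaries[1:]))

def read_chunk(path, start, end):
    """What one worker would do: read only its own byte range."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start).splitlines()

# Small demonstration on a temporary tab-separated file.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"".join(b"%d\trow\n" % i for i in range(1000)))
    demo_path = tmp.name

ranges = chunk_ranges(demo_path, 4)
rows = [line for start, end in ranges
        for line in read_chunk(demo_path, start, end)]
os.unlink(demo_path)
print(ranges)
```

Reading the ranges back in order reproduces every line of the file exactly
once, which is the property each worker's offset has to guarantee.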

2) True parallelism

Implement a pool of bg_workers and a simple shared buffer/queue. While the
main COPY process reads the input data and puts raw lines into the queue,
parallel bg_workers take lines from there and process them.

++ More general solution; supports various COPY FROM use cases

-- Much more sophisticated solution; probably a smaller performance boost
    compared to 1)
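
The reader/worker split of solution 2) can be sketched as follows. This is
only an illustration under loud assumptions: Python threads and
queue.Queue stand in for bg_workers and a shared-memory queue, and
splitting a line on tabs stands in for the per-row work (parsing,
heap_form_tuple). The function names are hypothetical.

```python
import queue
import threading

SENTINEL = None  # marks the end of input for one worker

def reader(lines, q, n_workers):
    """Main COPY process: push raw lines, then one sentinel per worker."""
    for line in lines:
        q.put(line)  # blocks when the bounded queue is full
    for _ in range(n_workers):
        q.put(SENTINEL)

def worker(q, out, lock):
    """Stand-in for a bg_worker: take raw lines and turn them into rows."""
    while True:
        line = q.get()
        if line is SENTINEL:
            break
        row = tuple(line.rstrip("\n").split("\t"))  # placeholder for parsing
        with lock:
            out.append(row)  # placeholder for the insert step

def parallel_copy(lines, n_workers=4, queue_size=64):
    q = queue.Queue(maxsize=queue_size)
    out, lock = [], threading.Lock()
    threads = [threading.Thread(target=worker, args=(q, out, lock))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    reader(lines, q, n_workers)
    for t in threads:
        t.join()
    return out

result = parallel_copy(["%d\tx\n" % i for i in range(100)])
print(len(result))
```

Note that rows may complete in a different order than they were read, and
that the single reader remains a serial stage, which is one reason this
design may gain less than 1).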

I will be glad to receive any comments and criticism.


Alexey

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Attachment

copy_from.svg
