Thread: Parallel copy
This work is to parallelize the copy command, and in particular the "Copy <table_name> From 'filename' Where <condition>;" command. Before going into how and what portion of 'copy command' processing we can parallelize, let us briefly review the top-level operations we perform while copying from a file into a table. We read the file in 64KB chunks, then find the line endings and process that data line by line, where each line corresponds to one tuple. We first form the tuple (in the form of a value/null array) from that line, check whether it satisfies the WHERE condition and, if so, perform constraint checks and a few other checks, and then finally store it in a local tuple array. Once we have accumulated 1000 tuples or consumed 64KB (whichever comes first), we insert them together via the table_multi_insert API, and then for each tuple we insert into the index(es) and execute after-row triggers.

So we do a lot of work after reading each 64KB chunk, and we can read the next chunk only after all the tuples in the previous chunk are processed. This gives us an opportunity to parallelize the processing of each 64KB chunk. I think we can do this in more than one way.

The first idea is that we allocate each chunk to a worker, and once the worker has finished processing the current chunk, it can start with the next unprocessed chunk. Here, we need to see how to handle the partial tuples at the end or beginning of each chunk. We can read the chunks into dsa/dsm instead of into a local buffer for processing. Alternatively, if we think that accessing shared memory can be costly, we can read the entire chunk into local memory but copy the partial tuple at the beginning of a chunk (if any) to a dsa; we mainly need the partial tuple in the shared memory area. The worker which has found the initial part of the partial tuple will be responsible for processing the entire tuple. Now, to detect whether there is a partial tuple at the beginning of a chunk, we always start reading one byte prior to the start of the current chunk, and if that byte is not a line-terminating byte, we know the first line is a partial tuple. While processing the chunk, we will then ignore this first line and start after the first line terminator.

To connect the partial tuple in two consecutive chunks, we need another data structure (for ease of reference in this email, I call it CTM (chunk-tuple-map)) in shared memory, where we store some per-chunk information like the chunk number, the dsa location of that chunk, and a variable which indicates whether we can free/reuse the current entry. Whenever we encounter a partial tuple at the beginning of a chunk, we note down its chunk number and dsa location in the CTM. Then, whenever we encounter a partial tuple at the end of a chunk, we search the CTM for the next chunk number and read from the corresponding dsa location until we encounter a line-terminating byte. Once we have read and processed this partial tuple, we can mark the entry as available for reuse. There are some loose ends here, like how many entries we should allocate in this data structure. It depends on whether we want to allow a worker to start reading the next chunk before the partial tuple of the previous chunk is processed. To keep it simple, we can allow the worker to process the next chunk only when the partial tuple in the previous chunk is processed. This allows us to keep the number of entries in the CTM equal to the number of workers.
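To make the CTM idea concrete, here is a minimal sketch of what an entry could look like. All names, the entry count, and the use of a plain offset in place of a real dsa_pointer are illustrative assumptions, not part of any actual patch:

    /*
     * Hypothetical chunk-tuple-map (CTM); everything here is illustrative.
     * In PostgreSQL this would live in a DSM segment, the chunk data in DSA,
     * and the array would be protected by an LWLock or per-entry spinlocks.
     */
    #include <stdbool.h>
    #include <stdint.h>

    #define CTM_ENTRIES 8               /* assumption: one entry per worker */

    typedef struct CtmEntry
    {
        uint64_t    chunk_number;       /* which 64KB chunk this describes */
        uint64_t    chunk_dsa_offset;   /* stand-in for a dsa_pointer */
        bool        reusable;           /* true once the partial tuple spanning
                                         * into this chunk has been consumed */
    } CtmEntry;

    typedef struct ChunkTupleMap
    {
        CtmEntry    entries[CTM_ENTRIES];
    } ChunkTupleMap;

A worker that finds a partial tuple at the end of its chunk would look up the entry whose chunk_number is its own chunk number plus one, read from that chunk's data until the first line terminator, and then set reusable to true.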
I think we can easily improve this if we want, but I don't think it will matter too much: in most cases, by the time we have processed the tuples in a chunk, its partial tuple will already have been consumed by the other worker.

Another approach that came up during an offlist discussion with Robert is that we have one dedicated worker for reading the chunks from the file; it copies the complete tuples of one chunk into shared memory and, once that is done, hands over that chunk to another worker which can process the tuples in that area. We can imagine that the reader worker is responsible for forming some sort of work queue that can be processed by the other workers. In this idea, we won't be able to get the benefit of initial tokenization (forming tuple boundaries) via parallel workers, and we might need some additional memory processing: after the reader worker has handed over the initial shared memory segment, we need to somehow identify tuple boundaries and then process them.

Another thing we need to figure out is how many workers to use for the copy command. I think we can base it on the file size, which needs some experiments, or maybe on user input.

I think we have two related problems to solve for this: (a) the relation extension lock (required for extending the relation), which won't conflict among workers due to group locking; we are working on a solution for this in another thread [1], and (b) the use of page locks in GIN indexes; we can probably disallow parallelism if the table has a GIN index, which is not a great thing, but not bad either.

To be clear, this work is for PG14.

Thoughts?

[1] - https://www.postgresql.org/message-id/CAD21AoCmT3cFQUN4aVvzy5chw7DuzXrJCbrjTU05B%2BSs%3DGn1LA%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Feb 14, 2020 at 9:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

> This work is to parallelize the copy command, and in particular the "Copy <table_name> From 'filename' Where <condition>;" command.

Nice project, and a great stepping stone towards parallel DML.

> The first idea is that we allocate each chunk to a worker, and once the worker has finished processing the current chunk, it can start with the next unprocessed chunk. Here, we need to see how to handle the partial tuples at the end or beginning of each chunk. We can read the chunks into dsa/dsm instead of into a local buffer for processing. Alternatively, if we think that accessing shared memory can be costly, we can read the entire chunk into local memory but copy the partial tuple at the beginning of a chunk (if any) to a dsa; we mainly need the partial tuple in the shared memory area. The worker which has found the initial part of the partial tuple will be responsible for processing the entire tuple. Now, to detect whether there is a partial tuple at the beginning of a chunk, we always start reading one byte prior to the start of the current chunk, and if that byte is not a line-terminating byte, we know the first line is a partial tuple. While processing the chunk, we will then ignore this first line and start after the first line terminator.

That's quite similar to the approach I took with a parallel file_fdw patch[1], which mostly consisted of parallelising the reading part of copy.c, except that...

> To connect the partial tuple in two consecutive chunks, we need another data structure (for ease of reference in this email, I call it CTM (chunk-tuple-map)) in shared memory, where we store some per-chunk information like the chunk number, the dsa location of that chunk, and a variable which indicates whether we can free/reuse the current entry. Whenever we encounter a partial tuple at the beginning of a chunk, we note down its chunk number and dsa location in the CTM. Then, whenever we encounter a partial tuple at the end of a chunk, we search the CTM for the next chunk number and read from the corresponding dsa location until we encounter a line-terminating byte. Once we have read and processed this partial tuple, we can mark the entry as available for reuse. There are some loose ends here, like how many entries we should allocate in this data structure. It depends on whether we want to allow a worker to start reading the next chunk before the partial tuple of the previous chunk is processed. To keep it simple, we can allow the worker to process the next chunk only when the partial tuple in the previous chunk is processed. This allows us to keep the number of entries in the CTM equal to the number of workers. I think we can easily improve this if we want, but I don't think it will matter too much: in most cases, by the time we have processed the tuples in a chunk, its partial tuple will already have been consumed by the other worker.

... I didn't use a shm 'partial tuple' exchanging mechanism; I just had each worker follow the final tuple in its chunk into the next chunk, and had each worker ignore the first tuple in each chunk after chunk 0, because it knows someone else is looking after that. That means there was some double reading going on near the boundaries, and considering how much I've been complaining about bogus extra system calls on this mailing list lately, yeah, your idea of doing a bit more coordination is a better idea.
If you go this way, you might at least find the copy.c part of the patch I wrote useful as stand-in scaffolding code while you prototype the parallel writing side, if you don't already have something better for this?

> Another approach that came up during an offlist discussion with Robert is that we have one dedicated worker for reading the chunks from the file; it copies the complete tuples of one chunk into shared memory and, once that is done, hands over that chunk to another worker which can process the tuples in that area. We can imagine that the reader worker is responsible for forming some sort of work queue that can be processed by the other workers. In this idea, we won't be able to get the benefit of initial tokenization (forming tuple boundaries) via parallel workers, and we might need some additional memory processing: after the reader worker has handed over the initial shared memory segment, we need to somehow identify tuple boundaries and then process them.

Yeah, I have also wondered about something like this in a slightly different context. For parallel query in general, I wondered if there should be a Parallel Scatter node that can be put on top of any parallel-safe plan; it runs the plan in a worker process that just pushes tuples into a single-producer multi-consumer shm queue, and then other workers read from that whenever they need a tuple. Hmm, but for COPY, I suppose you'd want to push the raw lines, with minimal examination, into a shm queue rather than tuples, so I guess that's a bit different.

> Another thing we need to figure out is how many workers to use for the copy command. I think we can base it on the file size, which needs some experiments, or maybe on user input.

It seems like we don't even really have a general model for that sort of thing in the rest of the system yet, and I guess some kind of fairly dumb explicit system would make sense in the early days...

> Thoughts?

This is cool.

[1] https://www.postgresql.org/message-id/CA%2BhUKGKZu8fpZo0W%3DPOmQEN46kXhLedzqqAnt5iJZy7tD0x6sw%40mail.gmail.com
On Fri, Feb 14, 2020 at 3:36 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Fri, Feb 14, 2020 at 9:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > This work is to parallelize the copy command, and in particular the "Copy <table_name> From 'filename' Where <condition>;" command.
>
> Nice project, and a great stepping stone towards parallel DML.

Thanks.

> > The first idea is that we allocate each chunk to a worker, and once the worker has finished processing the current chunk, it can start with the next unprocessed chunk. Here, we need to see how to handle the partial tuples at the end or beginning of each chunk. We can read the chunks into dsa/dsm instead of into a local buffer for processing. Alternatively, if we think that accessing shared memory can be costly, we can read the entire chunk into local memory but copy the partial tuple at the beginning of a chunk (if any) to a dsa; we mainly need the partial tuple in the shared memory area. The worker which has found the initial part of the partial tuple will be responsible for processing the entire tuple. Now, to detect whether there is a partial tuple at the beginning of a chunk, we always start reading one byte prior to the start of the current chunk, and if that byte is not a line-terminating byte, we know the first line is a partial tuple. While processing the chunk, we will then ignore this first line and start after the first line terminator.
>
> That's quite similar to the approach I took with a parallel file_fdw patch[1], which mostly consisted of parallelising the reading part of copy.c, except that...
>
> > To connect the partial tuple in two consecutive chunks, we need another data structure (for ease of reference in this email, I call it CTM (chunk-tuple-map)) in shared memory, where we store some per-chunk information like the chunk number, the dsa location of that chunk, and a variable which indicates whether we can free/reuse the current entry. Whenever we encounter a partial tuple at the beginning of a chunk, we note down its chunk number and dsa location in the CTM. Then, whenever we encounter a partial tuple at the end of a chunk, we search the CTM for the next chunk number and read from the corresponding dsa location until we encounter a line-terminating byte. Once we have read and processed this partial tuple, we can mark the entry as available for reuse. There are some loose ends here, like how many entries we should allocate in this data structure. It depends on whether we want to allow a worker to start reading the next chunk before the partial tuple of the previous chunk is processed. To keep it simple, we can allow the worker to process the next chunk only when the partial tuple in the previous chunk is processed. This allows us to keep the number of entries in the CTM equal to the number of workers. I think we can easily improve this if we want, but I don't think it will matter too much: in most cases, by the time we have processed the tuples in a chunk, its partial tuple will already have been consumed by the other worker.
>
> ... I didn't use a shm 'partial tuple' exchanging mechanism; I just had each worker follow the final tuple in its chunk into the next chunk, and had each worker ignore the first tuple in each chunk after chunk 0, because it knows someone else is looking after that.
> That means there was some double reading going on near the boundaries,

Right, and especially if the part in the second chunk is bigger, we might need to read most of the second chunk.

> and considering how much I've been complaining about bogus extra system calls on this mailing list lately, yeah, your idea of doing a bit more coordination is a better idea. If you go this way, you might at least find the copy.c part of the patch I wrote useful as stand-in scaffolding code while you prototype the parallel writing side, if you don't already have something better for this?

No, I haven't started writing anything yet, but I have some ideas on how to achieve this. I quickly skimmed through your patch, and I think it can be used as a starting point, though if we decide to go with accumulating the partial tuple, or all the data, in shm, then things might differ.

> > Another approach that came up during an offlist discussion with Robert is that we have one dedicated worker for reading the chunks from the file; it copies the complete tuples of one chunk into shared memory and, once that is done, hands over that chunk to another worker which can process the tuples in that area. We can imagine that the reader worker is responsible for forming some sort of work queue that can be processed by the other workers. In this idea, we won't be able to get the benefit of initial tokenization (forming tuple boundaries) via parallel workers, and we might need some additional memory processing: after the reader worker has handed over the initial shared memory segment, we need to somehow identify tuple boundaries and then process them.
>
> Yeah, I have also wondered about something like this in a slightly different context. For parallel query in general, I wondered if there should be a Parallel Scatter node that can be put on top of any parallel-safe plan; it runs the plan in a worker process that just pushes tuples into a single-producer multi-consumer shm queue, and then other workers read from that whenever they need a tuple.

The idea sounds great, but past experience shows that shoving all the tuples through a queue might add significant overhead. However, I don't know how exactly you are planning to use it.

> Hmm, but for COPY, I suppose you'd want to push the raw lines, with minimal examination, into a shm queue rather than tuples, so I guess that's a bit different.

Yeah.

> > Another thing we need to figure out is how many workers to use for the copy command. I think we can base it on the file size, which needs some experiments, or maybe on user input.
>
> It seems like we don't even really have a general model for that sort of thing in the rest of the system yet, and I guess some kind of fairly dumb explicit system would make sense in the early days...

Makes sense.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, 14 Feb 2020 at 11:57, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Feb 14, 2020 at 3:36 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Fri, Feb 14, 2020 at 9:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
...
> > Another approach that came up during an offlist discussion with Robert
> > is that we have one dedicated worker for reading the chunks from file
> > and it copies the complete tuples of one chunk in the shared memory
> > and once that is done, hands over that chunk to another worker which
> > can process tuples in that area. We can imagine that the reader
> > worker is responsible for forming some sort of work queue that can be
> > processed by the other workers. In this idea, we won't be able to get
> > the benefit of initial tokenization (forming tuple boundaries) via
> > parallel workers and might need some additional memory processing as
> > after reader worker has handed the initial shared memory segment, we
> > need to somehow identify tuple boundaries and then process them.
Parsing rows from the raw input (the work done by CopyReadLine()) in a single process would accommodate line returns in quoted fields. I don't think there's a way of getting parallel workers to manage the in-quote/out-of-quote state required. A single worker could also process a stream without having to reread/rewind, so it would be able to process input from STDIN or PROGRAM sources, making the improvements applicable to load operations done by third-party tools and scripted \copy in psql.
>
...
> > Another thing we need to figure out is how many workers to use for
> > the copy command. I think we can use it based on the file size which
> > needs some experiments or may be based on user input.
>
> It seems like we don't even really have a general model for that sort
> of thing in the rest of the system yet, and I guess some kind of
> fairly dumb explicit system would make sense in the early days...
>
makes sense.
The ratio between chunking or line parsing processes and the parallel worker pool would vary with the width of the table, complexity of the data or file (dates, encoding conversions), complexity of constraints and acceptable impact of the load. Being able to control it through user input would be great.
--
Alastair
On Fri, Feb 14, 2020 at 7:16 PM Alastair Turner <minion@decodable.me> wrote:
>
> On Fri, 14 Feb 2020 at 11:57, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Feb 14, 2020 at 3:36 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > >
> > > On Fri, Feb 14, 2020 at 9:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> ...
>
> > > > Another approach that came up during an offlist discussion with Robert is that we have one dedicated worker for reading the chunks from the file; it copies the complete tuples of one chunk into shared memory and, once that is done, hands over that chunk to another worker which can process the tuples in that area. We can imagine that the reader worker is responsible for forming some sort of work queue that can be processed by the other workers. In this idea, we won't be able to get the benefit of initial tokenization (forming tuple boundaries) via parallel workers, and we might need some additional memory processing: after the reader worker has handed over the initial shared memory segment, we need to somehow identify tuple boundaries and then process them.
>
> Parsing rows from the raw input (the work done by CopyReadLine()) in a single process would accommodate line returns in quoted fields. I don't think there's a way of getting parallel workers to manage the in-quote/out-of-quote state required.

AFAIU, the whole of this in-quote/out-of-quote state is managed inside CopyReadLineText, which will be done by each of the parallel workers, something along the lines of what Thomas did in his patch [1]. Basically, we need to invent a mechanism to allocate chunks to individual workers, and then the whole processing will be done as we are doing now, except for the special handling of partial tuples which I explained in my previous email. Am I missing something here?

> ...
>
> > > > Another thing we need to figure out is how many workers to use for the copy command. I think we can base it on the file size, which needs some experiments, or maybe on user input.
> > >
> > > It seems like we don't even really have a general model for that sort of thing in the rest of the system yet, and I guess some kind of fairly dumb explicit system would make sense in the early days...
> >
> > makes sense.
>
> The ratio between chunking or line-parsing processes and the parallel worker pool would vary with the width of the table, complexity of the data or file (dates, encoding conversions), complexity of constraints and acceptable impact of the load. Being able to control it through user input would be great.

Okay. I think one simple way could be that we compute the number of workers based on file size (some experiments are required to determine this) unless the user has given the input. If the user has provided the input, then we can use that, with an upper limit of max_parallel_workers.

[1] - https://www.postgresql.org/message-id/CA%2BhUKGKZu8fpZo0W%3DPOmQEN46kXhLedzqqAnt5iJZy7tD0x6sw%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
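As a strawman for that computation, it could look something like the sketch below; the function name, thresholds, and scaling constants are all assumptions that would need exactly the experiments mentioned above:

    /* Hypothetical heuristic: pick a worker count from the file size.
     * All constants are made up for illustration. */
    static int
    choose_copy_workers(long file_size, int user_requested,
                        int max_parallel_workers)
    {
        int         nworkers;

        if (user_requested > 0)
            nworkers = user_requested;          /* user input wins */
        else if (file_size < 16L * 1024 * 1024)
            nworkers = 0;                       /* small file: stay serial */
        else
            nworkers = (int) (file_size / (256L * 1024 * 1024)) + 1;

        /* clamp to the system-wide limit */
        if (nworkers > max_parallel_workers)
            nworkers = max_parallel_workers;
        return nworkers;
    }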
On Sat, 15 Feb 2020 at 04:55, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Feb 14, 2020 at 7:16 PM Alastair Turner <minion@decodable.me> wrote:
>
> ...
>
> > Parsing rows from the raw input (the work done by CopyReadLine()) in a single process would accommodate line returns in quoted fields. I don't think there's a way of getting parallel workers to manage the in-quote/out-of-quote state required.
>
> AFAIU, the whole of this in-quote/out-of-quote state is managed inside CopyReadLineText, which will be done by each of the parallel workers, something along the lines of what Thomas did in his patch [1]. Basically, we need to invent a mechanism to allocate chunks to individual workers, and then the whole processing will be done as we are doing now, except for the special handling of partial tuples which I explained in my previous email. Am I missing something here?

The problem case that I see is the chunk boundary falling in the middle of a quoted field where
- The quote opens in chunk 1
- The quote closes in chunk 2
- There is an EoL character between the start of chunk 2 and the closing quote

When the worker processing chunk 2 starts, it believes itself to be in out-of-quote state, so only data between the start of the chunk and the EoL is regarded as belonging to the partial line. From that point on, the parsing of the rest of the chunk goes off track.

Some of the resulting errors can be avoided by, for instance, requiring a quote to be preceded by a delimiter or EoL. That answer fails when fields end with EoL characters, which happens often enough in the wild.

Recovering from an incorrect in-quote/out-of-quote state assumption at the start of parsing a chunk just seems like a hole with no bottom. So it looks to me like it's best done in a single process which can keep track of that state reliably.

--
Alastair
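To make this problem case concrete, consider a hypothetical two-tuple input where the first tuple's quoted field contains an embedded newline and the chunk boundary falls inside that field:

    logical tuples:   1,"alpha<LF>beta"      and      2,plain

    raw stream:       1,"al | pha<LF>beta"<LF>2,plain<LF>
                            ^ chunk boundary

The worker for chunk 2, assuming out-of-quote state, treats only "pha" (up to the first LF) as the tail of chunk 1's partial tuple, and starts parsing a fresh line at beta",... even though the first tuple really ends at the LF after beta". The stray quote then flips the in-quote state, and everything after it is parsed off track.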
On Sat, Feb 15, 2020 at 4:08 PM Alastair Turner <minion@decodable.me> wrote:
>
> On Sat, 15 Feb 2020 at 04:55, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Feb 14, 2020 at 7:16 PM Alastair Turner <minion@decodable.me> wrote:
> >
> > ...
> >
> > > Parsing rows from the raw input (the work done by CopyReadLine()) in a single process would accommodate line returns in quoted fields. I don't think there's a way of getting parallel workers to manage the in-quote/out-of-quote state required.
> >
> > AFAIU, the whole of this in-quote/out-of-quote state is managed inside CopyReadLineText, which will be done by each of the parallel workers, something along the lines of what Thomas did in his patch [1]. Basically, we need to invent a mechanism to allocate chunks to individual workers, and then the whole processing will be done as we are doing now, except for the special handling of partial tuples which I explained in my previous email. Am I missing something here?
>
> The problem case that I see is the chunk boundary falling in the middle of a quoted field where
> - The quote opens in chunk 1
> - The quote closes in chunk 2
> - There is an EoL character between the start of chunk 2 and the closing quote
>
> When the worker processing chunk 2 starts, it believes itself to be in out-of-quote state, so only data between the start of the chunk and the EoL is regarded as belonging to the partial line. From that point on, the parsing of the rest of the chunk goes off track.
>
> Some of the resulting errors can be avoided by, for instance, requiring a quote to be preceded by a delimiter or EoL. That answer fails when fields end with EoL characters, which happens often enough in the wild.
>
> Recovering from an incorrect in-quote/out-of-quote state assumption at the start of parsing a chunk just seems like a hole with no bottom. So it looks to me like it's best done in a single process which can keep track of that state reliably.

Good point, and I agree with you that having a single process would avoid any such stuff. However, I will think some more on it, and if you/anyone else gets an idea on how to deal with this in a multi-worker system (where we can allow each worker to read and process a chunk), then feel free to share your thoughts.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Feb 15, 2020 at 06:02:06PM +0530, Amit Kapila wrote:
> On Sat, Feb 15, 2020 at 4:08 PM Alastair Turner <minion@decodable.me> wrote:
> >
> > On Sat, 15 Feb 2020 at 04:55, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Fri, Feb 14, 2020 at 7:16 PM Alastair Turner <minion@decodable.me> wrote:
> > >
> > > ...
> > >
> > > > Parsing rows from the raw input (the work done by CopyReadLine()) in a single process would accommodate line returns in quoted fields. I don't think there's a way of getting parallel workers to manage the in-quote/out-of-quote state required.
> > >
> > > AFAIU, the whole of this in-quote/out-of-quote state is managed inside CopyReadLineText, which will be done by each of the parallel workers, something along the lines of what Thomas did in his patch [1]. Basically, we need to invent a mechanism to allocate chunks to individual workers, and then the whole processing will be done as we are doing now, except for the special handling of partial tuples which I explained in my previous email. Am I missing something here?
> >
> > The problem case that I see is the chunk boundary falling in the middle of a quoted field where
> > - The quote opens in chunk 1
> > - The quote closes in chunk 2
> > - There is an EoL character between the start of chunk 2 and the closing quote
> >
> > When the worker processing chunk 2 starts, it believes itself to be in out-of-quote state, so only data between the start of the chunk and the EoL is regarded as belonging to the partial line. From that point on, the parsing of the rest of the chunk goes off track.
> >
> > Some of the resulting errors can be avoided by, for instance, requiring a quote to be preceded by a delimiter or EoL. That answer fails when fields end with EoL characters, which happens often enough in the wild.
> >
> > Recovering from an incorrect in-quote/out-of-quote state assumption at the start of parsing a chunk just seems like a hole with no bottom. So it looks to me like it's best done in a single process which can keep track of that state reliably.
>
> Good point, and I agree with you that having a single process would avoid any such stuff. However, I will think some more on it, and if you/anyone else gets an idea on how to deal with this in a multi-worker system (where we can allow each worker to read and process a chunk), then feel free to share your thoughts.

I see two pieces of this puzzle: an input format we control, and the ones we don't. In the former case, we could encode all fields with base85 (or something similar that reduces the input alphabet efficiently), then reserve bytes that denote delimiters of various types. ASCII has separators for file, group, record, and unit that we could use as inspiration.

I don't have anything to offer for free-form input other than to agree that it looks like a hole with no bottom, and maybe we should just keep that process serial, at least until someone finds a bottom.

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778
Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
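As a sketch of what such a controlled format could reserve (illustrative only; these are the standard ASCII control separators, and the safety argument rests on base85's alphabet being purely printable):

    /* ASCII control separators, usable as unambiguous delimiters under the
     * assumption that field payloads are base85-encoded: base85 emits only
     * printable characters, so these bytes can never occur inside a field. */
    #define FILE_SEP    0x1C        /* FS */
    #define GROUP_SEP   0x1D        /* GS */
    #define RECORD_SEP  0x1E        /* RS: row boundary */
    #define UNIT_SEP    0x1F        /* US: field boundary */

With such a format, chunk splitting becomes trivial: every RECORD_SEP byte is a true row boundary, and workers need no quoting state at all.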
On 2/15/20 7:32 AM, Amit Kapila wrote:
> On Sat, Feb 15, 2020 at 4:08 PM Alastair Turner <minion@decodable.me> wrote:
>> On Sat, 15 Feb 2020 at 04:55, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>> On Fri, Feb 14, 2020 at 7:16 PM Alastair Turner <minion@decodable.me> wrote:
>> ...
>>>> Parsing rows from the raw input (the work done by CopyReadLine()) in a single process would accommodate line returns in quoted fields. I don't think there's a way of getting parallel workers to manage the in-quote/out-of-quote state required.
>>>
>>> AFAIU, the whole of this in-quote/out-of-quote state is managed inside CopyReadLineText, which will be done by each of the parallel workers, something along the lines of what Thomas did in his patch [1]. Basically, we need to invent a mechanism to allocate chunks to individual workers, and then the whole processing will be done as we are doing now, except for the special handling of partial tuples which I explained in my previous email. Am I missing something here?
>>
>> The problem case that I see is the chunk boundary falling in the middle of a quoted field where
>> - The quote opens in chunk 1
>> - The quote closes in chunk 2
>> - There is an EoL character between the start of chunk 2 and the closing quote
>>
>> When the worker processing chunk 2 starts, it believes itself to be in out-of-quote state, so only data between the start of the chunk and the EoL is regarded as belonging to the partial line. From that point on, the parsing of the rest of the chunk goes off track.
>>
>> Some of the resulting errors can be avoided by, for instance, requiring a quote to be preceded by a delimiter or EoL. That answer fails when fields end with EoL characters, which happens often enough in the wild.
>>
>> Recovering from an incorrect in-quote/out-of-quote state assumption at the start of parsing a chunk just seems like a hole with no bottom. So it looks to me like it's best done in a single process which can keep track of that state reliably.
>
> Good point, and I agree with you that having a single process would avoid any such stuff. However, I will think some more on it, and if you/anyone else gets an idea on how to deal with this in a multi-worker system (where we can allow each worker to read and process a chunk), then feel free to share your thoughts.

IIRC, in_quote only matters here in CSV mode (because CSV fields can have embedded newlines). So why not just forbid parallel copy in CSV mode, at least for now? I guess it depends on the actual use case. If we expect to be parallel loading humungous CSVs then that won't fly.

cheers

andrew

--
Andrew Dunstan
https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Feb 16, 2020 at 12:21 PM Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
> On 2/15/20 7:32 AM, Amit Kapila wrote:
> > On Sat, Feb 15, 2020 at 4:08 PM Alastair Turner <minion@decodable.me> wrote:
> >>
> >> The problem case that I see is the chunk boundary falling in the middle of a quoted field where
> >> - The quote opens in chunk 1
> >> - The quote closes in chunk 2
> >> - There is an EoL character between the start of chunk 2 and the closing quote
> >>
> >> When the worker processing chunk 2 starts, it believes itself to be in out-of-quote state, so only data between the start of the chunk and the EoL is regarded as belonging to the partial line. From that point on, the parsing of the rest of the chunk goes off track.
> >>
> >> Some of the resulting errors can be avoided by, for instance, requiring a quote to be preceded by a delimiter or EoL. That answer fails when fields end with EoL characters, which happens often enough in the wild.
> >>
> >> Recovering from an incorrect in-quote/out-of-quote state assumption at the start of parsing a chunk just seems like a hole with no bottom. So it looks to me like it's best done in a single process which can keep track of that state reliably.
> >
> > Good point, and I agree with you that having a single process would avoid any such stuff. However, I will think some more on it, and if you/anyone else gets an idea on how to deal with this in a multi-worker system (where we can allow each worker to read and process a chunk), then feel free to share your thoughts.
>
> IIRC, in_quote only matters here in CSV mode (because CSV fields can have embedded newlines).

AFAIU, that is correct.

> So why not just forbid parallel copy in CSV mode, at least for now? I guess it depends on the actual use case. If we expect to be parallel loading humungous CSVs then that won't fly.

I am not sure about this part. However, I guess we should at the very least have some extendable solution that can deal with CSV; otherwise, we might end up re-designing everything if someday we want to deal with CSV. One naive idea is that in CSV mode we can set things up slightly differently: a worker won't start processing a chunk unless the previous chunk is completely parsed. So each worker would first parse and tokenize the entire chunk and then start writing it. This will make the reading/parsing part serialized, but writes can still be parallel. Now, I don't know if it is a good idea to process in a different way for CSV mode.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, 15 Feb 2020 at 14:32, Amit Kapila <amit.kapila16@gmail.com> wrote:
> Good point, and I agree with you that having a single process would avoid any such stuff. However, I will think some more on it, and if you/anyone else gets an idea on how to deal with this in a multi-worker system (where we can allow each worker to read and process a chunk), then feel free to share your thoughts.

I think having a single process handle splitting the input into tuples makes most sense. It's possible to parse CSV at multiple GB/s rates [1]; finding tuple boundaries is a subset of that task.

My first thought for a design would be to have two shared memory ring buffers, one for data and one for tuple start positions. The reader process reads the CSV data into the main buffer, finds tuple start locations in there, and writes those to the secondary buffer.

Worker processes claim a chunk of tuple positions from the secondary buffer and update their "keep this data around" position with the first position. They then proceed to parse and insert the tuples, updating their position until they find the end of the last tuple in the chunk.

Buffer size and maximum and minimum chunk size could be tunable. Ideally the buffers would be at least big enough to absorb one of the workers getting scheduled out for a timeslice, which could be up to tens of megabytes.

Regards,
Ants Aasma

[1] https://github.com/geofflangdale/simdcsv/
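A rough sketch of the shared memory layout this design implies follows; all names and sizes are assumptions for illustration, and a real patch would use pg_atomic_uint64 (or an LWLock) for the position fields rather than plain integers:

    #include <stdint.h>

    #define DATA_BUF_SIZE (8 * 1024 * 1024)   /* raw input bytes (ring) */
    #define POS_BUF_SIZE  (64 * 1024)         /* tuple start offsets (ring) */
    #define MAX_WORKERS   16

    typedef struct ParallelCopyShared
    {
        /* Primary ring: raw CSV bytes, filled by the reader. */
        char        data[DATA_BUF_SIZE];
        uint64_t    data_write_pos;     /* total bytes written by the reader */

        /* Secondary ring: absolute byte offset of each tuple start. */
        uint64_t    tuple_start[POS_BUF_SIZE];
        uint64_t    pos_write_pos;      /* entries published by the reader */
        uint64_t    pos_claim_pos;      /* next entry a worker may claim */

        /*
         * Per-worker "keep this data around" positions.  The reader must not
         * overwrite data bytes at or beyond the minimum of these.
         */
        uint64_t    keep_pos[MAX_WORKERS];
    } ParallelCopyShared;

Positions are monotonically increasing counters; the offset into a ring is simply the position modulo the ring size.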
At Mon, 17 Feb 2020 16:49:22 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> On Sun, Feb 16, 2020 at 12:21 PM Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
> > On 2/15/20 7:32 AM, Amit Kapila wrote:
> > > On Sat, Feb 15, 2020 at 4:08 PM Alastair Turner <minion@decodable.me> wrote:
> >
> > So why not just forbid parallel copy in CSV mode, at least for now? I guess it depends on the actual use case. If we expect to be parallel loading humungous CSVs then that won't fly.
>
> I am not sure about this part. However, I guess we should at the very least have some extendable solution that can deal with CSV; otherwise, we might end up re-designing everything if someday we want to deal with CSV. One naive idea is that in CSV mode we can set things up slightly differently: a worker won't start processing a chunk unless the previous chunk is completely parsed. So each worker would first parse and tokenize the entire chunk and then start writing it. This will make the reading/parsing part serialized, but writes can still be parallel. Now, I don't know if it is a good idea to process in a different way for CSV mode.

In an extreme case, if we didn't see a QUOTE in a chunk, we cannot know whether the chunk is in a quoted section or not until all the past chunks are parsed. In the end we are forced to parse fully sequentially as long as we allow QUOTE.

On the other hand, if we allowed "COPY t FROM f WITH (FORMAT CSV, QUOTE '')" in order to signal that there's no quoted section in the file, then all chunks would be fully concurrently parsable.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, Feb 18, 2020 at 4:04 AM Ants Aasma <ants@cybertec.at> wrote:
> On Sat, 15 Feb 2020 at 14:32, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Good point, and I agree with you that having a single process would avoid any such stuff. However, I will think some more on it, and if you/anyone else gets an idea on how to deal with this in a multi-worker system (where we can allow each worker to read and process a chunk), then feel free to share your thoughts.
>
> I think having a single process handle splitting the input into tuples makes most sense. It's possible to parse CSV at multiple GB/s rates [1]; finding tuple boundaries is a subset of that task.

Yeah, this is compelling. Even though it has to read the file serially, the real gains from parallel COPY should come from doing the real work in parallel: data-type parsing, tuple forming, WHERE clause filtering, partition routing, buffer management, insertion and associated triggers, FKs and index maintenance.

The reason I used the other approach for the file_fdw patch is that I was trying to make it look as much as possible like parallel sequential scan and not create an extra worker, because I didn't feel like an FDW should be allowed to do that (what if executor nodes all over the query tree created worker processes willy-nilly?). Obviously it doesn't work correctly for embedded newlines, and even if you decree that multi-line values aren't allowed in parallel COPY, the stuff about tuples crossing chunk boundaries is still a bit unpleasant (whether solved by double reading as I showed, or by a bunch of tap dancing in shared memory) and creates overheads.

> My first thought for a design would be to have two shared memory ring buffers, one for data and one for tuple start positions. The reader process reads the CSV data into the main buffer, finds tuple start locations in there, and writes those to the secondary buffer.
>
> Worker processes claim a chunk of tuple positions from the secondary buffer and update their "keep this data around" position with the first position. They then proceed to parse and insert the tuples, updating their position until they find the end of the last tuple in the chunk.

+1. That sort of two-queue scheme is exactly how I sketched out a multi-consumer queue for a hypothetical Parallel Scatter node. It probably gets a bit trickier when the payload has to be broken up into fragments to wrap around the "data" buffer N times.
On Tue, 18 Feb 2020 at 04:40, Thomas Munro <thomas.munro@gmail.com> wrote:
> +1. That sort of two-queue scheme is exactly how I sketched out a multi-consumer queue for a hypothetical Parallel Scatter node. It probably gets a bit trickier when the payload has to be broken up into fragments to wrap around the "data" buffer N times.

At least for copy it should be easy enough - it already has to handle reading data block by block. If the worker updates its position while doing so, the reader can wrap around the data buffer. There will be no parallelism while one worker is buffering up a line larger than the data buffer, but that doesn't seem like a major issue. Once the line is buffered and the worker begins inserting, the next worker can start buffering the next tuple.

Regards,
Ants Aasma
On Mon, Feb 17, 2020 at 8:34 PM Ants Aasma <ants@cybertec.at> wrote:
>
> On Sat, 15 Feb 2020 at 14:32, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > Good point, and I agree with you that having a single process would avoid any such stuff. However, I will think some more on it, and if you/anyone else gets an idea on how to deal with this in a multi-worker system (where we can allow each worker to read and process a chunk), then feel free to share your thoughts.
>
> I think having a single process handle splitting the input into tuples makes most sense. It's possible to parse CSV at multiple GB/s rates [1]; finding tuple boundaries is a subset of that task.
>
> My first thought for a design would be to have two shared memory ring buffers, one for data and one for tuple start positions. The reader process reads the CSV data into the main buffer, finds tuple start locations in there, and writes those to the secondary buffer.
>
> Worker processes claim a chunk of tuple positions from the secondary buffer and update their "keep this data around" position with the first position. They then proceed to parse and insert the tuples, updating their position until they find the end of the last tuple in the chunk.

This is something similar to what I also had in mind for this idea. I had thought of handing over a complete chunk (64K or whatever we decide). The one thing that slightly bothers me is that we will add some additional overhead of copying to and from shared memory, which was earlier from local process memory. And the tokenization (finding line boundaries) would be serial. I think tokenization should be a small part of the overall work we do during the copy operation, but I will do some measurements to ascertain the same.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Feb 18, 2020 at 7:28 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
>
> At Mon, 17 Feb 2020 16:49:22 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> > On Sun, Feb 16, 2020 at 12:21 PM Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
> > > On 2/15/20 7:32 AM, Amit Kapila wrote:
> > > > On Sat, Feb 15, 2020 at 4:08 PM Alastair Turner <minion@decodable.me> wrote:
> > >
> > > So why not just forbid parallel copy in CSV mode, at least for now? I guess it depends on the actual use case. If we expect to be parallel loading humungous CSVs then that won't fly.
> >
> > I am not sure about this part. However, I guess we should at the very least have some extendable solution that can deal with CSV; otherwise, we might end up re-designing everything if someday we want to deal with CSV. One naive idea is that in CSV mode we can set things up slightly differently: a worker won't start processing a chunk unless the previous chunk is completely parsed. So each worker would first parse and tokenize the entire chunk and then start writing it. This will make the reading/parsing part serialized, but writes can still be parallel. Now, I don't know if it is a good idea to process in a different way for CSV mode.
>
> In an extreme case, if we didn't see a QUOTE in a chunk, we cannot know whether the chunk is in a quoted section or not until all the past chunks are parsed. In the end we are forced to parse fully sequentially as long as we allow QUOTE.

Right. I think the benefits of this as compared to the single-reader idea would be (a) we can save accessing shared memory for most of the chunk, and (b) for non-CSV mode, even the tokenization (finding line boundaries) would also be parallel. OTOH, doing the processing differently for CSV and non-CSV mode might not be good.

> On the other hand, if we allowed "COPY t FROM f WITH (FORMAT CSV, QUOTE '')" in order to signal that there's no quoted section in the file, then all chunks would be fully concurrently parsable.

Yeah, if we can provide such an option, we can probably make parallel CSV processing equivalent to non-CSV. However, users might not like this, as I think in some cases it won't be easy for them to tell whether the file has quoted fields or not. I am not very sure of this point.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
At Tue, 18 Feb 2020 15:59:36 +0530, Amit Kapila <amit.kapila16@gmail.com> wrote in
> On Tue, Feb 18, 2020 at 7:28 AM Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote:
> >
> > In an extreme case, if we didn't see a QUOTE in a chunk, we cannot know whether the chunk is in a quoted section or not until all the past chunks are parsed. In the end we are forced to parse fully sequentially as long as we allow QUOTE.
>
> Right. I think the benefits of this as compared to the single-reader idea would be (a) we can save accessing shared memory for most of the chunk, and (b) for non-CSV mode, even the tokenization (finding line boundaries) would also be parallel. OTOH, doing the processing differently for CSV and non-CSV mode might not be good.

Agreed. So I think it's a good point of compromise.

> > On the other hand, if we allowed "COPY t FROM f WITH (FORMAT CSV, QUOTE '')" in order to signal that there's no quoted section in the file, then all chunks would be fully concurrently parsable.
>
> Yeah, if we can provide such an option, we can probably make parallel CSV processing equivalent to non-CSV. However, users might not like this, as I think in some cases it won't be easy for them to tell whether the file has quoted fields or not. I am not very sure of this point.

I'm not sure how large a portion of the usage contains quoted sections, so I'm not sure how useful it is.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
On Tue, 18 Feb 2020 at 12:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
> This is something similar to what I also had in mind for this idea. I had thought of handing over a complete chunk (64K or whatever we decide). The one thing that slightly bothers me is that we will add some additional overhead of copying to and from shared memory, which was earlier from local process memory. And the tokenization (finding line boundaries) would be serial. I think tokenization should be a small part of the overall work we do during the copy operation, but I will do some measurements to ascertain the same.

I don't think any extra copying is needed. The reader can directly fread()/pq_copymsgbytes() into shared memory, and the workers can run the CopyReadLineText() inner loop directly off of the buffer in shared memory.

For serial performance of tokenization into lines, I really think a SIMD based approach will be fast enough for quite some time. I hacked up the code in the simdcsv project to only tokenize on line endings, and it was able to tokenize a CSV file with short lines at 8+ GB/s. There are going to be many other bottlenecks before this one starts limiting. Patch attached if you'd like to try that out.

Regards,
Ants Aasma
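As a portable scalar illustration of what the tokenization step amounts to (a stand-in sketch using memchr, not simdcsv's actual code; quote-state handling is assumed to be tracked by the caller or irrelevant for the input):

    #include <stddef.h>
    #include <string.h>

    /* Record the offset of every '\n' in buf[0..len) into out[]; returns the
     * number of line endings found. A missing trailing newline simply means
     * the buffer ends with a partial line. */
    static size_t
    find_line_endings(const char *buf, size_t len, size_t *out, size_t out_cap)
    {
        size_t      n = 0;
        const char *p = buf;
        const char *end = buf + len;

        while (n < out_cap && p < end)
        {
            const char *nl = memchr(p, '\n', (size_t) (end - p));

            if (nl == NULL)
                break;                  /* partial line at the end */
            out[n++] = (size_t) (nl - buf);
            p = nl + 1;                 /* continue after this newline */
        }
        return n;
    }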
On Tue, Feb 18, 2020 at 5:59 PM Ants Aasma <ants@cybertec.at> wrote:
>
> On Tue, 18 Feb 2020 at 12:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > This is something similar to what I also had in mind for this idea. I had thought of handing over a complete chunk (64K or whatever we decide). The one thing that slightly bothers me is that we will add some additional overhead of copying to and from shared memory, which was earlier from local process memory. And the tokenization (finding line boundaries) would be serial. I think tokenization should be a small part of the overall work we do during the copy operation, but I will do some measurements to ascertain the same.
>
> I don't think any extra copying is needed.

I am talking about access to shared memory instead of the process-local memory. I understand that an extra copy won't be required.

> The reader can directly fread()/pq_copymsgbytes() into shared memory, and the workers can run the CopyReadLineText() inner loop directly off of the buffer in shared memory.

I am slightly confused here. AFAIU, the for(;;) loop in CopyReadLineText is about finding the line endings, which we thought the reader process would do.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sun, Feb 16, 2020 at 12:51 AM Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
> IIRC, in_quote only matters here in CSV mode (because CSV fields can
> have embedded newlines). So why not just forbid parallel copy in CSV
> mode, at least for now? I guess it depends on the actual use case. If we
> expect to be parallel loading humungous CSVs then that won't fly.
Loading large CSV files is pretty common here. I hope this can be supported.
On Tue, 18 Feb 2020 at 15:21, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Feb 18, 2020 at 5:59 PM Ants Aasma <ants@cybertec.at> wrote:
> >
> > On Tue, 18 Feb 2020 at 12:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > This is something similar to what I also had in mind for this idea. I had thought of handing over a complete chunk (64K or whatever we decide). The one thing that slightly bothers me is that we will add some additional overhead of copying to and from shared memory, which was earlier from local process memory. And the tokenization (finding line boundaries) would be serial. I think tokenization should be a small part of the overall work we do during the copy operation, but I will do some measurements to ascertain the same.
> >
> > I don't think any extra copying is needed.
>
> I am talking about access to shared memory instead of the process-local memory. I understand that an extra copy won't be required.
>
> > The reader can directly fread()/pq_copymsgbytes() into shared memory, and the workers can run the CopyReadLineText() inner loop directly off of the buffer in shared memory.
>
> I am slightly confused here. AFAIU, the for(;;) loop in CopyReadLineText is about finding the line endings, which we thought the reader process would do.

Indeed, I somehow misread the code while scanning over it. So CopyReadLineText currently copies data from cstate->raw_buf to the StringInfo in cstate->line_buf. In parallel mode it would copy it from the shared data buffer to the local line_buf until it hits the line end found by the data reader. The amount of copying done is still exactly the same as it is now.

Regards,
Ants Aasma
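For illustration, that worker-side copy might look roughly like the sketch below; ParallelCopyShared and DATA_BUF_SIZE are the hypothetical names from the ring-buffer sketch earlier in the thread, while appendBinaryStringInfo() is the existing stringinfo routine copy.c already uses to fill line_buf:

    #include "postgres.h"
    #include "lib/stringinfo.h"

    /* Hypothetical: copy one line from the shared data ring into a worker's
     * local line_buf, handling wrap-around of the ring. */
    static void
    copy_line_to_local(ParallelCopyShared *shared, StringInfo line_buf,
                       uint64 line_start, uint64 line_end)
    {
        uint64      start = line_start % DATA_BUF_SIZE;
        uint64      len = line_end - line_start;

        if (start + len <= DATA_BUF_SIZE)
            appendBinaryStringInfo(line_buf, shared->data + start, (int) len);
        else
        {
            /* The line wraps past the end of the ring: copy two pieces. */
            uint64      first = DATA_BUF_SIZE - start;

            appendBinaryStringInfo(line_buf, shared->data + start, (int) first);
            appendBinaryStringInfo(line_buf, shared->data, (int) (len - first));
        }
    }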
On Tue, Feb 18, 2020 at 06:51:29PM +0530, Amit Kapila wrote:
> On Tue, Feb 18, 2020 at 5:59 PM Ants Aasma <ants@cybertec.at> wrote:
> >
> > On Tue, 18 Feb 2020 at 12:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > This is something similar to what I also had in mind for this idea. I had thought of handing over a complete chunk (64K or whatever we decide). The one thing that slightly bothers me is that we will add some additional overhead of copying to and from shared memory, which was earlier from local process memory. And the tokenization (finding line boundaries) would be serial. I think tokenization should be a small part of the overall work we do during the copy operation, but I will do some measurements to ascertain the same.
> >
> > I don't think any extra copying is needed.
>
> I am talking about access to shared memory instead of the process-local memory. I understand that an extra copy won't be required.

Isn't accessing shared memory from different pieces of execution what threads were designed to do?

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778
Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On Tue, Feb 18, 2020 at 8:41 PM David Fetter <david@fetter.org> wrote:
>
> On Tue, Feb 18, 2020 at 06:51:29PM +0530, Amit Kapila wrote:
> > On Tue, Feb 18, 2020 at 5:59 PM Ants Aasma <ants@cybertec.at> wrote:
> > >
> > > On Tue, 18 Feb 2020 at 12:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > This is something similar to what I also had in mind for this idea. I had thought of handing over a complete chunk (64K or whatever we decide). The one thing that slightly bothers me is that we will add some additional overhead of copying to and from shared memory, which was earlier from local process memory. And the tokenization (finding line boundaries) would be serial. I think tokenization should be a small part of the overall work we do during the copy operation, but I will do some measurements to ascertain the same.
> > >
> > > I don't think any extra copying is needed.
> >
> > I am talking about access to shared memory instead of the process-local memory. I understand that an extra copy won't be required.
>
> Isn't accessing shared memory from different pieces of execution what threads were designed to do?

Sorry, but I don't understand what you mean by the above. We are going to use background workers (which are processes) for the parallel workers. In general, accessing shared memory might not make a big difference compared to local memory, especially because the cost of the other stuff in copy is relatively higher. But still, it is a point to consider.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Feb 18, 2020 at 8:08 PM Ants Aasma <ants@cybertec.at> wrote:
>
> On Tue, 18 Feb 2020 at 15:21, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Tue, Feb 18, 2020 at 5:59 PM Ants Aasma <ants@cybertec.at> wrote:
> > >
> > > On Tue, 18 Feb 2020 at 12:20, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > This is something similar to what I also had in mind for this idea. I had thought of handing over a complete chunk (64K or whatever we decide). The one thing that slightly bothers me is that we will add some additional overhead of copying to and from shared memory, which was earlier from local process memory. And the tokenization (finding line boundaries) would be serial. I think tokenization should be a small part of the overall work we do during the copy operation, but I will do some measurements to ascertain the same.
> > >
> > > I don't think any extra copying is needed.
> >
> > I am talking about access to shared memory instead of the process-local memory. I understand that an extra copy won't be required.
> >
> > > The reader can directly fread()/pq_copymsgbytes() into shared memory, and the workers can run the CopyReadLineText() inner loop directly off of the buffer in shared memory.
> >
> > I am slightly confused here. AFAIU, the for(;;) loop in CopyReadLineText is about finding the line endings, which we thought the reader process would do.
>
> Indeed, I somehow misread the code while scanning over it. So CopyReadLineText currently copies data from cstate->raw_buf to the StringInfo in cstate->line_buf. In parallel mode it would copy it from the shared data buffer to the local line_buf until it hits the line end found by the data reader. The amount of copying done is still exactly the same as it is now.

Yeah, at a broader level it will be something like that, but the actual details might vary during implementation. BTW, have you given any thought to the other approach I shared above [1]? We might not go with that idea, but it is better to discuss different ideas and evaluate their pros and cons.

[1] - https://www.postgresql.org/message-id/CAA4eK1LyAyPCtBk4rkwomeT6%3DyTse5qWws-7i9EFwnUFZhvu5w%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Feb 18, 2020 at 7:51 PM Mike Blackwell <mike.blackwell@rrd.com> wrote:
On Sun, Feb 16, 2020 at 12:51 AM Andrew Dunstan <andrew.dunstan@2ndquadrant.com> wrote:
IIRC, in_quote only matters here in CSV mode (because CSV fields can
have embedded newlines). So why not just forbid parallel copy in CSV
mode, at least for now? I guess it depends on the actual use case. If we
expect to be parallel loading humungous CSVs then that won't fly.

Loading large CSV files is pretty common here. I hope this can be supported.

Thank you for your inputs. They are important and valuable.
On Wed, 19 Feb 2020 at 06:22, Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Feb 18, 2020 at 8:08 PM Ants Aasma <ants@cybertec.at> wrote: > > > > On Tue, 18 Feb 2020 at 15:21, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Tue, Feb 18, 2020 at 5:59 PM Ants Aasma <ants@cybertec.at> wrote: > > > > > > > > On Tue, 18 Feb 2020 at 12:20, Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > This is something similar to what I had also in mind for this idea. I > > > > > had thought of handing over complete chunk (64K or whatever we > > > > > decide). The one thing that slightly bothers me is that we will add > > > > > some additional overhead of copying to and from shared memory which > > > > > was earlier from local process memory. And, the tokenization (finding > > > > > line boundaries) would be serial. I think that tokenization should be > > > > > a small part of the overall work we do during the copy operation, but > > > > > will do some measurements to ascertain the same. > > > > > > > > I don't think any extra copying is needed. > > > > > > > > > > I am talking about access to shared memory instead of the process > > > local memory. I understand that an extra copy won't be required. > > > > > > > The reader can directly > > > > fread()/pq_copymsgbytes() into shared memory, and the workers can run > > > > CopyReadLineText() inner loop directly off of the buffer in shared memory. > > > > > > > > > > I am slightly confused here. AFAIU, the for(;;) loop in > > > CopyReadLineText is about finding the line endings which we thought > > > that the reader process will do. > > > > Indeed, I somehow misread the code while scanning over it. So CopyReadLineText > > currently copies data from cstate->raw_buf to the StringInfo in > > cstate->line_buf. In parallel mode it would copy it from the shared data buffer > > to local line_buf until it hits the line end found by the data reader. The > > amount of copying done is still exactly the same as it is now. > > > > Yeah, on a broader level it will be something like that, but actual > details might vary during implementation. BTW, have you given any > thoughts on one other approach I have shared above [1]? We might not > go with that idea, but it is better to discuss different ideas and > evaluate their pros and cons. > > [1] - https://www.postgresql.org/message-id/CAA4eK1LyAyPCtBk4rkwomeT6%3DyTse5qWws-7i9EFwnUFZhvu5w%40mail.gmail.com It seems to be that at least for the general CSV case the tokenization to tuples is an inherently serial task. Adding thread synchronization to that path for coordinating between multiple workers is only going to make it slower. It may be possible to enforce limitations on the input (e.g. no quotes allowed) or do some speculative tokenization (e.g. if we encounter quote before newline assume the chunk started in a quoted section) to make it possible to do the tokenization in parallel. But given that the simpler and more featured approach of handling it in a single reader process looks to be fast enough, I don't see the point. I rather think that the next big step would be to overlap reading input and tokenization, hopefully by utilizing Andres's work on asyncio. Regards, Ants Aasma
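As a hedged editorial sketch of the "speculative tokenization" aside above, here is one dual-scan variant: a worker scans its chunk under both possible starting quote states and the correct result is picked once the previous chunk's final state is known, without rescanning. This uses simplified CSV rules (bare quote toggling, no escape handling), a fixed-size result array, and invented names throughout:

#include <stdbool.h>
#include <stddef.h>

typedef struct ScanResult
{
    size_t  newlines[8192];   /* offsets of unquoted newlines */
    int     count;
    bool    ends_in_quote;    /* quote state at the end of the chunk */
} ScanResult;

static void
scan_chunk(const char *buf, size_t len, bool start_in_quote, ScanResult *res)
{
    bool in_quote = start_in_quote;

    res->count = 0;
    for (size_t i = 0; i < len; i++)
    {
        if (buf[i] == '"')
            in_quote = !in_quote;
        else if (buf[i] == '\n' && !in_quote && res->count < 8192)
            res->newlines[res->count++] = i;
    }
    res->ends_in_quote = in_quote;
}

/* Speculative driver: compute both outcomes up front; pick one later,
 * based on the previous chunk's ends_in_quote. */
static void
scan_chunk_speculative(const char *buf, size_t len,
                       ScanResult *if_outside, ScanResult *if_inside)
{
    scan_chunk(buf, len, false, if_outside);
    scan_chunk(buf, len, true, if_inside);
}

The cost is scanning each chunk twice, which is why the single-reader approach may still win unless the scan itself is made very cheap.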
On Wed, Feb 19, 2020 at 11:02:15AM +0200, Ants Aasma wrote: >On Wed, 19 Feb 2020 at 06:22, Amit Kapila <amit.kapila16@gmail.com> wrote: >> >> On Tue, Feb 18, 2020 at 8:08 PM Ants Aasma <ants@cybertec.at> wrote: >> > >> > On Tue, 18 Feb 2020 at 15:21, Amit Kapila <amit.kapila16@gmail.com> wrote: >> > > >> > > On Tue, Feb 18, 2020 at 5:59 PM Ants Aasma <ants@cybertec.at> wrote: >> > > > >> > > > On Tue, 18 Feb 2020 at 12:20, Amit Kapila <amit.kapila16@gmail.com> wrote: >> > > > > This is something similar to what I had also in mind for this idea. I >> > > > > had thought of handing over complete chunk (64K or whatever we >> > > > > decide). The one thing that slightly bothers me is that we will add >> > > > > some additional overhead of copying to and from shared memory which >> > > > > was earlier from local process memory. And, the tokenization (finding >> > > > > line boundaries) would be serial. I think that tokenization should be >> > > > > a small part of the overall work we do during the copy operation, but >> > > > > will do some measurements to ascertain the same. >> > > > >> > > > I don't think any extra copying is needed. >> > > > >> > > >> > > I am talking about access to shared memory instead of the process >> > > local memory. I understand that an extra copy won't be required. >> > > >> > > > The reader can directly >> > > > fread()/pq_copymsgbytes() into shared memory, and the workers can run >> > > > CopyReadLineText() inner loop directly off of the buffer in shared memory. >> > > > >> > > >> > > I am slightly confused here. AFAIU, the for(;;) loop in >> > > CopyReadLineText is about finding the line endings which we thought >> > > that the reader process will do. >> > >> > Indeed, I somehow misread the code while scanning over it. So CopyReadLineText >> > currently copies data from cstate->raw_buf to the StringInfo in >> > cstate->line_buf. In parallel mode it would copy it from the shared data buffer >> > to local line_buf until it hits the line end found by the data reader. The >> > amount of copying done is still exactly the same as it is now. >> > >> >> Yeah, on a broader level it will be something like that, but actual >> details might vary during implementation. BTW, have you given any >> thoughts on one other approach I have shared above [1]? We might not >> go with that idea, but it is better to discuss different ideas and >> evaluate their pros and cons. >> >> [1] - https://www.postgresql.org/message-id/CAA4eK1LyAyPCtBk4rkwomeT6%3DyTse5qWws-7i9EFwnUFZhvu5w%40mail.gmail.com > >It seems to be that at least for the general CSV case the tokenization to >tuples is an inherently serial task. Adding thread synchronization to that path >for coordinating between multiple workers is only going to make it slower. It >may be possible to enforce limitations on the input (e.g. no quotes allowed) or >do some speculative tokenization (e.g. if we encounter quote before newline >assume the chunk started in a quoted section) to make it possible to do the >tokenization in parallel. But given that the simpler and more featured approach >of handling it in a single reader process looks to be fast enough, I don't see >the point. I rather think that the next big step would be to overlap reading >input and tokenization, hopefully by utilizing Andres's work on asyncio. > I generally agree with the impression that parsing CSV is tricky and unlikely to benefit from parallelism in general. There may be cases with restrictions making it easier (e.g. 
restrictions on the format) but that might be a bit too complex to start with.

For example, I had an idea to parallelise the parsing by splitting it into two phases:

1) indexing

Split the CSV file into equally-sized chunks, and make each worker just scan through its chunk and store positions of delimiters, quotes, newlines etc. This is probably the most expensive part of the parsing (essentially going char by char), and we'd speed it up linearly.

2) merge

Combine the information from (1) in a single process, and actually parse the CSV data - we would not have to inspect each character, because we'd know the positions of the interesting chars, so this should be fast. We might have to recheck some stuff (e.g. escaping) but it should still be much faster.

But yes, this may be a bit complex and I'm not sure it's worth it.

The one piece of information I'm missing here is at least a very rough quantification of the individual steps of CSV processing - for example if parsing takes only 10% of the time, it's pretty pointless to start by parallelising this part and we should focus on the rest. If it's 50% it might be a different story. Has anyone done any measurements?

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
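As a rough editorial illustration of phase (1), here is a minimal C sketch. It assumes caller-allocated position arrays (each large enough for the chunk) and deliberately ignores the quoting rules, which phase (2) would have to resolve; all names are invented:

#include <stddef.h>

typedef struct CharIndex
{
    size_t *quotes;     /* positions of '"' */
    size_t *delims;     /* positions of ',' */
    size_t *newlines;   /* positions of '\n' */
    size_t  nquotes, ndelims, nnewlines;
} CharIndex;

/* Scan one equally-sized chunk starting at file offset 'off' and record
 * absolute positions of the characters the merge phase cares about. */
static void
index_chunk(const char *buf, size_t off, size_t len, CharIndex *idx)
{
    idx->nquotes = idx->ndelims = idx->nnewlines = 0;
    for (size_t i = 0; i < len; i++)
    {
        switch (buf[i])
        {
            case '"':  idx->quotes[idx->nquotes++] = off + i;     break;
            case ',':  idx->delims[idx->ndelims++] = off + i;     break;
            case '\n': idx->newlines[idx->nnewlines++] = off + i; break;
        }
    }
}

The per-chunk scans are independent, so they parallelise linearly; only the walk over the recorded positions stays serial.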
On Wed, Feb 19, 2020 at 4:08 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > > The one piece of information I'm missing here is at least a very rough > quantification of the individual steps of CSV processing - for example > if parsing takes only 10% of the time, it's pretty pointless to start by > parallelising this part and we should focus on the rest. If it's 50% it > might be a different story. > Right, this is important information to know. > Has anyone done any measurements? > Not yet, but planning to work on it. -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Fri, Feb 14, 2020 at 01:41:54PM +0530, Amit Kapila wrote: > This work is to parallelize the copy command and in particular "Copy > <table_name> from 'filename' Where <condition>;" command. Apropos of the initial parsing issue generally, there's an interesting approach taken here: https://github.com/robertdavidgraham/wc2 Best, David. -- David Fetter <david(at)fetter(dot)org> http://fetter.org/ Phone: +1 415 235 3778 Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
On Thu, Feb 20, 2020 at 5:12 AM David Fetter <david@fetter.org> wrote: > > On Fri, Feb 14, 2020 at 01:41:54PM +0530, Amit Kapila wrote: > > This work is to parallelize the copy command and in particular "Copy > > <table_name> from 'filename' Where <condition>;" command. > > Apropos of the initial parsing issue generally, there's an interesting > approach taken here: https://github.com/robertdavidgraham/wc2 > Thanks for sharing. I might be missing something, but I can't figure out how this can help here. Does this in some way help to allow multiple workers to read and tokenize the chunks? -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Feb 20, 2020 at 04:11:39PM +0530, Amit Kapila wrote:
>On Thu, Feb 20, 2020 at 5:12 AM David Fetter <david@fetter.org> wrote:
>>
>> On Fri, Feb 14, 2020 at 01:41:54PM +0530, Amit Kapila wrote:
>> > This work is to parallelize the copy command and in particular "Copy
>> > <table_name> from 'filename' Where <condition>;" command.
>>
>> Apropos of the initial parsing issue generally, there's an interesting
>> approach taken here: https://github.com/robertdavidgraham/wc2
>>
>
>Thanks for sharing. I might be missing something, but I can't figure
>out how this can help here. Does this in some way help to allow
>multiple workers to read and tokenize the chunks?
>

I think wc2 is showing that maybe instead of parallelizing the parsing, we might instead try using a different tokenizer/parser and make the implementation more efficient instead of just throwing more CPUs on it.

I don't know if our code is similar to what wc does; maybe parsing CSV is more complicated than what wc does.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Feb 20, 2020 at 02:36:02PM +0100, Tomas Vondra wrote:
> On Thu, Feb 20, 2020 at 04:11:39PM +0530, Amit Kapila wrote:
> > On Thu, Feb 20, 2020 at 5:12 AM David Fetter <david@fetter.org> wrote:
> > >
> > > On Fri, Feb 14, 2020 at 01:41:54PM +0530, Amit Kapila wrote:
> > > > This work is to parallelize the copy command and in particular "Copy
> > > > <table_name> from 'filename' Where <condition>;" command.
> > >
> > > Apropos of the initial parsing issue generally, there's an interesting
> > > approach taken here: https://github.com/robertdavidgraham/wc2
> > >
> >
> > Thanks for sharing. I might be missing something, but I can't figure
> > out how this can help here. Does this in some way help to allow
> > multiple workers to read and tokenize the chunks?
>
> I think wc2 is showing that maybe instead of parallelizing the
> parsing, we might instead try using a different tokenizer/parser and
> make the implementation more efficient instead of just throwing more
> CPUs on it.

That was what I had in mind.

> I don't know if our code is similar to what wc does; maybe parsing
> CSV is more complicated than what wc does.

CSV parsing differs from wc in that there are more states in the state machine, but I don't see anything fundamentally different.

Best,
David.

--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
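For readers unfamiliar with the state machine being compared here, a stripped-down editorial sketch of the CSV record splitter might look like the following (the real COPY code has more states, e.g. for escapes); note how each transition depends on the previous state:

#include <stddef.h>

typedef enum { OUTSIDE_QUOTES, INSIDE_QUOTES } CsvState;

/* Count records in buf: a newline only ends a record outside quotes.
 * A doubled quote ("") toggles the state twice, so a plain toggle is
 * enough for record counting. */
static size_t
count_csv_records(const char *buf, size_t len)
{
    CsvState state = OUTSIDE_QUOTES;
    size_t   records = 0;

    for (size_t i = 0; i < len; i++)
    {
        char c = buf[i];

        if (c == '"')
            state = (state == OUTSIDE_QUOTES) ? INSIDE_QUOTES : OUTSIDE_QUOTES;
        else if (c == '\n' && state == OUTSIDE_QUOTES)
            records++;
    }
    return records;
}

The dependency chain Ants describes in the next message is exactly the serial update of 'state' above: byte i+1 cannot be classified until byte i's transition is known.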
On Thu, 20 Feb 2020 at 18:43, David Fetter <david@fetter.org> wrote:
> On Thu, Feb 20, 2020 at 02:36:02PM +0100, Tomas Vondra wrote:
> > I think the wc2 is showing that maybe instead of parallelizing the
> > parsing, we might instead try using a different tokenizer/parser and
> > make the implementation more efficient instead of just throwing more
> > CPUs on it.
>
> That was what I had in mind.
>
> > I don't know if our code is similar to what wc does; maybe parsing
> > CSV is more complicated than what wc does.
>
> CSV parsing differs from wc in that there are more states in the state
> machine, but I don't see anything fundamentally different.

The trouble with a state machine based approach is that the state transitions form a dependency chain, which means that at best the processing rate will be 4-5 cycles per byte (L1 latency to fetch the next state).

I whipped together a quick prototype that uses SIMD and bitmap manipulations to do the equivalent of CopyReadLineText() in csv mode including quotes and escape handling; this runs at 0.25-0.5 cycles per byte.

Regards,
Ants Aasma
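Ants's actual prototype was attached to his message; the following is only an editorial illustration of the general SIMD-plus-bitmap idea, not his code. It uses SSE2 intrinsics plus a GCC/Clang builtin, and simplified quote handling: 16 bytes are compared against '\n' and '"' at once, yielding bitmasks, and the stateful loop then visits only the set bits instead of every byte:

#include <emmintrin.h>   /* SSE2 */
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

static size_t
count_records_simd(const char *buf, size_t len)
{
    const __m128i nl = _mm_set1_epi8('\n');
    const __m128i qt = _mm_set1_epi8('"');
    bool   in_quote = false;
    size_t records = 0;
    size_t i = 0;

    for (; i + 16 <= len; i += 16)
    {
        __m128i v = _mm_loadu_si128((const __m128i *) (buf + i));
        uint32_t nl_mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, nl));
        uint32_t qt_mask = _mm_movemask_epi8(_mm_cmpeq_epi8(v, qt));
        uint32_t interesting = nl_mask | qt_mask;

        /* Visit set bits in ascending position order. */
        while (interesting)
        {
            int bit = __builtin_ctz(interesting);

            interesting &= interesting - 1;   /* clear lowest set bit */
            if (qt_mask & (1u << bit))
                in_quote = !in_quote;
            else if (!in_quote)
                records++;
        }
    }
    for (; i < len; i++)    /* scalar tail */
    {
        if (buf[i] == '"')
            in_quote = !in_quote;
        else if (buf[i] == '\n' && !in_quote)
            records++;
    }
    return records;
}

For typical data most 16-byte blocks contain no quotes or newlines at all, so the inner bit loop rarely runs, which is where the cycles-per-byte win comes from.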
On Fri, Feb 21, 2020 at 02:54:31PM +0200, Ants Aasma wrote: >On Thu, 20 Feb 2020 at 18:43, David Fetter <david@fetter.org> wrote:> >> On Thu, Feb 20, 2020 at 02:36:02PM +0100, Tomas Vondra wrote: >> > I think the wc2 is showing that maybe instead of parallelizing the >> > parsing, we might instead try using a different tokenizer/parser and >> > make the implementation more efficient instead of just throwing more >> > CPUs on it. >> >> That was what I had in mind. >> >> > I don't know if our code is similar to what wc does, maytbe parsing >> > csv is more complicated than what wc does. >> >> CSV parsing differs from wc in that there are more states in the state >> machine, but I don't see anything fundamentally different. > >The trouble with a state machine based approach is that the state >transitions form a dependency chain, which means that at best the >processing rate will be 4-5 cycles per byte (L1 latency to fetch the >next state). > >I whipped together a quick prototype that uses SIMD and bitmap >manipulations to do the equivalent of CopyReadLineText() in csv mode >including quotes and escape handling, this runs at 0.25-0.5 cycles per >byte. > Interesting. How does that compare to what we currently have? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Feb 18, 2020 at 6:51 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > I am talking about access to shared memory instead of the process > local memory. I understand that an extra copy won't be required. You make it sound like there is some performance penalty for accessing shared memory, but I don't think that's true. It's true that *contended* access to shared memory can be slower, because if multiple processes are trying to access the same memory, and especially if multiple processes are trying to write the same memory, then the cache lines have to be shared and that has a cost. However, I don't think that would create any noticeable effect in this case. First, there's presumably only one writer process. Second, you wouldn't normally have multiple readers working on the same part of the data at the same time. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2020-02-19 11:38:45 +0100, Tomas Vondra wrote: > I generally agree with the impression that parsing CSV is tricky and > unlikely to benefit from parallelism in general. There may be cases with > restrictions making it easier (e.g. restrictions on the format) but that > might be a bit too complex to start with. > > For example, I had an idea to parallelise the planning by splitting it > into two phases: FWIW, I think we ought to rewrite our COPY parsers before we go for complex schemes. They're way slower than a decent green-field CSV/... parser. > The one piece of information I'm missing here is at least a very rough > quantification of the individual steps of CSV processing - for example > if parsing takes only 10% of the time, it's pretty pointless to start by > parallelising this part and we should focus on the rest. If it's 50% it > might be a different story. Has anyone done any measurements? Not recently, but I'm pretty sure that I've observed CSV parsing to be way more than 10%. Greetings, Andres Freund
On Sun, Feb 23, 2020 at 05:09:51PM -0800, Andres Freund wrote: >Hi, > >On 2020-02-19 11:38:45 +0100, Tomas Vondra wrote: >> I generally agree with the impression that parsing CSV is tricky and >> unlikely to benefit from parallelism in general. There may be cases with >> restrictions making it easier (e.g. restrictions on the format) but that >> might be a bit too complex to start with. >> >> For example, I had an idea to parallelise the planning by splitting it >> into two phases: > >FWIW, I think we ought to rewrite our COPY parsers before we go for >complex schemes. They're way slower than a decent green-field >CSV/... parser. > Yep, that's quite possible. > >> The one piece of information I'm missing here is at least a very rough >> quantification of the individual steps of CSV processing - for example >> if parsing takes only 10% of the time, it's pretty pointless to start by >> parallelising this part and we should focus on the rest. If it's 50% it >> might be a different story. Has anyone done any measurements? > >Not recently, but I'm pretty sure that I've observed CSV parsing to be >way more than 10%. > Perhaps. I guess it'll depend on the CSV file (number of fields, ...), so I still think we need to do some measurements first. I'm willing to do that, but (a) I doubt I'll have time for that until after 2020-03, and (b) it'd be good to agree on some set of typical CSV files. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Feb 25, 2020 at 9:30 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> On Sun, Feb 23, 2020 at 05:09:51PM -0800, Andres Freund wrote:
> >Hi,
> >
> >> The one piece of information I'm missing here is at least a very rough
> >> quantification of the individual steps of CSV processing - for example
> >> if parsing takes only 10% of the time, it's pretty pointless to start by
> >> parallelising this part and we should focus on the rest. If it's 50% it
> >> might be a different story. Has anyone done any measurements?
> >
> >Not recently, but I'm pretty sure that I've observed CSV parsing to be
> >way more than 10%.
> >
>
> Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
> so I still think we need to do some measurements first.
>

Agreed.

> I'm willing to
> do that, but (a) I doubt I'll have time for that until after 2020-03,
> and (b) it'd be good to agree on some set of typical CSV files.
>

Right, I don't know what is the best way to define that. I can think of the below tests.

1. A table with 10 columns (with datatypes as integers, date, text). It has one index (unique/primary). Load with 1 million rows (basically the data should be probably 5-10 GB).
2. A table with 10 columns (with datatypes as integers, date, text). It has three indexes, one index can be (unique/primary). Load with 1 million rows (basically the data should be probably 5-10 GB).
3. A table with 10 columns (with datatypes as integers, date, text). It has three indexes, one index can be (unique/primary). It has before and after triggers. Load with 1 million rows (basically the data should be probably 5-10 GB).
4. A table with 10 columns (with datatypes as integers, date, text). It has five or six indexes, one index can be (unique/primary). Load with 1 million rows (basically the data should be probably 5-10 GB).

Among all these tests, we can check how much time we spend in reading and parsing the csv files vs. the rest of the execution.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, 26 Feb 2020 at 10:54, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Feb 25, 2020 at 9:30 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> > ...
> >
> > Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
> > so I still think we need to do some measurements first.
> >
>
> Agreed.
>
> > I'm willing to
> > do that, but (a) I doubt I'll have time for that until after 2020-03,
> > and (b) it'd be good to agree on some set of typical CSV files.
> >
>
> Right, I don't know what is the best way to define that. I can think
> of the below tests.
>
> 1. A table with 10 columns (with datatypes as integers, date, text).
> It has one index (unique/primary). Load with 1 million rows (basically
> the data should be probably 5-10 GB).
> 2. A table with 10 columns (with datatypes as integers, date, text).
> It has three indexes, one index can be (unique/primary). Load with 1
> million rows (basically the data should be probably 5-10 GB).
> 3. A table with 10 columns (with datatypes as integers, date, text).
> It has three indexes, one index can be (unique/primary). It has before
> and after triggers. Load with 1 million rows (basically the data
> should be probably 5-10 GB).
> 4. A table with 10 columns (with datatypes as integers, date, text).
> It has five or six indexes, one index can be (unique/primary). Load
> with 1 million rows (basically the data should be probably 5-10 GB).
>
> Among all these tests, we can check how much time we spend in
> reading and parsing the csv files vs. the rest of the execution.

That's a good set of tests of what happens after the parse. Two simpler test runs may provide useful baselines - no constraints/indexes with all columns varchar, and no constraints/indexes with columns correctly typed.

For testing the impact of various parts of the parse process, my idea would be (a generator sketch follows below):

- A base dataset with 10 columns including int, date and text. One text field quoted and containing both delimiters and line terminators
- A derivative to measure just line parsing - strip the quotes around the text field and quote the whole row as one text field
- A derivative to measure the impact of quoted fields - clean up the text field so it doesn't require quoting
- A derivative to measure the impact of row length - run ten rows together to make 100 column rows, but only a tenth as many rows

If that sounds reasonable, I'll try to knock up a generator.

--
Alastair
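A possible starting point for such a generator, sketched editorially in C (not Alastair's actual tool; row count, column mix, and values are placeholders) - it emits the "base dataset" variant with a quoted text field containing both the delimiter and an embedded line terminator:

#include <stdio.h>

int
main(void)
{
    const long nrows = 1000000;     /* placeholder row count */

    printf("id,flag,created,comment\n");
    for (long i = 0; i < nrows; i++)
    {
        /* Quoted field with a comma and a newline inside, per CSV rules. */
        printf("%ld,%d,2020-02-%02ld,\"note %ld, line one\nline two\"\n",
               i, (int) (i % 2), (i % 28) + 1, i);
    }
    return 0;
}

The derivative datasets then fall out mechanically: re-quote the whole row for the line-parsing variant, drop the embedded specials for the unquoted variant, and concatenate ten rows for the wide-row variant.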
On Tue, 25 Feb 2020 at 18:00, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
> so I still think we need to do some measurements first. I'm willing to
> do that, but (a) I doubt I'll have time for that until after 2020-03,
> and (b) it'd be good to agree on some set of typical CSV files.

I agree that getting a nice varied dataset would be nice. Including things like narrow integer only tables, strings with newlines and escapes in them, extremely wide rows.

I tried to capture a quick profile just to see what it looks like. Grabbed a random open data set from the web, about 800MB of narrow rows CSV [1].

Script:
CREATE TABLE census (year int,age int,ethnic int,sex int,area text,count text);
COPY census FROM '.../Data8277.csv' WITH (FORMAT 'csv', HEADER true);

Profile:
# Samples: 59K of event 'cycles:u'
# Event count (approx.): 57644269486
#
# Overhead  Command   Shared Object  Symbol
# ........  ........  .............  .......................................
#
    18.24%  postgres  postgres       [.] CopyReadLine
     9.23%  postgres  postgres       [.] NextCopyFrom
     8.87%  postgres  postgres       [.] NextCopyFromRawFields
     5.82%  postgres  postgres       [.] pg_verify_mbstr_len
     5.45%  postgres  postgres       [.] pg_strtoint32
     4.16%  postgres  postgres       [.] heap_fill_tuple
     4.03%  postgres  postgres       [.] heap_compute_data_size
     3.83%  postgres  postgres       [.] CopyFrom
     3.78%  postgres  postgres       [.] AllocSetAlloc
     3.53%  postgres  postgres       [.] heap_form_tuple
     2.96%  postgres  postgres       [.] InputFunctionCall
     2.89%  postgres  libc-2.30.so   [.] __memmove_avx_unaligned_erms
     1.82%  postgres  libc-2.30.so   [.] __strlen_avx2
     1.72%  postgres  postgres       [.] AllocSetReset
     1.72%  postgres  postgres       [.] RelationPutHeapTuple
     1.47%  postgres  postgres       [.] heap_prepare_insert
     1.31%  postgres  postgres       [.] heap_multi_insert
     1.25%  postgres  postgres       [.] textin
     1.24%  postgres  postgres       [.] int4in
     1.05%  postgres  postgres       [.] tts_buffer_heap_clear
     0.85%  postgres  postgres       [.] pg_any_to_server
     0.80%  postgres  postgres       [.] pg_comp_crc32c_sse42
     0.77%  postgres  postgres       [.] cstring_to_text_with_len
     0.69%  postgres  postgres       [.] AllocSetFree
     0.60%  postgres  postgres       [.] appendBinaryStringInfo
     0.55%  postgres  postgres       [.] tts_buffer_heap_materialize.part.0
     0.54%  postgres  postgres       [.] palloc
     0.54%  postgres  libc-2.30.so   [.] __memmove_avx_unaligned
     0.51%  postgres  postgres       [.] palloc0
     0.51%  postgres  postgres       [.] pg_encoding_max_length
     0.48%  postgres  postgres       [.] enlargeStringInfo
     0.47%  postgres  postgres       [.] ExecStoreVirtualTuple
     0.45%  postgres  postgres       [.] PageAddItemExtended

So that confirms that the parsing is a huge chunk of overhead with current splitting into lines being the largest portion. Amdahl's law says that splitting into tuples needs to be made fast before parallelizing makes any sense.

Regards,
Ants Aasma

[1] https://www3.stats.govt.nz/2018census/Age-sex-by-ethnic-group-grouped-total-responses-census-usually-resident-population-counts-2006-2013-2018-Censuses-RC-TA-SA2-DHB.zip
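As a rough editorial back-of-the-envelope on the profile above (an illustration of the Amdahl point, not a figure from the thread): if line splitting (CopyReadLine, ~18%) stays serial and everything else parallelizes perfectly across N workers, the speedup is capped at

    S(N) = 1 / (0.18 + 0.82/N),   so S(infinity) = 1/0.18, roughly 5.5x.

Treating the wider tokenizing path as serial too (CopyReadLine + NextCopyFromRawFields, roughly 27%) drops the ceiling to about 1/0.27, roughly 3.7x, regardless of worker count.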
On Wed, Feb 26, 2020 at 8:47 PM Ants Aasma <ants@cybertec.at> wrote:
>
> On Tue, 25 Feb 2020 at 18:00, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> > Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
> > so I still think we need to do some measurements first. I'm willing to
> > do that, but (a) I doubt I'll have time for that until after 2020-03,
> > and (b) it'd be good to agree on some set of typical CSV files.
>
> I agree that getting a nice varied dataset would be nice. Including
> things like narrow integer only tables, strings with newlines and
> escapes in them, extremely wide rows.
>
> I tried to capture a quick profile just to see what it looks like.
> Grabbed a random open data set from the web, about 800MB of narrow
> rows CSV [1].
>
> Script:
> CREATE TABLE census (year int,age int,ethnic int,sex int,area text,count text);
> COPY census FROM '.../Data8277.csv' WITH (FORMAT 'csv', HEADER true);
>
> Profile:
> # Samples: 59K of event 'cycles:u'
> # Event count (approx.): 57644269486
> #
> # Overhead  Command   Shared Object  Symbol
> # ........  ........  .............  ..................................
> #
>     18.24%  postgres  postgres       [.] CopyReadLine
>      9.23%  postgres  postgres       [.] NextCopyFrom
>      8.87%  postgres  postgres       [.] NextCopyFromRawFields
>      5.82%  postgres  postgres       [.] pg_verify_mbstr_len
>      5.45%  postgres  postgres       [.] pg_strtoint32
>      4.16%  postgres  postgres       [.] heap_fill_tuple
>      4.03%  postgres  postgres       [.] heap_compute_data_size
>      3.83%  postgres  postgres       [.] CopyFrom
>      3.78%  postgres  postgres       [.] AllocSetAlloc
>      3.53%  postgres  postgres       [.] heap_form_tuple
>      2.96%  postgres  postgres       [.] InputFunctionCall
>      2.89%  postgres  libc-2.30.so   [.] __memmove_avx_unaligned_erms
>      1.82%  postgres  libc-2.30.so   [.] __strlen_avx2
>      1.72%  postgres  postgres       [.] AllocSetReset
>      1.72%  postgres  postgres       [.] RelationPutHeapTuple
>      1.47%  postgres  postgres       [.] heap_prepare_insert
>      1.31%  postgres  postgres       [.] heap_multi_insert
>      1.25%  postgres  postgres       [.] textin
>      1.24%  postgres  postgres       [.] int4in
>      1.05%  postgres  postgres       [.] tts_buffer_heap_clear
>      0.85%  postgres  postgres       [.] pg_any_to_server
>      0.80%  postgres  postgres       [.] pg_comp_crc32c_sse42
>      0.77%  postgres  postgres       [.] cstring_to_text_with_len
>      0.69%  postgres  postgres       [.] AllocSetFree
>      0.60%  postgres  postgres       [.] appendBinaryStringInfo
>      0.55%  postgres  postgres       [.] tts_buffer_heap_materialize.part.0
>      0.54%  postgres  postgres       [.] palloc
>      0.54%  postgres  libc-2.30.so   [.] __memmove_avx_unaligned
>      0.51%  postgres  postgres       [.] palloc0
>      0.51%  postgres  postgres       [.] pg_encoding_max_length
>      0.48%  postgres  postgres       [.] enlargeStringInfo
>      0.47%  postgres  postgres       [.] ExecStoreVirtualTuple
>      0.45%  postgres  postgres       [.] PageAddItemExtended
>
> So that confirms that the parsing is a huge chunk of overhead with
> current splitting into lines being the largest portion. Amdahl's law
> says that splitting into tuples needs to be made fast before
> parallelizing makes any sense.
>

I ran a very simple case on a table with 2 indexes and I can see that a lot of time is spent in index insertion. I agree that there is a good amount of time spent in tokenizing, but it is not very huge compared to index insertion. I have expanded the time spent in the CopyFrom function from my perf report and pasted it here. We can see that a lot of time is spent in ExecInsertIndexTuples (77%). I agree that we need to further evaluate how much of that is I/O vs. CPU operations.

But, the point I want to make is that it's not true for all the cases that parsing takes the maximum amount of time.

- 99.50% CopyFrom
   - 82.90% CopyMultiInsertInfoFlush
      - 82.85% CopyMultiInsertBufferFlush
         + 77.68% ExecInsertIndexTuples
         + 3.74% table_multi_insert
         + 0.89% ExecClearTuple
   - 12.54% NextCopyFrom
      - 7.70% NextCopyFromRawFields
         - 5.72% CopyReadLine
              3.96% CopyReadLineText
            + 1.49% pg_any_to_server
           1.86% CopyReadAttributesCSV
      + 3.68% InputFunctionCall
   + 2.11% ExecMaterializeSlot
   + 0.94% MemoryContextReset

My test:

-- Prepare:
CREATE TABLE t (a int, b int, c varchar);
insert into t select i,i, 'aaaaaaaaaaaaaaaaaaaaaaaa' from generate_series(1,10000000) as i;
copy t to '/home/dilipkumar/a.csv' WITH (FORMAT 'csv', HEADER true);
truncate table t;
create index idx on t(a);
create index idx1 on t(c);

-- Test CopyFrom and measure with perf:
copy t from '/home/dilipkumar/a.csv' WITH (FORMAT 'csv', HEADER true);

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Wed, Feb 26, 2020 at 4:24 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Feb 25, 2020 at 9:30 PM Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
> >
> > On Sun, Feb 23, 2020 at 05:09:51PM -0800, Andres Freund wrote:
> > >Hi,
> > >
> > >> The one piece of information I'm missing here is at least a very rough
> > >> quantification of the individual steps of CSV processing - for example
> > >> if parsing takes only 10% of the time, it's pretty pointless to start by
> > >> parallelising this part and we should focus on the rest. If it's 50% it
> > >> might be a different story. Has anyone done any measurements?
> > >
> > >Not recently, but I'm pretty sure that I've observed CSV parsing to be
> > >way more than 10%.
> > >
> >
> > Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
> > so I still think we need to do some measurements first.
> >
>
> Agreed.
>
> > I'm willing to
> > do that, but (a) I doubt I'll have time for that until after 2020-03,
> > and (b) it'd be good to agree on some set of typical CSV files.
> >
>
> Right, I don't know what is the best way to define that. I can think
> of the below tests.
>
> 1. A table with 10 columns (with datatypes as integers, date, text).
> It has one index (unique/primary). Load with 1 million rows (basically
> the data should be probably 5-10 GB).
> 2. A table with 10 columns (with datatypes as integers, date, text).
> It has three indexes, one index can be (unique/primary). Load with 1
> million rows (basically the data should be probably 5-10 GB).
> 3. A table with 10 columns (with datatypes as integers, date, text).
> It has three indexes, one index can be (unique/primary). It has before
> and after trigeers. Load with 1 million rows (basically the data
> should be probably 5-10 GB).
> 4. A table with 10 columns (with datatypes as integers, date, text).
> It has five or six indexes, one index can be (unique/primary). Load
> with 1 million rows (basically the data should be probably 5-10 GB).
>
I have tried to capture the execution time taken for 3 scenarios which I felt could give a fair idea:
Test1 (Table with 3 indexes and 1 trigger)
Test2 (Table with 2 indexes)
Test3 (Table without indexes/triggers)
I have captured the following details:
File Read Time - time taken to read the file from the CopyGetData function.
Read Line Time - time taken to read a line from the NextCopyFrom function (read time & tokenise time excluded).
Tokenize Time - time taken to tokenize the contents from the NextCopyFromRawFields function.
Data Execution Time - the remaining execution time out of the total time.
The execution breakdown for the tests are given below:
Test / Time (in seconds) | Total Time | File Read Time | Read Line / Buffer Read Time | Tokenize Time | Data Execution Time |
Test1 | 1693.369 | 0.256 | 34.173 | 5.578 | 1653.362 |
Test2 | 736.096 | 0.288 | 39.762 | 6.525 | 689.521 |
Test3 | 112.06 | 0.266 | 39.189 | 6.433 | 66.172 |
Steps for the scenarios:
Test1(Table with 3 indexes and 1 trigger):
CREATE TABLE census2 (year int,age int,ethnic int,sex int,area text,count text);
CREATE TABLE census3(year int,age int,ethnic int,sex int,area text,count text);
CREATE INDEX idx1_census2 on census2(year);
CREATE INDEX idx2_census2 on census2(age);
CREATE INDEX idx3_census2 on census2(ethnic);
CREATE or REPLACE FUNCTION census2_afterinsert()
RETURNS TRIGGER
AS $$
BEGIN
INSERT INTO census3 SELECT * FROM census2 limit 1;
RETURN NEW;
END;
$$
LANGUAGE plpgsql;
CREATE TRIGGER census2_trigger AFTER INSERT ON census2 FOR EACH ROW EXECUTE PROCEDURE census2_afterinsert();
COPY census2 FROM 'Data8277.csv' WITH (FORMAT 'csv', HEADER true);
Test2 (Table with 2 indexes):
CREATE TABLE census1 (year int,age int,ethnic int,sex int,area text,count text);
CREATE INDEX idx1_census1 on census1(year);
CREATE INDEX idx2_census1 on census1(age);
COPY census1 FROM 'Data8277.csv' WITH (FORMAT 'csv', HEADER true);
Test3 (Table without indexes/triggers):
CREATE TABLE census (year int,age int,ethnic int,sex int,area text,count text);
COPY census FROM 'Data8277.csv' WITH (FORMAT 'csv', HEADER true);
Note: The Data8277.csv used was the same data that Ants Aasma had used.
From the above results we can infer that Read line will have to be done sequentially; Read line time takes about 2.01%, 5.40% and 34.97% of the total time in the three tests. I felt we will be able to parallelise the remaining phases of the copy. The performance improvement will vary based on the scenario (indexes/triggers); it will be proportionate to the number of indexes and triggers. Read line can also be parallelised in the text (non-csv) format. I feel parallelising copy could give a significant improvement in quite a few scenarios.
Further, I'm planning to see how the execution will be for a toast table. I'm also planning to do a test on a RAM disk, where I will configure the data on the RAM disk so that we can further eliminate the I/O cost.
Thoughts?
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Wed, Feb 26, 2020 at 8:47 PM Ants Aasma <ants@cybertec.at> wrote:
>
> On Tue, 25 Feb 2020 at 18:00, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> > Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
> > so I still think we need to do some measurements first. I'm willing to
> > do that, but (a) I doubt I'll have time for that until after 2020-03,
> > and (b) it'd be good to agree on some set of typical CSV files.
>
> I agree that getting a nice varied dataset would be nice. Including
> things like narrow integer only tables, strings with newlines and
> escapes in them, extremely wide rows.
>
> I tried to capture a quick profile just to see what it looks like.
> Grabbed a random open data set from the web, about 800MB of narrow
> rows CSV [1].
>
> Script:
> CREATE TABLE census (year int,age int,ethnic int,sex int,area text,count text);
> COPY census FROM '.../Data8277.csv' WITH (FORMAT 'csv', HEADER true);
>
> Profile:
> # Samples: 59K of event 'cycles:u'
> # Event count (approx.): 57644269486
> #
> # Overhead Command Shared Object Symbol
> # ........ ........ ..................
> .......................................
> #
> 18.24% postgres postgres [.] CopyReadLine
> 9.23% postgres postgres [.] NextCopyFrom
> 8.87% postgres postgres [.] NextCopyFromRawFields
> 5.82% postgres postgres [.] pg_verify_mbstr_len
> 5.45% postgres postgres [.] pg_strtoint32
> 4.16% postgres postgres [.] heap_fill_tuple
> 4.03% postgres postgres [.] heap_compute_data_size
> 3.83% postgres postgres [.] CopyFrom
> 3.78% postgres postgres [.] AllocSetAlloc
> 3.53% postgres postgres [.] heap_form_tuple
> 2.96% postgres postgres [.] InputFunctionCall
> 2.89% postgres libc-2.30.so [.] __memmove_avx_unaligned_erms
> 1.82% postgres libc-2.30.so [.] __strlen_avx2
> 1.72% postgres postgres [.] AllocSetReset
> 1.72% postgres postgres [.] RelationPutHeapTuple
> 1.47% postgres postgres [.] heap_prepare_insert
> 1.31% postgres postgres [.] heap_multi_insert
> 1.25% postgres postgres [.] textin
> 1.24% postgres postgres [.] int4in
> 1.05% postgres postgres [.] tts_buffer_heap_clear
> 0.85% postgres postgres [.] pg_any_to_server
> 0.80% postgres postgres [.] pg_comp_crc32c_sse42
> 0.77% postgres postgres [.] cstring_to_text_with_len
> 0.69% postgres postgres [.] AllocSetFree
> 0.60% postgres postgres [.] appendBinaryStringInfo
> 0.55% postgres postgres [.] tts_buffer_heap_materialize.part.0
> 0.54% postgres postgres [.] palloc
> 0.54% postgres libc-2.30.so [.] __memmove_avx_unaligned
> 0.51% postgres postgres [.] palloc0
> 0.51% postgres postgres [.] pg_encoding_max_length
> 0.48% postgres postgres [.] enlargeStringInfo
> 0.47% postgres postgres [.] ExecStoreVirtualTuple
> 0.45% postgres postgres [.] PageAddItemExtended
>
> So that confirms that the parsing is a huge chunk of overhead with
> current splitting into lines being the largest portion. Amdahl's law
> says that splitting into tuples needs to be made fast before
> parallelizing makes any sense.
>
I had taken perf report with the same test data that you had used, I was getting the following results:
.....
+ 99.61% 0.00% postgres postgres [.] PortalRun
+ 99.61% 0.00% postgres postgres [.] PortalRunMulti
+ 99.61% 0.00% postgres postgres [.] PortalRunUtility
+ 99.61% 0.00% postgres postgres [.] ProcessUtility
+ 99.61% 0.00% postgres postgres [.] standard_ProcessUtility
+ 99.61% 0.00% postgres postgres [.] DoCopy
+ 99.30% 0.94% postgres postgres [.] CopyFrom
+ 51.61% 7.76% postgres postgres [.] NextCopyFrom
+ 23.66% 0.01% postgres postgres [.] CopyMultiInsertInfoFlush
+ 23.61% 0.28% postgres postgres [.] CopyMultiInsertBufferFlush
+ 21.99% 1.02% postgres postgres [.] NextCopyFromRawFields
+ 19.79% 0.01% postgres postgres [.] table_multi_insert
+ 19.32% 3.00% postgres postgres [.] heap_multi_insert
+ 18.27% 2.44% postgres postgres [.] InputFunctionCall
+ 15.19% 0.89% postgres postgres [.] CopyReadLine
+ 13.05% 0.18% postgres postgres [.] ExecMaterializeSlot
+ 13.00% 0.55% postgres postgres [.] tts_buffer_heap_materialize
+ 12.31% 1.77% postgres postgres [.] heap_form_tuple
+ 10.43% 0.45% postgres postgres [.] int4in
+ 10.18% 8.92% postgres postgres [.] CopyReadLineText
......
In my results I observed that table_multi_insert execution was nearly 20%. Also, I felt that once we have made a few tuples from CopyReadLine, the parallel workers should be able to start consuming and processing that data. We need not wait for the complete tokenisation to be finished. Once a few tuples are tokenised, parallel workers should start consuming the data in parallel while tokenisation continues simultaneously. In this way, once the copy is done in parallel, the total execution time should be the CopyReadLine time + a delta of processing time.
Thoughts?
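As an editorial sketch of that overlap - plain C11 atomics standing in for the shared-memory machinery PostgreSQL would actually use, with all names invented - the leader publishes a count of tokenised lines as it goes and workers claim lines as soon as they appear, instead of waiting for tokenisation of the whole input:

#include <stdatomic.h>
#include <stdbool.h>

typedef struct LineFeed
{
    _Atomic long produced;   /* lines tokenised so far (leader bumps this) */
    _Atomic long claimed;    /* next line index to hand out */
    _Atomic bool done;       /* leader sets this when input is exhausted */
} LineFeed;

/* Leader side: call after each line's boundaries are known. */
static void
publish_line(LineFeed *feed)
{
    atomic_fetch_add(&feed->produced, 1);
}

/* Worker side: returns a line index to process, or -1 when all are done. */
static long
claim_line(LineFeed *feed)
{
    for (;;)
    {
        long next = atomic_load(&feed->claimed);

        if (next < atomic_load(&feed->produced))
        {
            if (atomic_compare_exchange_weak(&feed->claimed, &next, next + 1))
                return next;
        }
        else if (atomic_load(&feed->done))
            return -1;
        /* else: spin or sleep until the leader produces more */
    }
}

With this shape the tail cost is only the last few lines' processing, matching the "CopyReadLine time + delta" estimate above.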
I have got the execution breakdown for a few scenarios with a normal disk and a RAM disk.
Execution breakdown on normal disk:
Test / Time (in seconds) | Total Time | File Read Time | CopyReadLine Time | Remaining Execution Time | Read line percentage |
Test1(3 index + 1 trigger) | 2099.017 | 0.311 | 10.256 | 2088.45 | 0.4886096682 |
Test2(2 index) | 657.994 | 0.303 | 10.171 | 647.52 | 1.545758776 |
Test3(no index, no trigger) | 112.41 | 0.296 | 10.996 | 101.118 | 9.782047861 |
Test4(toast) | 360.028 | 1.43 | 46.556 | 312.042 | 12.93121646 |
Execution breakdown on RAM disk:
Test / Time (in seconds) | Total Time | File Read Time | CopyReadLine Time | Remaining Execution Time | Read line percentage |
Test1(3 index + 1 trigger) | 1571.558 | 0.259 | 6.986 | 1564.313 | 0.4445270235 |
Test2(2 index) | 369.942 | 0.263 | 6.848 | 362.831 | 1.851100983 |
Test3(no index, no trigger) | 54.077 | 0.239 | 6.805 | 47.033 | 12.58390813 |
Test4(toast) | 96.323 | 0.918 | 26.603 | 68.802 | 27.61853348 |
Steps for the scenarios:
Test1(Table with 3 indexes and 1 trigger):
CREATE TABLE census2 (year int,age int,ethnic int,sex int,area text,count text);
CREATE TABLE census3(year int,age int,ethnic int,sex int,area text,count text);
CREATE INDEX idx1_census2 on census2(year);
CREATE INDEX idx2_census2 on census2(age);
CREATE INDEX idx3_census2 on census2(ethnic);
CREATE or REPLACE FUNCTION census2_afterinsert()
RETURNS TRIGGER
AS $$
BEGIN
INSERT INTO census3 SELECT * FROM census2 limit 1;
RETURN NEW;
END;
$$
LANGUAGE plpgsql;
CREATE TRIGGER census2_trigger AFTER INSERT ON census2 FOR EACH ROW EXECUTE PROCEDURE census2_afterinsert();
COPY census2 FROM 'Data8277.csv' WITH (FORMAT 'csv', HEADER true);
Test2 (Table with 2 indexes):
CREATE TABLE census1 (year int,age int,ethnic int,sex int,area text,count text);
CREATE INDEX idx1_census1 on census1(year);
CREATE INDEX idx2_census1 on census1(age);
COPY census1 FROM 'Data8277.csv' WITH (FORMAT 'csv', HEADER true);
Test3 (Table without indexes/triggers):
CREATE TABLE census (year int,age int,ethnic int,sex int,area text,count text);
COPY census FROM 'Data8277.csv' WITH (FORMAT 'csv', HEADER true);
The random open data set from the web, about 800MB of narrow rows CSV [1], was used in the above tests - the same one that Ants Aasma had used.
Test4 (Toast table):
CREATE TABLE indtoasttest(descr text, cnt int DEFAULT 0, f1 text, f2 text);
alter table indtoasttest alter column f1 set storage external;
alter table indtoasttest alter column f2 set storage external;
inserted 262144 records
copy indtoasttest to '/mnt/magnetic/vignesh.c/postgres/toast_data3.csv' WITH (FORMAT 'csv', HEADER true);
CREATE TABLE indtoasttest1(descr text, cnt int DEFAULT 0, f1 text, f2 text);
alter table indtoasttest1 alter column f1 set storage external;
alter table indtoasttest1 alter column f2 set storage external;
copy indtoasttest1 from '/mnt/magnetic/vignesh.c/postgres/toast_data3.csv' WITH (FORMAT 'csv', HEADER true);
Attached is the patch, for reference, which was used to capture the execution time breakdown.
Thoughts?
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Tue, Mar 3, 2020 at 11:44 AM vignesh C <vignesh21@gmail.com> wrote:
On Wed, Feb 26, 2020 at 8:47 PM Ants Aasma <ants@cybertec.at> wrote:
>
> On Tue, 25 Feb 2020 at 18:00, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> > Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
> > so I still think we need to do some measurements first. I'm willing to
> > do that, but (a) I doubt I'll have time for that until after 2020-03,
> > and (b) it'd be good to agree on some set of typical CSV files.
>
> I agree that getting a nice varied dataset would be nice. Including
> things like narrow integer only tables, strings with newlines and
> escapes in them, extremely wide rows.
>
> I tried to capture a quick profile just to see what it looks like.
> Grabbed a random open data set from the web, about 800MB of narrow
> rows CSV [1].
>
> Script:
> CREATE TABLE census (year int,age int,ethnic int,sex int,area text,count text);
> COPY census FROM '.../Data8277.csv' WITH (FORMAT 'csv', HEADER true);
>
> Profile:
> # Samples: 59K of event 'cycles:u'
> # Event count (approx.): 57644269486
> #
> # Overhead Command Shared Object Symbol
> # ........ ........ ..................
> .......................................
> #
> 18.24% postgres postgres [.] CopyReadLine
> 9.23% postgres postgres [.] NextCopyFrom
> 8.87% postgres postgres [.] NextCopyFromRawFields
> 5.82% postgres postgres [.] pg_verify_mbstr_len
> 5.45% postgres postgres [.] pg_strtoint32
> 4.16% postgres postgres [.] heap_fill_tuple
> 4.03% postgres postgres [.] heap_compute_data_size
> 3.83% postgres postgres [.] CopyFrom
> 3.78% postgres postgres [.] AllocSetAlloc
> 3.53% postgres postgres [.] heap_form_tuple
> 2.96% postgres postgres [.] InputFunctionCall
> 2.89% postgres libc-2.30.so [.] __memmove_avx_unaligned_erms
> 1.82% postgres libc-2.30.so [.] __strlen_avx2
> 1.72% postgres postgres [.] AllocSetReset
> 1.72% postgres postgres [.] RelationPutHeapTuple
> 1.47% postgres postgres [.] heap_prepare_insert
> 1.31% postgres postgres [.] heap_multi_insert
> 1.25% postgres postgres [.] textin
> 1.24% postgres postgres [.] int4in
> 1.05% postgres postgres [.] tts_buffer_heap_clear
> 0.85% postgres postgres [.] pg_any_to_server
> 0.80% postgres postgres [.] pg_comp_crc32c_sse42
> 0.77% postgres postgres [.] cstring_to_text_with_len
> 0.69% postgres postgres [.] AllocSetFree
> 0.60% postgres postgres [.] appendBinaryStringInfo
> 0.55% postgres postgres [.] tts_buffer_heap_materialize.part.0
> 0.54% postgres postgres [.] palloc
> 0.54% postgres libc-2.30.so [.] __memmove_avx_unaligned
> 0.51% postgres postgres [.] palloc0
> 0.51% postgres postgres [.] pg_encoding_max_length
> 0.48% postgres postgres [.] enlargeStringInfo
> 0.47% postgres postgres [.] ExecStoreVirtualTuple
> 0.45% postgres postgres [.] PageAddItemExtended
>
> So that confirms that the parsing is a huge chunk of overhead with
> current splitting into lines being the largest portion. Amdahl's law
> says that splitting into tuples needs to be made fast before
> parallelizing makes any sense.
>
I had taken perf report with the same test data that you had used, I was getting the following results:
.....
+ 99.61% 0.00% postgres postgres [.] PortalRun
+ 99.61% 0.00% postgres postgres [.] PortalRunMulti
+ 99.61% 0.00% postgres postgres [.] PortalRunUtility
+ 99.61% 0.00% postgres postgres [.] ProcessUtility
+ 99.61% 0.00% postgres postgres [.] standard_ProcessUtility
+ 99.61% 0.00% postgres postgres [.] DoCopy
+ 99.30% 0.94% postgres postgres [.] CopyFrom
+ 51.61% 7.76% postgres postgres [.] NextCopyFrom
+ 23.66% 0.01% postgres postgres [.] CopyMultiInsertInfoFlush
+ 23.61% 0.28% postgres postgres [.] CopyMultiInsertBufferFlush
+ 21.99% 1.02% postgres postgres [.] NextCopyFromRawFields
+ 19.79% 0.01% postgres postgres [.] table_multi_insert
+ 19.32% 3.00% postgres postgres [.] heap_multi_insert
+ 18.27% 2.44% postgres postgres [.] InputFunctionCall
+ 15.19% 0.89% postgres postgres [.] CopyReadLine
+ 13.05% 0.18% postgres postgres [.] ExecMaterializeSlot
+ 13.00% 0.55% postgres postgres [.] tts_buffer_heap_materialize
+ 12.31% 1.77% postgres postgres [.] heap_form_tuple
+ 10.43% 0.45% postgres postgres [.] int4in
+ 10.18% 8.92% postgres postgres [.] CopyReadLineText
......

In my results, table_multi_insert accounted for nearly 20% of the execution time. Also, I feel that once CopyReadLine has produced a few tuples, the parallel workers should be able to start consuming and processing that data; we need not wait for the complete tokenisation to be finished. Once a few tuples are tokenised, the parallel workers should start consuming the data in parallel while tokenisation continues simultaneously. In this way, once the copy is done in parallel, the total execution time should be the CopyReadLine time plus a small delta of processing time.

Thoughts?
On Thu, Mar 12, 2020 at 6:39 PM vignesh C <vignesh21@gmail.com> wrote:
>

Existing copy code flow:
Copy supports the copy operation from csv, txt & bin format files. For processing the csv & text formats, the server reads a 64KB chunk, or less if the remaining contents of the input file are smaller than that. The server then reads one tuple of data and processes it. If the tuple that was generated consumed less than the 64KB of data read, the server will try to generate another tuple for processing from the remaining unprocessed data. If it is not able to generate one tuple from the unprocessed data, it will read a further 64KB (or whatever lesser size remains in the file) and send the tuple for processing. This process is repeated till the complete file is processed. For processing a bin format file the flow is slightly different: the server reads the number of columns that are present, then reads each column's size followed by the actual column contents, repeating this for all the columns, and then processes the tuple that is generated. This process is repeated for all the remaining tuples in the bin file. The tuple processing flow is the same in all the formats. Currently all the operations happen sequentially. This project will help in parallelizing the copy operation.

I'm planning to do a POC of parallel copy with the below design.

Proposed syntax:
COPY table_name FROM 'copy_file' WITH (FORMAT 'format', PARALLEL 'workers');

Users can specify the number of workers that must be used for copying the data in parallel. Here 'workers' is the number of workers to be used for the parallel copy operation, apart from the leader. The leader is responsible for reading the data from the input file and generating work for the workers. The leader will start a transaction and share it with the workers; all workers will use the same transaction to insert the records.

The leader will create a circular queue and share it across the workers. The circular queue will be present in DSM. The leader will use a fixed-size queue to share the contents between the leader and the workers; currently we will have 100 elements in the queue. This will be created before the workers are started and shared with the workers. The data structures that are required by the parallel workers will be initialized by the leader, the size required in DSM will be calculated, and the necessary keys will be loaded in the DSM. The specified number of workers will then be launched.

The leader will read the table data from the file and copy the contents to the queue element by element. Each element in the queue will have a 64KB DSA, which will be used to store tuple contents from the file. The leader will try to copy as much content as possible within one 64KB DSA queue element; we intend to store at least one tuple in each queue element. There are some cases where the 64KB space may not be enough to store a single tuple, mostly where the table has toast data present and a single tuple can be more than 64KB in size. In these scenarios we will extend the DSA space accordingly. We cannot change the size of the DSM once the workers are launched, whereas in the case of DSA we can free the DSA pointer and reallocate it based on the memory size required. This is the very reason for choosing DSA over DSM for storing the data that must be inserted into the relation. The leader will keep loading data into the queue till the queue becomes full.

The leader will transform its role into a worker either when the queue is full or when the complete file has been processed. Once the queue is full, the leader will switch its role to become a worker, and it will continue to act as a worker till 25% of the elements in the queue have been consumed by the workers. Once there is at least 25% space available in the queue, the leader-turned-worker will switch its role back to become the leader again. The above process of filling the queue will be continued by the leader until the whole file is processed, and the leader will then wait until the workers finish processing the queue elements.

The copy-from functionality is also used during initdb operations, where the copy is intended to be performed in single (non-parallel) mode, and the user can also still continue running in non-parallel mode. In the non-parallel case, memory allocation will happen using palloc instead of DSM/DSA, and most of the flow will be the same in both the parallel and non-parallel cases.

We had a couple of options for the way in which queue elements can be stored.
Option 1: Each element (DSA chunk) will contain tuples such that each tuple is preceded by the length of the tuple. So the tuples will be arranged like (Length of tuple-1, tuple-1), (Length of tuple-2, tuple-2), ....
Option 2: Each element (DSA chunk) will contain only tuples, (tuple-1), (tuple-2), ....., and we will have a second ring-buffer which contains a start-offset or length of each tuple.
The old design used to generate one tuple of data and process it tuple by tuple. In the new design, the server will generate multiple tuples of data per queue element, and the worker will then process the data tuple by tuple. As we are processing the data tuple by tuple, I felt both of the options are almost the same. However, option 1 was chosen over option 2 as we can save some space that would otherwise be required by another variable in each element of the queue.

The parallel workers will read the tuples from the queue and perform all of the following operations: a) where clause handling, b) converting the tuple to columns, c) adding default/null values for the columns that are not present in that record, d) finding the partition if it is a partitioned table, e) before-row insert triggers and constraints, f) insertion of the data. The rest of the flow is the same as the existing code.

Enhancements after the POC is done: Initially we plan to use the number of workers that the user has specified; later we will do some experiments and think of an approach to choose the number of workers automatically after processing sample contents from the file. Initially we plan to use 100 elements in the queue; later we will experiment to find the right queue size once the basic patch is ready. Initially we plan to start the transaction from the leader and share it across the workers; later we will change this in such a way that the first process that performs an insert operation starts the transaction and shares it with the rest.

Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
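For concreteness, a minimal C sketch of the queue layout described above might look like the following. All names here (ParallelCopyQueue, ParallelCopyElement, the PC_* constants and pc_queue_full) are hypothetical, invented only to illustrate the design and not taken from any posted patch; dsa_pointer and the pg_atomic_* operations are the real PostgreSQL facilities the proposal refers to.

    #include "postgres.h"
    #include "port/atomics.h"
    #include "utils/dsa.h"

    #define PC_QUEUE_NELEMENTS  100             /* fixed number of queue slots */
    #define PC_ELEMENT_SIZE     (64 * 1024)     /* default 64KB of tuple data */

    typedef struct ParallelCopyElement
    {
        dsa_pointer tuple_data;     /* 64KB DSA chunk; freed and reallocated
                                     * larger if a single tuple exceeds 64KB */
        uint32      data_len;       /* bytes of raw tuple data stored */
        uint32      ntuples;        /* complete tuples packed into this chunk,
                                     * each stored as (length, tuple) per
                                     * option 1 above */
    } ParallelCopyElement;

    typedef struct ParallelCopyQueue
    {
        /* Leader advances 'filled'; workers advance 'consumed'. */
        pg_atomic_uint64    filled;
        pg_atomic_uint64    consumed;
        ParallelCopyElement elements[PC_QUEUE_NELEMENTS];
    } ParallelCopyQueue;

    /* The leader would switch to worker mode when this becomes true. */
    static inline bool
    pc_queue_full(ParallelCopyQueue *queue)
    {
        return pg_atomic_read_u64(&queue->filled) -
               pg_atomic_read_u64(&queue->consumed) >= PC_QUEUE_NELEMENTS;
    }

Under this sketch the leader fills elements[filled % PC_QUEUE_NELEMENTS] and then bumps 'filled', workers do the mirror image on 'consumed', and the 25% threshold for switching back from worker to leader would just be a comparison against (filled - consumed).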
On Tue, 7 Apr 2020 at 08:24, vignesh C <vignesh21@gmail.com> wrote:
> Leader will create a circular queue and share it across the workers. The circular queue will be present in DSM. Leader will be using a fixed size queue to share the contents between the leader and the workers. Currently we will have 100 elements present in the queue. This will be created before the workers are started and shared with the workers. The data structures that are required by the parallel workers will be initialized by the leader, the size required in dsm will be calculated and the necessary keys will be loaded in the DSM. The specified number of workers will then be launched. Leader will read the table data from the file and copy the contents to the queue element by element. Each element in the queue will have 64K size DSA. This DSA will be used to store tuple contents from the file. The leader will try to copy as much content as possible within one 64K DSA queue element. We intend to store at least one tuple in each queue element. There are some cases where the 64K space may not be enough to store a single tuple. Mostly in cases where the table has toast data present and the single tuple can be more than 64K size. In these scenarios we will extend the DSA space accordingly. We cannot change the size of the dsm once the workers are launched. Whereas in case of DSA we can free the dsa pointer and reallocate the dsa pointer based on the memory size required. This is the very reason for choosing DSA over DSM for storing the data that must be inserted into the relation.

I think the element-based approach and the requirement that all tuples fit into the queue make things unnecessarily complex. The approach I detailed earlier allows for tuples to be bigger than the buffer. In that case a worker will claim the long tuple from the ring queue of tuple start positions, and start copying it into its local line_buf. This can wrap around the buffer multiple times until the next start position shows up. At that point this worker can proceed with inserting the tuple and the next worker will claim the next tuple.

This way nothing needs to be resized, there is no risk of a file with huge tuples running the system out of memory because each element would have to be reallocated to be huge, and the number of elements is not something that has to be tuned.

> We had a couple of options for the way in which queue elements can be stored. Option 1: Each element (DSA chunk) will contain tuples such that each tuple will be preceded by the length of the tuple. So the tuples will be arranged like (Length of tuple-1, tuple-1), (Length of tuple-2, tuple-2), .... Or Option 2: Each element (DSA chunk) will contain only tuples (tuple-1), (tuple-2), ..... And we will have a second ring-buffer which contains a start-offset or length of each tuple. The old design used to generate one tuple of data and process tuple by tuple. In the new design, the server will generate multiple tuples of data per queue element. The worker will then process data tuple by tuple. As we are processing the data tuple by tuple, I felt both of the options are almost the same. However Design1 was chosen over Design 2 as we can save up on some space that was required by another variable in each element of the queue.

With option 1 it's not possible to read input data into shared memory, and there needs to be an extra memcpy in the time-critical sequential flow of the leader. With option 2 data could be read directly into the shared memory buffer. With future async io support, reading and looking for tuple boundaries could be performed concurrently.

Regards,
Ants Aasma
Cybertec
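To illustrate the scheme Ants describes (a shared byte ring plus a ring of tuple start offsets, with oversized tuples simply wrapping the byte ring), here is a rough, simplified C sketch. The names (PcShared, pc_claim_next_tuple) and sizes are hypothetical. For brevity this version only claims a tuple once the next start offset is already published and copies byte by byte; the described scheme would instead start copying immediately, waiting for the next start position to show up, and would memcpy the one or two contiguous spans. Flow control (the reader waiting for workers before overwriting ring bytes) is also omitted.

    #include "postgres.h"
    #include "lib/stringinfo.h"
    #include "storage/spin.h"

    #define PC_DATA_RING_SIZE   (1024 * 1024)   /* shared byte ring */
    #define PC_OFFSET_RING_SIZE 8192            /* ring of tuple start offsets */

    typedef struct PcShared
    {
        slock_t     mutex;          /* protects next_tuple and ntuples */
        uint32      next_tuple;     /* next unclaimed offset-ring entry */
        uint32      ntuples;        /* offset-ring entries published so far */
        uint64      start_offset[PC_OFFSET_RING_SIZE];
        char        data[PC_DATA_RING_SIZE];    /* raw input bytes */
    } PcShared;

    /* Claim the next complete tuple and copy it into this worker's line_buf. */
    static bool
    pc_claim_next_tuple(PcShared *shared, StringInfo line_buf)
    {
        uint64      start;
        uint64      end;

        SpinLockAcquire(&shared->mutex);
        /* Need the next tuple's start offset to know where ours ends. */
        if (shared->next_tuple + 1 >= shared->ntuples)
        {
            SpinLockRelease(&shared->mutex);
            return false;
        }
        start = shared->start_offset[shared->next_tuple % PC_OFFSET_RING_SIZE];
        end = shared->start_offset[(shared->next_tuple + 1) % PC_OFFSET_RING_SIZE];
        shared->next_tuple++;
        SpinLockRelease(&shared->mutex);

        /* A tuple longer than the byte ring just wraps it multiple times. */
        resetStringInfo(line_buf);
        for (uint64 off = start; off < end; off++)
            appendStringInfoChar(line_buf, shared->data[off % PC_DATA_RING_SIZE]);
        return true;
    }

Note that nothing here ever needs resizing: the worker's local line_buf grows as needed, while the shared structures stay a fixed size regardless of tuple length.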
On Tue, Apr 7, 2020 at 7:08 PM Ants Aasma <ants@cybertec.at> wrote:
>
> On Tue, 7 Apr 2020 at 08:24, vignesh C <vignesh21@gmail.com> wrote:
> > Leader will create a circular queue and share it across the workers. The circular queue will be present in DSM. Leader will be using a fixed size queue to share the contents between the leader and the workers. Currently we will have 100 elements present in the queue. This will be created before the workers are started and shared with the workers. The data structures that are required by the parallel workers will be initialized by the leader, the size required in dsm will be calculated and the necessary keys will be loaded in the DSM. The specified number of workers will then be launched. Leader will read the table data from the file and copy the contents to the queue element by element. Each element in the queue will have 64K size DSA. This DSA will be used to store tuple contents from the file. The leader will try to copy as much content as possible within one 64K DSA queue element. We intend to store at least one tuple in each queue element. There are some cases where the 64K space may not be enough to store a single tuple. Mostly in cases where the table has toast data present and the single tuple can be more than 64K size. In these scenarios we will extend the DSA space accordingly. We cannot change the size of the dsm once the workers are launched. Whereas in case of DSA we can free the dsa pointer and reallocate the dsa pointer based on the memory size required. This is the very reason for choosing DSA over DSM for storing the data that must be inserted into the relation.
>
> I think the element based approach and requirement that all tuples fit into the queue makes things unnecessarily complex. The approach I detailed earlier allows for tuples to be bigger than the buffer. In that case a worker will claim the long tuple from the ring queue of tuple start positions, and starts copying it into its local line_buf. This can wrap around the buffer multiple times until the next start position shows up. At that point this worker can proceed with inserting the tuple and the next worker will claim the next tuple.
>

IIUC, with the fixed-size buffer, parallelism might suffer a bit, because the reader process won't be able to continue until the worker has copied the data from the shared buffer to its local buffer. I think somewhat more leader-worker coordination will be required with a fixed buffer size. However, as you pointed out, we can't allow the buffer to grow to the maximum size possible for all tuples, as that might require a lot of memory. One idea could be that we allow it for the first such tuple, and then if any other element/chunk in the queue requires more memory than the default 64KB, we always fall back to using the memory we have allocated for that first chunk. This will allow us to use extra memory for only one tuple, and it won't hurt parallelism much, as in many cases not all tuples will be so large.

I think in the proposed approach a queue element is nothing but a way to divide the work among workers based on size rather than on the number of tuples. Say we try to divide the work among workers based on start offsets: that can be more tricky, because it could lead either to a lot of contention, if we choose say one offset per worker (basically copy the data for one tuple, process it and then pick the next tuple), or to an unequal division of work, because some tuples can be smaller and others can be bigger. I guess division based on size would be a better idea. OTOH, I see the advantage of your approach as well and I will think more on it.

> > We had a couple of options for the way in which queue elements can be stored. Option 1: Each element (DSA chunk) will contain tuples such that each tuple will be preceded by the length of the tuple. So the tuples will be arranged like (Length of tuple-1, tuple-1), (Length of tuple-2, tuple-2), .... Or Option 2: Each element (DSA chunk) will contain only tuples (tuple-1), (tuple-2), ..... And we will have a second ring-buffer which contains a start-offset or length of each tuple. The old design used to generate one tuple of data and process tuple by tuple. In the new design, the server will generate multiple tuples of data per queue element. The worker will then process data tuple by tuple. As we are processing the data tuple by tuple, I felt both of the options are almost the same. However Design1 was chosen over Design 2 as we can save up on some space that was required by another variable in each element of the queue.
>
> With option 1 it's not possible to read input data into shared memory and there needs to be an extra memcpy in the time critical sequential flow of the leader. With option 2 data could be read directly into the shared memory buffer. With future async io support, reading and looking for tuple boundaries could be performed concurrently.
>

Yeah, option 2 sounds better.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
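A rough sketch of the "enlarge once, then fall back" idea from the previous message: only the first oversized tuple gets a chunk larger than the 64KB default, and any later oversized tuple reuses that same chunk, serializing on it rather than allocating more memory. The helper name and leader-local bookkeeping here are hypothetical; dsa_allocate/dsa_free are the real DSA calls.

    #include "postgres.h"
    #include "utils/dsa.h"

    #define PC_DEFAULT_CHUNK_SIZE   (64 * 1024)

    /* Leader-local bookkeeping for the single enlarged chunk. */
    static dsa_pointer pc_big_chunk = InvalidDsaPointer;
    static Size pc_big_chunk_size = 0;

    /*
     * Return a DSA chunk that can hold 'needed' bytes. Requests beyond the
     * 64KB default are all funneled through one enlarged chunk, so at most
     * one chunk ever grows; the caller must wait until that chunk is free
     * before reusing it for another oversized tuple.
     */
    static dsa_pointer
    pc_get_chunk(dsa_area *area, Size needed)
    {
        if (needed <= PC_DEFAULT_CHUNK_SIZE)
            return dsa_allocate(area, PC_DEFAULT_CHUNK_SIZE);

        if (pc_big_chunk == InvalidDsaPointer || pc_big_chunk_size < needed)
        {
            if (pc_big_chunk != InvalidDsaPointer)
                dsa_free(area, pc_big_chunk);
            pc_big_chunk = dsa_allocate(area, needed);
            pc_big_chunk_size = needed;
        }
        return pc_big_chunk;
    }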
On Tue, Apr 7, 2020 at 9:38 AM Ants Aasma <ants@cybertec.at> wrote: > I think the element based approach and requirement that all tuples fit > into the queue makes things unnecessarily complex. The approach I > detailed earlier allows for tuples to be bigger than the buffer. In > that case a worker will claim the long tuple from the ring queue of > tuple start positions, and starts copying it into its local line_buf. > This can wrap around the buffer multiple times until the next start > position shows up. At that point this worker can proceed with > inserting the tuple and the next worker will claim the next tuple. > > This way nothing needs to be resized, there is no risk of a file with > huge tuples running the system out of memory because each element will > be reallocated to be huge and the number of elements is not something > that has to be tuned. +1. This seems like the right way to do it. > > We had a couple of options for the way in which queue elements can be stored. > > Option 1: Each element (DSA chunk) will contain tuples such that each > > tuple will be preceded by the length of the tuple. So the tuples will > > be arranged like (Length of tuple-1, tuple-1), (Length of tuple-2, > > tuple-2), .... Or Option 2: Each element (DSA chunk) will contain only > > tuples (tuple-1), (tuple-2), ..... And we will have a second > > ring-buffer which contains a start-offset or length of each tuple. The > > old design used to generate one tuple of data and process tuple by > > tuple. In the new design, the server will generate multiple tuples of > > data per queue element. The worker will then process data tuple by > > tuple. As we are processing the data tuple by tuple, I felt both of > > the options are almost the same. However Design1 was chosen over > > Design 2 as we can save up on some space that was required by another > > variable in each element of the queue. > > With option 1 it's not possible to read input data into shared memory > and there needs to be an extra memcpy in the time critical sequential > flow of the leader. With option 2 data could be read directly into the > shared memory buffer. With future async io support, reading and > looking for tuple boundaries could be performed concurrently. But option 2 still seems significantly worse than your proposal above, right? I really think we don't want a single worker in charge of finding tuple boundaries for everybody. That adds a lot of unnecessary inter-process communication and synchronization. Each process should just get the next tuple starting after where the last one ended, and then advance the end pointer so that the next process can do the same thing. Vignesh's proposal involves having a leader process that has to switch roles - he picks an arbitrary 25% threshold - and if it doesn't switch roles at the right time, performance will be impacted. If the leader doesn't get scheduled in time to refill the queue before it runs completely empty, workers will have to wait. Ants's scheme avoids that risk: whoever needs the next tuple reads the next line. There's no need to ever wait for the leader because there is no leader. I think it's worth enumerating some of the other ways that a project in this area can fail to achieve good speedups, so that we can try to avoid those that are avoidable and be aware of the others: - If we're unable to supply data to the COPY process as fast as the workers could load it, then speed will be limited at that point. 
We know reading the file from disk is pretty fast compared to what a single process can do. I'm not sure we've tested what happens with a network socket. It will depend on the network speed some, but it might be useful to know how many MB/s we can pump through over a UNIX socket. - The portion of the time that is used to split the lines is not easily parallelizable. That seems to be a fairly small percentage for a reasonably wide table, but it looks significant (13-18%) for a narrow table. Such cases will gain less performance and be limited to a smaller number of workers. I think we also need to be careful about files whose lines are longer than the size of the buffer. If we're not careful, we could get a significant performance drop-off in such cases. We should make sure to pick an algorithm that seems like it will handle such cases without serious regressions and check that a file composed entirely of such long lines is handled reasonably efficiently. - There could be index contention. Let's suppose that we can read data super fast and break it up into lines super fast. Maybe the file we're reading is fully RAM-cached and the lines are long. Now all of the backends are inserting into the indexes at the same time, and they might be trying to insert into the same pages. If so, lock contention could become a factor that hinders performance. - There could also be similar contention on the heap. Say the tuples are narrow, and many backends are trying to insert tuples into the same heap page at the same time. This would lead to many lock/unlock cycles. This could be avoided if the backends avoid targeting the same heap pages, but I'm not sure there's any reason to expect that they would do so unless we make some special provision for it. - These problems could also arise with respect to TOAST table insertions, either on the TOAST table itself or on its index. This would only happen if the table contains a lot of toastable values, but that could be the case: imagine a table with a bunch of columns each of which contains a long string that isn't very compressible. - What else? I bet the above list is not comprehensive. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, 8 Apr 2020 at 22:30, Robert Haas <robertmhaas@gmail.com> wrote:
> - If we're unable to supply data to the COPY process as fast as the workers could load it, then speed will be limited at that point. We know reading the file from disk is pretty fast compared to what a single process can do. I'm not sure we've tested what happens with a network socket. It will depend on the network speed some, but it might be useful to know how many MB/s we can pump through over a UNIX socket.

This raises a good point. If at some point we want to minimize the amount of memory copies, then we might want to allow RDMA to write incoming network traffic directly into a distributing ring buffer, which would include the protocol-level headers. But at this point we are so far off from network reception becoming a bottleneck that I don't think it's worth holding anything up to allow for zero-copy transfers.

> - The portion of the time that is used to split the lines is not easily parallelizable. That seems to be a fairly small percentage for a reasonably wide table, but it looks significant (13-18%) for a narrow table. Such cases will gain less performance and be limited to a smaller number of workers. I think we also need to be careful about files whose lines are longer than the size of the buffer. If we're not careful, we could get a significant performance drop-off in such cases. We should make sure to pick an algorithm that seems like it will handle such cases without serious regressions and check that a file composed entirely of such long lines is handled reasonably efficiently.

I don't have a proof, but my gut feel tells me that it's fundamentally impossible to ingest csv without a serial line-ending/comment tokenization pass.

The current line splitting algorithm is terrible. I'm currently working with some scientific data where on ingestion CopyReadLineText() is about 25% on profiles. I prototyped a replacement that can do ~8GB/s on narrow rows, more on wider ones.

For rows that are consistently wider than the input buffer I think parallelism will still give a win - the serial phase is just a memcpy through a ringbuffer, after which a worker goes away to perform the actual insert, letting the next worker read the data. The memcpy is already happening today; CopyReadLineText() copies the input buffer into a StringInfo, so the only extra work is synchronization between leader and worker.

> - There could be index contention. Let's suppose that we can read data super fast and break it up into lines super fast. Maybe the file we're reading is fully RAM-cached and the lines are long. Now all of the backends are inserting into the indexes at the same time, and they might be trying to insert into the same pages. If so, lock contention could become a factor that hinders performance.

Different data distribution strategies can have an effect on that. Dealing out input data in larger or smaller chunks will have a considerable effect on contention, btree page splits and all kinds of things. I think the common theme would be a push to increase chunk size to reduce contention.

> - There could also be similar contention on the heap. Say the tuples are narrow, and many backends are trying to insert tuples into the same heap page at the same time. This would lead to many lock/unlock cycles. This could be avoided if the backends avoid targeting the same heap pages, but I'm not sure there's any reason to expect that they would do so unless we make some special provision for it.

I thought there already was a provision for that. Am I mis-remembering?

> - What else? I bet the above list is not comprehensive.

I think the parallel copy patch needs to concentrate on splitting input data to workers. After that, any performance issues would be basically the same as for a normal parallel insert workload. There may well be bottlenecks there, but those could be tackled independently.

Regards,
Ants Aasma
Cybertec
On Thu, Apr 9, 2020 at 1:00 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Apr 7, 2020 at 9:38 AM Ants Aasma <ants@cybertec.at> wrote:
> >
> > With option 1 it's not possible to read input data into shared memory and there needs to be an extra memcpy in the time critical sequential flow of the leader. With option 2 data could be read directly into the shared memory buffer. With future async io support, reading and looking for tuple boundaries could be performed concurrently.
>
> But option 2 still seems significantly worse than your proposal above, right?
>
> I really think we don't want a single worker in charge of finding tuple boundaries for everybody. That adds a lot of unnecessary inter-process communication and synchronization. Each process should just get the next tuple starting after where the last one ended, and then advance the end pointer so that the next process can do the same thing. Vignesh's proposal involves having a leader process that has to switch roles - he picks an arbitrary 25% threshold - and if it doesn't switch roles at the right time, performance will be impacted. If the leader doesn't get scheduled in time to refill the queue before it runs completely empty, workers will have to wait. Ants's scheme avoids that risk: whoever needs the next tuple reads the next line. There's no need to ever wait for the leader because there is no leader.
>

Hmm, I think in his scheme also there is a single reader process. See the email above [1] where he described how it should work. I think the difference is in the division of work. AFAIU, in Ants's scheme, the worker needs to pick the work from the tuple-offset queue, whereas in Vignesh's scheme it will be based on size (each worker will probably get 64KB of work). I think in his scheme the main thing to figure out is how many tuple offsets should be assigned to each worker in one go, so that we don't unnecessarily add contention for finding the next work unit.

I think we need to find the right balance between size and the number of tuples. I am trying to consider size here because larger tuples will probably require more time, as we need to allocate more space for them and they probably also require more processing time. One way to achieve that could be that each worker tries to claim 500 tuples (or some other threshold number), but if their size is greater than 64KB (or some other threshold size), then the worker tries with a smaller number of tuples, such that the size of the chunk of tuples stays below the threshold size.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
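The claim heuristic suggested in the message above (claim up to 500 tuples, but stop early once their combined size passes 64KB) could look roughly like the following C sketch; the thresholds, names, and the assumption that per-tuple lengths are already available are all illustrative only.

    #define PC_MAX_TUPLES_PER_CLAIM 500
    #define PC_MAX_BYTES_PER_CLAIM  (64 * 1024)

    /*
     * Given the lengths of the available, not-yet-claimed tuples, decide how
     * many of them one worker should claim in a single locked operation.
     */
    static uint32
    pc_tuples_to_claim(const uint64 *tuple_len, uint32 navailable)
    {
        uint64      total = 0;
        uint32      n = 0;

        while (n < navailable && n < PC_MAX_TUPLES_PER_CLAIM)
        {
            /* Always claim at least one tuple, however large it is. */
            if (n > 0 && total + tuple_len[n] > PC_MAX_BYTES_PER_CLAIM)
                break;
            total += tuple_len[n];
            n++;
        }
        return n;
    }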
On Thu, Apr 9, 2020 at 4:20 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Apr 9, 2020 at 1:00 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Tue, Apr 7, 2020 at 9:38 AM Ants Aasma <ants@cybertec.at> wrote: > > > > > > With option 1 it's not possible to read input data into shared memory > > > and there needs to be an extra memcpy in the time critical sequential > > > flow of the leader. With option 2 data could be read directly into the > > > shared memory buffer. With future async io support, reading and > > > looking for tuple boundaries could be performed concurrently. > > > > But option 2 still seems significantly worse than your proposal above, right? > > > > I really think we don't want a single worker in charge of finding > > tuple boundaries for everybody. That adds a lot of unnecessary > > inter-process communication and synchronization. Each process should > > just get the next tuple starting after where the last one ended, and > > then advance the end pointer so that the next process can do the same > > thing. Vignesh's proposal involves having a leader process that has to > > switch roles - he picks an arbitrary 25% threshold - and if it doesn't > > switch roles at the right time, performance will be impacted. If the > > leader doesn't get scheduled in time to refill the queue before it > > runs completely empty, workers will have to wait. Ants's scheme avoids > > that risk: whoever needs the next tuple reads the next line. There's > > no need to ever wait for the leader because there is no leader. > > > > Hmm, I think in his scheme also there is a single reader process. See > the email above [1] where he described how it should work. > oops, I forgot to specify the link to the email. See https://www.postgresql.org/message-id/CANwKhkO87A8gApobOz_o6c9P5auuEG1W2iCz0D5CfOeGgAnk3g%40mail.gmail.com -- With Regards, Amit Kapila. EnterpriseDB: http://www.enterprisedb.com
On Thu, Apr 9, 2020 at 3:55 AM Ants Aasma <ants@cybertec.at> wrote:
>
> On Wed, 8 Apr 2020 at 22:30, Robert Haas <robertmhaas@gmail.com> wrote:
> > - The portion of the time that is used to split the lines is not easily parallelizable. That seems to be a fairly small percentage for a reasonably wide table, but it looks significant (13-18%) for a narrow table. Such cases will gain less performance and be limited to a smaller number of workers. I think we also need to be careful about files whose lines are longer than the size of the buffer. If we're not careful, we could get a significant performance drop-off in such cases. We should make sure to pick an algorithm that seems like it will handle such cases without serious regressions and check that a file composed entirely of such long lines is handled reasonably efficiently.
>
> I don't have a proof, but my gut feel tells me that it's fundamentally impossible to ingest csv without a serial line-ending/comment tokenization pass.
>

I think even if we try to do it via multiple workers, it might not be better. In such a scheme, every worker needs to update the end boundaries, and the next worker needs to keep checking whether the previous one has updated the end pointer. I think this can add a significant synchronization effort for cases where tuples are 100 or so bytes, which will be a common case.

> The current line splitting algorithm is terrible. I'm currently working with some scientific data where on ingestion CopyReadLineText() is about 25% on profiles. I prototyped a replacement that can do ~8GB/s on narrow rows, more on wider ones.
>

Good to hear. I think that will be a good project on its own, and it might give a boost to parallel copy, as with that we can further reduce the non-parallelizable work unit.

> For rows that are consistently wider than the input buffer I think parallelism will still give a win - the serial phase is just memcpy through a ringbuffer, after which a worker goes away to perform the actual insert, letting the next worker read the data. The memcpy is already happening today, CopyReadLineText() copies the input buffer into a StringInfo, so the only extra work is synchronization between leader and worker.
>
> > - There could also be similar contention on the heap. Say the tuples are narrow, and many backends are trying to insert tuples into the same heap page at the same time. This would lead to many lock/unlock cycles. This could be avoided if the backends avoid targeting the same heap pages, but I'm not sure there's any reason to expect that they would do so unless we make some special provision for it.
>
> I thought there already was a provision for that. Am I mis-remembering?
>

Copy uses heap_multi_insert to insert a batch of tuples, and I think each batch should ideally use a different page, mostly a new one. So I am not sure if this will be a problem, or a problem of a level for which we need to do some special handling. But if this turns out to be a problem, we definitely need some better way to deal with it.

> > - What else? I bet the above list is not comprehensive.
>
> I think parallel copy patch needs to concentrate on splitting input data to workers. After that any performance issues would be basically the same as a normal parallel insert workload. There may well be bottlenecks there, but those could be tackled independently.
>

I agree.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Apr 9, 2020 at 1:00 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Apr 7, 2020 at 9:38 AM Ants Aasma <ants@cybertec.at> wrote: > > I think the element based approach and requirement that all tuples fit > > into the queue makes things unnecessarily complex. The approach I > > detailed earlier allows for tuples to be bigger than the buffer. In > > that case a worker will claim the long tuple from the ring queue of > > tuple start positions, and starts copying it into its local line_buf. > > This can wrap around the buffer multiple times until the next start > > position shows up. At that point this worker can proceed with > > inserting the tuple and the next worker will claim the next tuple. > > > > This way nothing needs to be resized, there is no risk of a file with > > huge tuples running the system out of memory because each element will > > be reallocated to be huge and the number of elements is not something > > that has to be tuned. > > +1. This seems like the right way to do it. > > > > We had a couple of options for the way in which queue elements can be stored. > > > Option 1: Each element (DSA chunk) will contain tuples such that each > > > tuple will be preceded by the length of the tuple. So the tuples will > > > be arranged like (Length of tuple-1, tuple-1), (Length of tuple-2, > > > tuple-2), .... Or Option 2: Each element (DSA chunk) will contain only > > > tuples (tuple-1), (tuple-2), ..... And we will have a second > > > ring-buffer which contains a start-offset or length of each tuple. The > > > old design used to generate one tuple of data and process tuple by > > > tuple. In the new design, the server will generate multiple tuples of > > > data per queue element. The worker will then process data tuple by > > > tuple. As we are processing the data tuple by tuple, I felt both of > > > the options are almost the same. However Design1 was chosen over > > > Design 2 as we can save up on some space that was required by another > > > variable in each element of the queue. > > > > With option 1 it's not possible to read input data into shared memory > > and there needs to be an extra memcpy in the time critical sequential > > flow of the leader. With option 2 data could be read directly into the > > shared memory buffer. With future async io support, reading and > > looking for tuple boundaries could be performed concurrently. > > But option 2 still seems significantly worse than your proposal above, right? > > I really think we don't want a single worker in charge of finding > tuple boundaries for everybody. That adds a lot of unnecessary > inter-process communication and synchronization. Each process should > just get the next tuple starting after where the last one ended, and > then advance the end pointer so that the next process can do the same > thing. Vignesh's proposal involves having a leader process that has to > switch roles - he picks an arbitrary 25% threshold - and if it doesn't > switch roles at the right time, performance will be impacted. If the > leader doesn't get scheduled in time to refill the queue before it > runs completely empty, workers will have to wait. Ants's scheme avoids > that risk: whoever needs the next tuple reads the next line. There's > no need to ever wait for the leader because there is no leader. I agree that if the leader switches the role, then it is possible that sometimes the leader might not produce the work before the queue is empty. 
OTOH, the problem with the approach you are suggesting is that the work will be generated on demand, i.e. there is no specific process that is generating the data while the workers are busy inserting it. So IMHO, if we have a specific leader process, then there will always be work available for all the workers. I agree that we need to find the correct point at which the leader should work as a worker. One idea could be that when the queue is full and there is no space to push more work to the queue, the leader itself processes that work.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, Apr 9, 2020 at 7:49 AM Dilip Kumar <dilipbalaut@gmail.com> wrote: > I agree that if the leader switches the role, then it is possible that > sometimes the leader might not produce the work before the queue is > empty. OTOH, the problem with the approach you are suggesting is that > the work will be generated on-demand, i.e. there is no specific > process who is generating the data while workers are busy inserting > the data. I think you have a point. The way I think things could go wrong if we don't have a leader is if it tends to happen that everyone wants new work at the same time. In that case, everyone will wait at once, whereas if there is a designated process that aggressively queues up work, we could perhaps avoid that. Note that you really have to have the case where everyone wants new work at the exact same moment, because otherwise they just all take turns finding work for themselves, and everything is fine, because nobody's waiting for anybody else to do any work, so everyone is always making forward progress. Now on the other hand, if we do have a leader, and for some reason it's slow in responding, everyone will have to wait. That could happen either because the leader also has other responsibilities, like reading data or helping with the main work when the queue is full, or just because the system is really busy and the leader doesn't get scheduled on-CPU for a while. I am inclined to think that's likely to be a more serious problem. The thing is, the problem of everyone needing new work at the same time can't really keep on repeating. Say that everyone finishes processing their first chunk at the same time. Now everyone needs a second chunk, and in a leaderless system, they must take turns getting it. So they will go in some order. The ones who go later will presumably also finish later, so the end times for the second and following chunks will be scattered. You shouldn't get repeated pile-ups with everyone finishing at the same time, because each time it happens, it will force a little bit of waiting that will spread things out. If they clump up again, that will happen again, but it shouldn't happen every time. But in the case where there is a leader, I don't think there's any similar protection. Suppose we go with the design Vignesh proposes where the leader switches to processing chunks when the queue is more than 75% full. If the leader has a "hiccup" where it gets swapped out or is busy with processing a chunk for a longer-than-normal time, all of the other processes have to wait for it. Now we can probably tune this to some degree by adjusting the queue size and fullness thresholds, but the optimal values for those parameters might be quite different on different systems, depending on load, I/O performance, CPU architecture, etc. If there's a system or configuration where the leader tends not to respond fast enough, it will probably just keep happening, because nothing in the algorithm will tend to shake it out of that bad pattern. I'm not 100% certain that my analysis here is right, so it will be interesting to hear from other people. However, as a general rule, I think we want to minimize the amount of work that can only be done by one process (the leader) and maximize the amount that can be done by any process with whichever one is available taking on the job. In the case of COPY FROM STDIN, the reads from the network socket can only be done by the one process connected to it. 
In the case of COPY from a file, even that could be rotated around, if all processes open the file individually and seek to the appropriate offset. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi,

On April 9, 2020 4:01:43 AM PDT, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Apr 9, 2020 at 3:55 AM Ants Aasma <ants@cybertec.at> wrote:
> >
> > On Wed, 8 Apr 2020 at 22:30, Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > > - The portion of the time that is used to split the lines is not easily parallelizable. That seems to be a fairly small percentage for a reasonably wide table, but it looks significant (13-18%) for a narrow table. Such cases will gain less performance and be limited to a smaller number of workers. I think we also need to be careful about files whose lines are longer than the size of the buffer. If we're not careful, we could get a significant performance drop-off in such cases. We should make sure to pick an algorithm that seems like it will handle such cases without serious regressions and check that a file composed entirely of such long lines is handled reasonably efficiently.
> >
> > I don't have a proof, but my gut feel tells me that it's fundamentally impossible to ingest csv without a serial line-ending/comment tokenization pass.

I can't quite see a way either. But even if it were, I have a hard time seeing parallelizing that path as the right thing.

> I think even if we try to do it via multiple workers it might not be better. In such a scheme, every worker needs to update the end boundaries and the next worker to keep a check if the previous has updated the end pointer. I think this can add a significant synchronization effort for cases where tuples are of 100 or so bytes which will be a common case.

It seems like it'd also have terrible caching and instruction-level parallelism behavior. By constantly switching the process that analyzes boundaries, the current data will have to be brought into L1/registers, rather than staying there.

I'm fairly certain that we do *not* want to distribute input data between processes on a single tuple basis. Probably not even below a few hundred kb. If there's any sort of natural clustering in the loaded data - extremely common, think timestamps - splitting on a granular basis will make indexing much more expensive. And have a lot more contention.

> > The current line splitting algorithm is terrible. I'm currently working with some scientific data where on ingestion CopyReadLineText() is about 25% on profiles. I prototyped a replacement that can do ~8GB/s on narrow rows, more on wider ones.

We should really replace the entire copy parsing code. It's terrible.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Thu, Apr 9, 2020 at 2:55 PM Andres Freund <andres@anarazel.de> wrote:
> I'm fairly certain that we do *not* want to distribute input data between processes on a single tuple basis. Probably not even below a few hundred kb. If there's any sort of natural clustering in the loaded data - extremely common, think timestamps - splitting on a granular basis will make indexing much more expensive. And have a lot more contention.

That's a fair point. I think the solution ought to be that once any process starts finding line endings, it continues until it's grabbed at least a certain amount of data for itself. Then it stops and lets some other process grab a chunk of data.

Or are you arguing that there should be only one process that's allowed to find line endings for the entire duration of the load?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On April 9, 2020 12:29:09 PM PDT, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Apr 9, 2020 at 2:55 PM Andres Freund <andres@anarazel.de> wrote:
> > I'm fairly certain that we do *not* want to distribute input data between processes on a single tuple basis. Probably not even below a few hundred kb. If there's any sort of natural clustering in the loaded data - extremely common, think timestamps - splitting on a granular basis will make indexing much more expensive. And have a lot more contention.
>
> That's a fair point. I think the solution ought to be that once any process starts finding line endings, it continues until it's grabbed at least a certain amount of data for itself. Then it stops and lets some other process grab a chunk of data.
>
> Or are you arguing that there should be only one process that's allowed to find line endings for the entire duration of the load?

I've not yet read the whole thread. So I'm probably restating ideas.

Imo, yes, there should be only one process doing the chunking. For ILP, cache efficiency, but also because the leader is the only process with access to the network socket. It should load input data into one large buffer that's shared across processes. There should be a separate ringbuffer with tuple/partial tuple (for huge tuples) offsets. Worker processes should grab large chunks of offsets from the offset ringbuffer. If the ringbuffer is not full, the worker chunks should be reduced in size.

Given that everything stalls if the leader doesn't accept further input data, as well as when there are no available split chunks, it doesn't seem like a good idea to have the leader do other work.

I don't think optimizing/targeting copy from local files, where multiple processes could read, is useful. COPY STDIN is the only thing that practically matters.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Thu, Apr 9, 2020 at 4:00 PM Andres Freund <andres@anarazel.de> wrote: > I've not yet read the whole thread. So I'm probably restating ideas. Yeah, but that's OK. > Imo, yes, there should be only one process doing the chunking. For ilp, cache efficiency, but also because the leader isthe only process with access to the network socket. It should load input data into one large buffer that's shared acrossprocesses. There should be a separate ringbuffer with tuple/partial tuple (for huge tuples) offsets. Worker processesshould grab large chunks of offsets from the offset ringbuffer. If the ringbuffer is not full, the worker chunksshould be reduced in size. My concern here is that it's going to be hard to avoid processes going idle. If the leader does nothing at all once the ring buffer is full, it's wasting time that it could spend processing a chunk. But if it picks up a chunk, then it might not get around to refilling the buffer before other processes are idle with no work to do. Still, it might be the case that having the process that is reading the data also find the line endings is so fast that it makes no sense to split those two tasks. After all, whoever just read the data must have it in cache, and that helps a lot. > Given that everything stalls if the leader doesn't accept further input data, as well as when there are no available splittedchunks, it doesn't seem like a good idea to have the leader do other work. > > I don't think optimizing/targeting copy from local files, where multiple processes could read, is useful. COPY STDIN isthe only thing that practically matters. Yeah, I think Amit has been thinking primarily in terms of COPY from files, and I've been encouraging him to at least consider the STDIN case. But I think you're right, and COPY FROM STDIN should be the design center for this feature. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi,

On 2020-04-10 07:40:06 -0400, Robert Haas wrote:
> On Thu, Apr 9, 2020 at 4:00 PM Andres Freund <andres@anarazel.de> wrote:
> > Imo, yes, there should be only one process doing the chunking. For ILP, cache efficiency, but also because the leader is the only process with access to the network socket. It should load input data into one large buffer that's shared across processes. There should be a separate ringbuffer with tuple/partial tuple (for huge tuples) offsets. Worker processes should grab large chunks of offsets from the offset ringbuffer. If the ringbuffer is not full, the worker chunks should be reduced in size.
>
> My concern here is that it's going to be hard to avoid processes going idle. If the leader does nothing at all once the ring buffer is full, it's wasting time that it could spend processing a chunk. But if it picks up a chunk, then it might not get around to refilling the buffer before other processes are idle with no work to do.

An idle process doesn't cost much. Processes that use CPU inefficiently however...

> Still, it might be the case that having the process that is reading the data also find the line endings is so fast that it makes no sense to split those two tasks. After all, whoever just read the data must have it in cache, and that helps a lot.

Yea. And if it's not fast enough to split lines, then we have a problem regardless of which process does the splitting.

Greetings,

Andres Freund
On Fri, Apr 10, 2020 at 2:26 PM Andres Freund <andres@anarazel.de> wrote: > > Still, it might be the case that having the process that is reading > > the data also find the line endings is so fast that it makes no sense > > to split those two tasks. After all, whoever just read the data must > > have it in cache, and that helps a lot. > > Yea. And if it's not fast enough to split lines, then we have a problem > regardless of which process does the splitting. Still, if the reader does the splitting, then you don't need as much IPC, right? The shared memory data structure is just a ring of bytes, and whoever reads from it is responsible for the rest. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi,

On 2020-04-13 14:13:46 -0400, Robert Haas wrote:
> On Fri, Apr 10, 2020 at 2:26 PM Andres Freund <andres@anarazel.de> wrote:
> > > Still, it might be the case that having the process that is reading the data also find the line endings is so fast that it makes no sense to split those two tasks. After all, whoever just read the data must have it in cache, and that helps a lot.
> >
> > Yea. And if it's not fast enough to split lines, then we have a problem regardless of which process does the splitting.
>
> Still, if the reader does the splitting, then you don't need as much IPC, right? The shared memory data structure is just a ring of bytes, and whoever reads from it is responsible for the rest.

I don't think so. If only one process does the splitting, the exclusively locked section is just popping off a bunch of offsets off the ring. And that could fairly easily be done with atomic ops (since what we need is basically a single-producer multiple-consumer queue, which can be done lock-free fairly easily). Whereas in the case of each process doing the splitting, the exclusively locked part is splitting along lines - which takes considerably longer than just popping off a few offsets.

Greetings,

Andres Freund
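The single-producer/multiple-consumer pop described above could be sketched with PostgreSQL's atomics roughly as follows. With only one producer advancing 'produced', a consumer needs nothing more than a compare-and-swap to pop a batch of offsets. The type and function names are hypothetical; the pg_atomic_* calls are the real atomics API.

    #include "postgres.h"
    #include "port/atomics.h"

    typedef struct PcOffsetRing
    {
        pg_atomic_uint64    produced;   /* advanced only by the reader process */
        pg_atomic_uint64    consumed;   /* advanced by workers via CAS */
        /* the offsets themselves live in a ring indexed by (pos % ring size) */
    } PcOffsetRing;

    /*
     * Try to claim up to 'want' consecutive offset-ring positions. Returns
     * the number claimed and sets *first to the first claimed position.
     */
    static uint32
    pc_claim_offsets(PcOffsetRing *ring, uint32 want, uint64 *first)
    {
        for (;;)
        {
            uint64      old = pg_atomic_read_u64(&ring->consumed);
            uint64      avail = pg_atomic_read_u64(&ring->produced) - old;
            uint64      take = Min(avail, (uint64) want);

            if (take == 0)
                return 0;       /* nothing published yet */
            if (pg_atomic_compare_exchange_u64(&ring->consumed, &old, old + take))
            {
                *first = old;
                return (uint32) take;
            }
            /* CAS failed: another worker claimed first; retry. */
        }
    }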
On Mon, Apr 13, 2020 at 4:16 PM Andres Freund <andres@anarazel.de> wrote: > I don't think so. If only one process does the splitting, the > exclusively locked section is just popping off a bunch of offsets of the > ring. And that could fairly easily be done with atomic ops (since what > we need is basically a single producer multiple consumer queue, which > can be done lock free fairly easily ). Whereas in the case of each > process doing the splitting, the exclusively locked part is splitting > along lines - which takes considerably longer than just popping off a > few offsets. Hmm, that does seem believable. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hello,

I was going through some literature on parsing CSV files in a fully parallelized way and found (from [1]) an interesting approach implemented in the open-source project ParaText [2]. The algorithm follows a two-phase approach: the first pass identifies the adjusted chunks in parallel by exploiting the simplicity of CSV formats, and the second phase processes the complete records within each adjusted chunk by one of the available workers. Here is the sketch:

1. Each worker scans a distinct fixed-sized chunk of the CSV file and collects the following three stats from the chunk:
a) number of quotes
b) position of the first new line after an even number of quotes
c) position of the first new line after an odd number of quotes
2. Once the stats from all the chunks are collected, the leader identifies the adjusted chunk boundaries by iterating over the stats linearly:
- For the k-th chunk, the leader adds up the number of quotes in the first k-1 chunks.
- If that number is even, then the k-th chunk does not start in the middle of a quoted field, and the first newline after an even number of quotes (the second collected stat) is the first record delimiter in this chunk.
- Otherwise, if the number is odd, the first newline after an odd number of quotes (the third collected stat) is the first record delimiter.
- The end position of the adjusted chunk is obtained from the starting position of the next adjusted chunk.
3. Once the boundaries of the chunks are determined (forming adjusted chunks), an individual worker may take up one adjusted chunk and process its tuples independently.

Although this approach parses the CSV in parallel, it requires two scans of the CSV file. So, given a system with a spinning hard disk and small RAM, as per my understanding, the algorithm will perform very poorly. But if we use this algorithm to parse a CSV file on a multi-core system with a large RAM, the performance might improve significantly [1].

Hence, I was trying to think whether we can leverage this idea for implementing parallel COPY in PG. We can design an algorithm similar to parallel hash-join where the workers pass through different phases:
1. Phase 1 - Read fixed-size chunks in parallel, and store the chunks and the small stats about each chunk in shared memory. If the shared memory is full, go to phase 2.
2. Phase 2 - Allow a single worker to process the stats and decide the actual chunk boundaries so that no tuple spans two different chunks. Go to phase 3.
3. Phase 3 - Each worker picks one adjusted chunk, then parses and processes the tuples from it. Once done with one chunk, it picks the next one, and so on.
4. If there are still some unread contents, go back to phase 1.
We can probably use separate workers for phase 1 and phase 3 so that they can work concurrently.

Advantages:
1. Each worker spends a significant amount of time in each phase and gets the benefit of the instruction cache - at least in phase 1.
2. It also has the same advantage as parallel hash join - fast workers get to do more work.
3. We can extend this solution for reading data from STDIN. Of course, phase 1 and phase 2 must then be performed by the leader process, which can read from the socket.

Disadvantages:
1. Surely doesn't work if we don't have enough shared memory.
2. Probably, this approach is just impractical for PG due to certain limitations.

Thoughts?

[1] https://www.microsoft.com/en-us/research/uploads/prod/2019/04/chunker-sigmod19.pdf
[2] ParaText. https://github.com/wiseio/paratext.
--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
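[To make the first pass above concrete, here is a minimal sketch in C of the per-chunk statistics collection. The ChunkStats layout and function name are invented for illustration, and it assumes quotes are escaped by doubling them (standard CSV), which is what keeps the parity trick valid - backslash-style escaping breaks it, as noted in the reply below:]

    #include <stddef.h>
    #include <stdint.h>

    /* Per-chunk statistics for the first parallel pass (names invented). */
    typedef struct ChunkStats
    {
        uint64_t n_quotes;       /* total quote chars in the chunk */
        int64_t  first_nl_even;  /* offset of first newline seen after an
                                  * even number of quotes, or -1 */
        int64_t  first_nl_odd;   /* same, after an odd number of quotes */
    } ChunkStats;

    static void
    collect_chunk_stats(const char *chunk, size_t len, ChunkStats *st)
    {
        st->n_quotes = 0;
        st->first_nl_even = -1;
        st->first_nl_odd = -1;

        for (size_t i = 0; i < len; i++)
        {
            if (chunk[i] == '"')
                st->n_quotes++;
            else if (chunk[i] == '\n')
            {
                if ((st->n_quotes & 1) == 0 && st->first_nl_even < 0)
                    st->first_nl_even = (int64_t) i;
                else if ((st->n_quotes & 1) == 1 && st->first_nl_odd < 0)
                    st->first_nl_odd = (int64_t) i;
            }
        }
    }

[The leader's step 2 is then just the linear pass described above: sum n_quotes over chunks 1..k-1 and pick first_nl_even or first_nl_odd for chunk k accordingly.]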
On Tue, 14 Apr 2020 at 22:40, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> 1. Each worker scans a distinct fixed-size chunk of the CSV file and
> collects the following three stats from the chunk:
> a) number of quotes
> b) position of the first newline after an even number of quotes
> c) position of the first newline after an odd number of quotes
> 2. Once stats from all the chunks are collected, the leader identifies
> the adjusted chunk boundaries by iterating over the stats linearly:
> - For the k-th chunk, the leader adds up the number of quotes in the
> first k-1 chunks.
> - If the number is even, then the k-th chunk does not start in the
> middle of a quoted field, and the first newline after an even number
> of quotes (the second collected stat) is the first record delimiter
> in this chunk.
> - Otherwise, if the number is odd, the first newline after an odd
> number of quotes (the third collected stat) is the first record
> delimiter.
> - The end position of the adjusted chunk is obtained from the
> starting position of the next adjusted chunk.

The trouble is that, at least with the current coding, the number of
quotes in a chunk can depend on whether the chunk started in a quote
or not. That's because escape characters only count inside quotes. See
for example the following csv:

foo,\"bar
baz",\"xyz"

This currently parses as one line, and the number of parsed quotes
doesn't change if you add a quote in front.

But the general approach of doing the tokenization in parallel and
then a serial pass over the tokenization would still work. The quote
counting and newline finding just have to be done for both the
starting-in-quote and the not-starting-in-quote cases.

Using phases doesn't look like the correct approach - the tokenization
can be prepared just in time for the serial pass, and processing of a
chunk can proceed immediately after. This could all be done by having
the data in a single ringbuffer with a processing pipeline where one
process does the reading, then workers grab tokenization chunks as
they become available, then one process handles determining the chunk
boundaries, after which the chunks are processed.

But I still don't think this is something to worry about for the first
version. Just a better line splitting algorithm should go a looong way
in feeding a large number of workers, even when inserting into an
unindexed, unlogged table. If we get the SIMD line splitting in, it
will be enough to overwhelm most I/O subsystems available today.

Regards,
Ants Aasma
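[To illustrate the dual-state scan Ants suggests: each chunk is scanned under both assumptions about its starting state, and the serial pass later selects the right result by chaining chunk k-1's ending state into chunk k. This is a sketch with invented names; escaping is reduced to backslash-inside-quotes, and an escape character landing exactly on a chunk boundary would need an extra carried bit that is ignored here:]

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct SplitStats
    {
        int64_t first_rec_delim;    /* first top-level newline, or -1 */
        bool    ends_in_quote;      /* scanner state at end of the chunk */
    } SplitStats;

    /*
     * Scan one chunk under both starting assumptions: st[0] assumes the
     * chunk starts outside a quoted field, st[1] assumes inside one.
     * Written as two passes for clarity; they are easy to fuse into one.
     */
    static void
    scan_chunk_both(const char *chunk, size_t len, SplitStats st[2])
    {
        for (int s = 0; s < 2; s++)
        {
            bool in_quote = (s == 1);
            bool last_was_esc = false;

            st[s].first_rec_delim = -1;
            for (size_t i = 0; i < len; i++)
            {
                char c = chunk[i];

                if (last_was_esc)
                    last_was_esc = false;       /* escaped char: no effect */
                else if (in_quote && c == '\\')
                    last_was_esc = true;        /* escapes count only in quotes */
                else if (c == '"')
                    in_quote = !in_quote;
                else if (c == '\n' && !in_quote && st[s].first_rec_delim < 0)
                    st[s].first_rec_delim = (int64_t) i;
            }
            st[s].ends_in_quote = in_quote;
        }
    }

[The serial pass then walks the chunks in order: if chunk k's true starting state is "outside a quote", its result is st[0] and chunk k+1's starting state is st[0].ends_in_quote, and correspondingly for st[1].]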
On Mon, 13 Apr 2020 at 23:16, Andres Freund <andres@anarazel.de> wrote:
> > Still, if the reader does the splitting, then you don't need as much
> > IPC, right? The shared memory data structure is just a ring of bytes,
> > and whoever reads from it is responsible for the rest.
>
> I don't think so. If only one process does the splitting, the
> exclusively locked section is just popping off a bunch of offsets of the
> ring. And that could fairly easily be done with atomic ops (since what
> we need is basically a single producer multiple consumer queue, which
> can be done lock free fairly easily). Whereas in the case of each
> process doing the splitting, the exclusively locked part is splitting
> along lines - which takes considerably longer than just popping off a
> few offsets.

I see the benefit of having one process responsible for splitting as
being able to run ahead of the workers to queue up work when many of
them need new data at the same time.

I don't think the locking benefits of a ring are important in this
case. At the current, rather conservative chunk sizes we are looking
at ~100k chunks per second at best, so normal locking should be
perfectly adequate. And the chunk size can easily be increased. I see
the main value in it being simple.

But there is a point that having a layer of indirection instead of a
linear buffer allows some workers to fall behind - either because the
kernel scheduled them out for a time slice, or because they need to do
I/O, or because inserting some tuple hit a unique conflict and needs
to wait for a transaction to complete or abort to resolve it. With a
ring buffer, reading has to wait on the slowest worker still reading
its chunk. Having workers copy the data to a local buffer as the first
step would reduce the probability of hitting any issues. But still, at
GB/s rates, hiding a 10ms timeslice of delay would need tens of
megabytes of buffer (at 1 GB/s, a 10 ms stall covers 10 MB of input).

FWIW, I think just increasing the buffer is good enough - the CPUs
processing this workload are likely to have tens to hundreds of
megabytes of cache on board.
On Wed, Apr 15, 2020 at 1:10 AM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
>
> Hence, I was trying to think whether we can leverage this idea for
> implementing parallel COPY in PG. We can design an algorithm similar
> to parallel hash-join where the workers pass through different phases.
> 1. Phase 1 - Read fixed-size chunks in parallel, storing the chunks
> and the small stats about each chunk in shared memory. If the shared
> memory is full, go to phase 2.
> 2. Phase 2 - Allow a single worker to process the stats and decide
> the actual chunk boundaries so that no tuple spans two different
> chunks. Go to phase 3.
>
> 3. Phase 3 - Each worker picks one adjusted chunk and parses and
> processes tuples from it. Once done with one chunk, it picks the next
> one, and so on.
>
> 4. If there are still some unread contents, go back to phase 1.
>
> We can probably use separate workers for phase 1 and phase 3 so that
> they can work concurrently.
>
> Advantages:
> 1. Each worker spends some significant time in each phase. Gets the
> benefit of the instruction cache - at least in phase 1.
> 2. It also has the same advantage as parallel hash join - fast
> workers get to do more work.
> 3. We can extend this solution for reading data from STDIN. Of
> course, phases 1 and 2 must then be performed by the leader process,
> which can read from the socket.
>
> Disadvantages:
> 1. Surely doesn't work if we don't have enough shared memory.
> 2. Probably, this approach is just impractical for PG due to certain
> limitations.

As I understand this, it needs to parse the lines twice (the second
time in phase 3), and till the first two phases are over, we can't
start the tuple processing work which is done in phase 3. So even if
the tokenization is done a bit faster, we will lose some time on
processing the tuples, which might not be an overall win; in fact, it
can be worse compared to the single reader approach being discussed.
Now, if the work done in tokenization were a major (or significant)
portion of the copy, then thinking of such a technique might be
useful, but that is not the case, as seen in the data shared above in
this thread (the tokenize time is very small compared to the data
processing time).

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Apr 15, 2020 at 7:15 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> As I understand this, it needs to parse the lines twice (the second
> time in phase 3), and till the first two phases are over, we can't
> start the tuple processing work which is done in phase 3. So even if
> the tokenization is done a bit faster, we will lose some time on
> processing the tuples, which might not be an overall win; in fact, it
> can be worse compared to the single reader approach being discussed.
> Now, if the work done in tokenization were a major (or significant)
> portion of the copy, then thinking of such a technique might be
> useful, but that is not the case, as seen in the data shared above in
> this thread (the tokenize time is very small compared to the data
> processing time).

It seems to me that a good first step here might be to forget about
parallelism for a minute and just write a patch to make the line
splitting as fast as possible.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Apr 15, 2020 at 2:15 PM Ants Aasma <ants@cybertec.at> wrote:
>
> On Tue, 14 Apr 2020 at 22:40, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> > 1. Each worker scans a distinct fixed-size chunk of the CSV file and
> > collects the following three stats from the chunk:
> > a) number of quotes
> > b) position of the first newline after an even number of quotes
> > c) position of the first newline after an odd number of quotes
> > 2. Once stats from all the chunks are collected, the leader identifies
> > the adjusted chunk boundaries by iterating over the stats linearly:
> > - For the k-th chunk, the leader adds up the number of quotes in the
> > first k-1 chunks.
> > - If the number is even, then the k-th chunk does not start in the
> > middle of a quoted field, and the first newline after an even number
> > of quotes (the second collected stat) is the first record delimiter
> > in this chunk.
> > - Otherwise, if the number is odd, the first newline after an odd
> > number of quotes (the third collected stat) is the first record
> > delimiter.
> > - The end position of the adjusted chunk is obtained from the
> > starting position of the next adjusted chunk.
>
> The trouble is that, at least with the current coding, the number of
> quotes in a chunk can depend on whether the chunk started in a quote
> or not. That's because escape characters only count inside quotes. See
> for example the following csv:
>
> foo,\"bar
> baz",\"xyz"
>
> This currently parses as one line, and the number of parsed quotes
> doesn't change if you add a quote in front.
>
> But the general approach of doing the tokenization in parallel and
> then a serial pass over the tokenization would still work. The quote
> counting and newline finding just have to be done for both the
> starting-in-quote and the not-starting-in-quote cases.

Yeah, right.

> Using phases doesn't look like the correct approach - the tokenization
> can be prepared just in time for the serial pass, and processing of a
> chunk can proceed immediately after. This could all be done by having
> the data in a single ringbuffer with a processing pipeline where one
> process does the reading, then workers grab tokenization chunks as
> they become available, then one process handles determining the chunk
> boundaries, after which the chunks are processed.

I was thinking from this point of view - the sooner we introduce
parallelism in the process, the greater the benefits. Probably there
isn't any way to avoid a single serial pass over the data (phase 2 in
the above case) to determine the chunk boundaries. So yeah, if the
reading and tokenisation phase doesn't take much time, parallelising
it would just be overkill. As pointed out by Andres and you, using a
lock-free circular buffer implementation sounds like the way to go
forward. AFAIK, a CAS-based FIFO circular queue implementation suffers
from two problems:
1. (as pointed out by you) slow workers may block producers;
2. since it doesn't partition the queue among the workers, it doesn't
achieve good locality and cache-friendliness, which limits scalability
on NUMA systems.

> But I still don't think this is something to worry about for the first
> version. Just a better line splitting algorithm should go a looong way
> in feeding a large number of workers, even when inserting into an
> unindexed, unlogged table. If we get the SIMD line splitting in, it
> will be enough to overwhelm most I/O subsystems available today.

Yeah. Parsing text is a great use case for data parallelism, which
can be achieved by SIMD instructions.
Consider processing the 8-bit ASCII characters 64 at a time in a
512-bit SIMD word. A lot of code and complexity from CopyReadLineText
would surely go away. And further (I'm not sure on this point), if we
can use the schema of the table, perhaps JIT can generate machine code
to read the fields efficiently based on their types.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
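[As a rough illustration of that idea (not code from any patch in this thread): with AVX-512BW, a single byte-compare produces a 64-bit mask of newline positions per 64-byte block, which can then be walked with bit tricks. The sketch assumes len is a multiple of 64, ignores quoting rules entirely, and uses a GCC/Clang builtin; a real version would need a scalar tail and runtime CPU dispatch:]

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Report each newline position in buf via the caller's callback. */
    static void
    find_newlines_avx512(const char *buf, size_t len,
                         void (*emit)(size_t pos, void *arg), void *arg)
    {
        const __m512i nl = _mm512_set1_epi8('\n');

        for (size_t off = 0; off < len; off += 64)
        {
            __m512i   block = _mm512_loadu_si512(buf + off);
            __mmask64 hits  = _mm512_cmpeq_epi8_mask(block, nl);

            while (hits)
            {
                emit(off + (size_t) __builtin_ctzll(hits), arg);
                hits &= hits - 1;   /* clear lowest set bit */
            }
        }
    }

[Counting quotes per block for the parity scheme works the same way: compare against '"' and popcount the resulting mask.]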
On 2020-04-15 10:12:14 -0400, Robert Haas wrote:
> On Wed, Apr 15, 2020 at 7:15 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > As I understand this, it needs to parse the lines twice (the second
> > time in phase 3), and till the first two phases are over, we can't
> > start the tuple processing work which is done in phase 3. So even if
> > the tokenization is done a bit faster, we will lose some time on
> > processing the tuples, which might not be an overall win; in fact, it
> > can be worse compared to the single reader approach being discussed.
> > Now, if the work done in tokenization were a major (or significant)
> > portion of the copy, then thinking of such a technique might be
> > useful, but that is not the case, as seen in the data shared above in
> > this thread (the tokenize time is very small compared to the data
> > processing time).
>
> It seems to me that a good first step here might be to forget about
> parallelism for a minute and just write a patch to make the line
> splitting as fast as possible.

+1

Compared to all the rest of the effort during COPY, a fast "split
rows" implementation should not be a bottleneck anymore.
Hi,

On 2020-04-15 20:36:39 +0530, Kuntal Ghosh wrote:
> I was thinking from this point of view - the sooner we introduce
> parallelism in the process, the greater the benefits.

I don't really agree. Sure, that's true from a theoretical
perspective, but the incremental gains may be very small, and the cost
in complexity very high. If we can get single-threaded splitting of
rows to be >4GB/s, which should very well be attainable, the rest of
the COPY work is going to dominate the time. We shouldn't add
complexity to parallelize more of the line splitting, care too much
about scalable datastructures, etc., when after some straightforward
optimization the bottleneck will usually still be in the
already-parallelized part.

I'd expect that for now we'd likely hit scalability issues in other
parts of the system first (e.g. extension locks, buffer mapping).

Greetings,

Andres Freund
Hi,

On 2020-04-15 12:05:47 +0300, Ants Aasma wrote:
> I see the benefit of having one process responsible for splitting as
> being able to run ahead of the workers to queue up work when many of
> them need new data at the same time.

Yea, I agree.

> I don't think the locking benefits of a ring are important in this
> case. At current rather conservative chunk sizes we are looking at
> ~100k chunks per second at best, normal locking should be perfectly
> adequate. And chunk size can easily be increased. I see the main value
> in it being simple.

I think the locking benefit of not needing to hold a lock *while*
splitting (as we'd need in some proposals floated earlier) is likely
to already be beneficial. I don't think we need to worry about lock
scalability protecting the queue of already-split data, for now.

I don't think we really want to have a much larger chunk size, btw.
That makes it more likely for the data handed to different workers to
take an uneven amount of time to process.

> But there is a point that having a layer of indirection instead of a
> linear buffer allows for some workers to fall behind.

Yea. It'd probably make sense to read the input data into an array of
evenly sized blocks, and have the datastructure (still think a
ringbuffer makes sense) of split boundaries point into those entries.
If we don't require the input blocks to be in-order in that array, we
can reuse blocks therein that are fully processed, even if "earlier"
data in the input has not yet been fully processed.

> With a ring buffer reading has to wait on the slowest worker reading
> its chunk.

To be clear, I was only thinking of using a ringbuffer to indicate
split boundaries, and workers would just pop entries from it before
they actually process the data (stored outside of the ringbuffer).
Since the split boundaries will always be read in order by workers,
and the entries will be tiny, there's no need to avoid copying out
entries.

So basically what I was thinking we *eventually* may want (I'd forgo
some of this initially) is something like:

struct InputBlock
{
    uint32 unprocessed_chunk_parts;
    uint32 following_block;
    char   data[INPUT_BLOCK_SIZE];
};

// array of input data, with > 2*nworkers entries
InputBlock *input_blocks;

struct ChunkedInputBoundary
{
    uint32 firstblock;
    uint32 startoff;
};

struct ChunkedInputBoundaries
{
    uint32 read_pos;
    uint32 write_end;
    ChunkedInputBoundary ring[RINGSIZE];
};

Where the leader would read data into InputBlocks with
unprocessed_chunk_parts == 0. Then it'd split the read input data into
chunks (presumably with chunk size << input block size), putting
identified chunks into ChunkedInputBoundaries. For each
ChunkedInputBoundary it'd increment the unprocessed_chunk_parts of
each InputBlock containing parts of the chunk. For chunks spanning
more than one InputBlock, each InputBlock's following_block would be
set accordingly.

Workers would just pop an entry from the ringbuffer (making that entry
reusable), and process the chunk. The underlying data would not be
copied out of the InputBlocks, but obviously readers would need to
take care to handle InputBlock boundaries. Whenever a chunk is fully
read, or when crossing an InputBlock boundary, the InputBlock's
unprocessed_chunk_parts would be decremented.

Recycling of InputBlocks could probably just be an occasional linear
search for buffers with unprocessed_chunk_parts == 0.

Something roughly like this should not be too complicated to implement.
Unless extremely unlucky (very wide input data spanning many
InputBlocks), a straggling reader would not prevent global progress;
it'd just prevent reuse of the InputBlocks holding data for its chunk
(normally that'd be two InputBlocks, not more).

> Having workers copy the data to a local buffer as the first
> step would reduce the probability of hitting any issues. But still, at
> GB/s rates, hiding a 10ms timeslice of delay would need 10's of
> megabytes of buffer.

Yea. Given the likelihood of blocking on resources (reading in index
data, writing out dirty buffers for reclaim, row locks for uniqueness
checks, extension locks, ...), as well as non-uniform per-row costs
(partial indexes, index splits, ...), I think we ought to try to cope
well with that. IMO/IME it'll be common to see stalls that are much
longer than 10ms for processes that do COPY, even when the system is
not overloaded.

> FWIW. I think just increasing the buffer is good enough - the CPUs
> processing this workload are likely to have tens to hundreds of
> megabytes of cache on board.

It'll not necessarily be a cache shared between leader / workers
though, and some of the cache-to-cache transfers will be more
expensive even within a socket (between core complexes for AMD,
multi-chip processors for Intel).

Greetings,

Andres Freund
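[To make the consumer side of this design concrete, here is one way a worker step might look under the structs above. Everything here is an assumption layered on the mail: unprocessed_chunk_parts is taken to be a pg_atomic_uint32 (port/atomics.h), the claim of read_pos is shown as a plain increment where a real version needs a lock or CAS, and process_chunk_bytes() is an invented stand-in for the actual tuple-forming work:]

    /* assumes the InputBlock / ChunkedInputBoundaries definitions above,
     * with unprocessed_chunk_parts declared as pg_atomic_uint32 */

    static void
    worker_consume_one(ChunkedInputBoundaries *bounds, InputBlock *blocks)
    {
        ChunkedInputBoundary b;
        uint32      blkno;
        uint32      off;

        /* entries are tiny and consumed in order, so copy one out */
        b = bounds->ring[bounds->read_pos % RINGSIZE];
        bounds->read_pos++;             /* really an atomic/locked claim */

        blkno = b.firstblock;
        off = b.startoff;

        for (;;)
        {
            InputBlock *blk = &blocks[blkno];
            bool        chunk_done;

            /* parse/insert the part of this chunk living in this block */
            chunk_done = process_chunk_bytes(blk->data + off,
                                             INPUT_BLOCK_SIZE - off);

            /* release our reference so the leader can recycle the block */
            pg_atomic_sub_fetch_u32(&blk->unprocessed_chunk_parts, 1);

            if (chunk_done)
                break;
            blkno = blk->following_block;   /* chunk continues elsewhere */
            off = 0;
        }
    }

[Recycling then matches the mail's description: the leader reuses any InputBlock whose counter has dropped back to zero.]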
On Wed, Apr 15, 2020 at 10:45 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-04-15 20:36:39 +0530, Kuntal Ghosh wrote:
> > I was thinking from this point of view - the sooner we introduce
> > parallelism in the process, the greater the benefits.
>
> I don't really agree. Sure, that's true from a theoretical
> perspective, but the incremental gains may be very small, and the cost
> in complexity very high. If we can get single-threaded splitting of
> rows to be >4GB/s, which should very well be attainable, the rest of
> the COPY work is going to dominate the time. We shouldn't add
> complexity to parallelize more of the line splitting, care too much
> about scalable datastructures, etc., when after some straightforward
> optimization the bottleneck will usually still be in the
> already-parallelized part.
>
> I'd expect that for now we'd likely hit scalability issues in other
> parts of the system first (e.g. extension locks, buffer mapping).

Got your point. In this particular case, a single producer is fast
enough (or probably we can make it fast enough) to generate enough
chunks for multiple consumers so that they don't sit idle waiting for
work.

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
On Wed, Apr 15, 2020 at 11:49 PM Andres Freund <andres@anarazel.de> wrote:
>
> To be clear, I was only thinking of using a ringbuffer to indicate
> split boundaries, and workers would just pop entries from it before
> they actually process the data (stored outside of the ringbuffer).
> Since the split boundaries will always be read in order by workers,
> and the entries will be tiny, there's no need to avoid copying out
> entries.

I think the binary mode processing will be slightly different because,
unlike the text and csv formats, the data is stored in (length, value)
format for each column and there are no line markers. I don't think
there will be a big difference, but still, we need to keep track
somewhere of the format of the data in the ring buffers. Basically, we
can copy the data in (length, value) format, and once the workers know
about the format, they will parse the data appropriately. We currently
also have a different way of parsing the binary format; see
NextCopyFrom. I think we need to be careful about avoiding duplicate
work as much as possible.

Apart from this, we have analyzed the other cases, as mentioned below,
where we need to decide whether we can allow parallelism for the copy
command.

Case-1:
Do we want to enable parallelism for a copy when transition tables are
involved? Basically, during the copy, we capture tuples in transition
tables for certain cases, like when an after statement trigger
accesses the same relation on which we have the trigger. See the
example below [1]. We decide this in the function
MakeTransitionCaptureState. For such cases, we collect minimal tuples
in the tuple store after processing them so that later after statement
triggers can access them. Now, if we want to enable parallelism for
such cases, we instead need to store and access tuples from a shared
tuple store (sharedtuplestore.c/sharedtuplestore.h). However, it
doesn't have the facility to store tuples in memory, so we would
always need to store and access them from a file, which could be
costly, unless we also add a way to store minimal tuples in shared
memory up to work_mem and only then spill to the shared tuple store.
It is possible to do all or part of this work to enable parallel copy
for such cases, but I am not sure if it is worth it. We can decide not
to enable parallelism for such cases and later allow it if we see
demand; that will also help us avoid additional work/complexity in the
first version of the patch.

Case-2:
The single insertion mode (CIM_SINGLE) is used in various scenarios,
and whether we can allow parallelism for those depends on the case, as
discussed below:

a. When there are BEFORE/INSTEAD OF triggers on the table. We don't
allow multi-inserts in such cases because such triggers might query
the table we're inserting into and act differently if the tuples that
have already been processed and prepared for insertion are not there.
Now, if we allow parallelism with such triggers, the behavior would
depend on whether a parallel worker has already inserted that
particular row or not. I guess such functions should ideally be marked
as parallel-unsafe. So, in short, whether to allow parallelism in this
case depends on the parallel-safety marking of the trigger function.

b. For partitioned tables, we can't support multi-inserts when there
are any statement-level insert triggers. This is because, as of now,
we expect that any before row insert and statement-level insert
triggers are on the same relation.
Now, there is no harm in allowing parallelism for such cases, but it
depends upon whether we have the infrastructure (basically, allowing
tuples to be collected in a shared tuple store) to support
statement-level insert triggers.

c. For inserts into foreign tables. We can't allow parallelism in this
case because each worker would need to establish an FDW connection and
operate in a separate transaction. Unless we have the capability to
provide a two-phase commit protocol for "Transactions involving
multiple postgres foreign servers" (which is being discussed in a
separate thread [2]), we can't allow this.

d. If there are volatile default expressions or the WHERE clause
contains a volatile expression. Here, we can check whether the
expression is parallel-safe, and if so, allow parallelism.

Case-3:
In the copy command, for performing foreign key checks, we take a KEY
SHARE lock on the primary key table rows, which in turn will increment
the command counter and update the snapshot. Now, as we share the
snapshot at the beginning of the command, we can't allow it to be
changed later. So, unless we do something special for it, I think we
can't allow parallelism in such cases.

I couldn't think of many problems if we allow parallelism in such
cases. One inconsistency, if we allow FK checks via workers, would be
that at the end of COPY the value of the command counter will not be
what we expect, as we wouldn't have accounted for the increments from
the workers. Now, if COPY is being done inside a transaction, the
subsequent commands would not be assigned the correct values. Also,
for executing deferred triggers, we use the transaction snapshot, so
if anything is changed in the snapshot via parallel workers, ideally
the changed snapshot should have been synced back from the workers.
Another concern could be that different workers can try to acquire a
KEY SHARE lock on the same tuples, which they will be able to acquire
due to group locking or otherwise, but I don't see any problem with
it. I am not sure if any of the above leads to a user-visible problem,
but I might be missing something. I think if we can think of any real
problems, we can try to design a better solution to address them.

Case-4:
For deferred triggers, it seems we record CTIDs of tuples (via
ExecARInsertTriggers->AfterTriggerSaveEvent) and then execute the
deferred triggers at transaction end using AfterTriggerFireDeferred,
or at the end of the statement. The challenge in allowing parallelism
for such cases is that we need to capture the CTID events in shared
memory. For that, we would either need to invent new infrastructure
for event capturing in shared memory, which would be a huge task on
its own, or get the CTIDs to the leader via shared memory and have the
leader add them to the event queues - but then we need to ensure the
order of the CTIDs (basically, it should be the same order in which we
have processed them).

[1] -
create or replace function dump_insert() returns trigger language plpgsql as
$$
  begin
    raise notice 'trigger = %, new table = %',
                 TG_NAME,
                 (select string_agg(new_table::text, ', ' order by a)
                  from new_table);
    return null;
  end;
$$;
create table test (a int);
create trigger trg1_test after insert on test referencing new table as
new_table for each statement execute procedure dump_insert();
copy test (a) from stdin;
1
2
3
\.

[2] - https://www.postgresql.org/message-id/20191206.173215.1818665441859410805.horikyota.ntt%40gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
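[For context on Case-3: the command counter increments come from the row lock the RI trigger takes for each inserted FK row. Schematically - with illustrative table and column names, mirroring the shape of the query ri_triggers.c builds - the check it runs against the PK table is:]

    -- what the FK check for one inserted row effectively executes
    SELECT 1 FROM ONLY pk_table x
        WHERE refindex = $1      -- the inserted row's FK value
        FOR KEY SHARE OF x;

[It is this FOR KEY SHARE row lock, taken once per inserted row, that bumps the command counter under discussion.]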
I wonder why you're still looking at this instead of looking at just
speeding up the current code, especially the line splitting, per the
previous discussion, and then coming back to study this issue more
after that's done.

On Mon, May 11, 2020 at 8:12 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> Apart from this, we have analyzed the other cases, as mentioned below,
> where we need to decide whether we can allow parallelism for the copy
> command.
> Case-1:
> Do we want to enable parallelism for a copy when transition tables are
> involved?

I think it would be OK not to support this.

> Case-2:
> a. When there are BEFORE/INSTEAD OF triggers on the table.
> b. For partitioned tables, we can't support multi-inserts when there
> are any statement-level insert triggers.
> c. For inserts into foreign tables.
> d. If there are volatile default expressions or the WHERE clause
> contains a volatile expression. Here, we can check whether the
> expression is parallel-safe, and if so, allow parallelism.

This all sounds fine.

> Case-3:
> In the copy command, for performing foreign key checks, we take a KEY
> SHARE lock on the primary key table rows, which in turn will increment
> the command counter and update the snapshot. Now, as we share the
> snapshot at the beginning of the command, we can't allow it to be
> changed later. So, unless we do something special for it, I think we
> can't allow parallelism in such cases.

This sounds like much more of a problem to me; it'd be a significant
restriction that would kick in in routine cases where the user isn't
doing anything particularly exciting. The command counter presumably
only needs to be updated once per command, so maybe we could do that
before we start parallelism. However, I think we would need to have
some kind of dynamic shared memory structure to which new combo CIDs
can be added by any member of the group, and then discovered by other
members of the group later. At the end of the parallel operation, the
leader must discover any combo CIDs added by others to that table
before destroying it, even if it has no immediate use for the
information. We can't allow a situation where the group members have
inconsistent notions of which combo CIDs exist or what their mappings
are, and if KEY SHARE locks are being taken, new combo CIDs could be
created.

> Case-4:
> For deferred triggers, it seems we record CTIDs of tuples (via
> ExecARInsertTriggers->AfterTriggerSaveEvent) and then execute the
> deferred triggers at transaction end using AfterTriggerFireDeferred,
> or at the end of the statement.

I think this could be left for the future.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, May 11, 2020 at 11:52 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> I wonder why you're still looking at this instead of looking at just
> speeding up the current code, especially the line splitting,

Because the line splitting is just 1-2% of the overall work in common
cases. See the data shared by Vignesh for various workloads [1]. The
time it takes is approximately in the range of 0.5-12%, and for cases
like a table with a few indexes, it is not more than 1-2%.

[1] - https://www.postgresql.org/message-id/CALDaNm3r8cPsk0Vo_-6AXipTrVwd0o9U2S0nCmRdku1Dn-Tpqg%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, May 11, 2020 at 11:52 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> > Case-3:
> > In the copy command, for performing foreign key checks, we take a KEY
> > SHARE lock on the primary key table rows, which in turn will
> > increment the command counter and update the snapshot. Now, as we
> > share the snapshot at the beginning of the command, we can't allow it
> > to be changed later. So, unless we do something special for it, I
> > think we can't allow parallelism in such cases.
>
> This sounds like much more of a problem to me; it'd be a significant
> restriction that would kick in in routine cases where the user isn't
> doing anything particularly exciting. The command counter presumably
> only needs to be updated once per command, so maybe we could do that
> before we start parallelism. However, I think we would need to have
> some kind of dynamic shared memory structure to which new combo CIDs
> can be added by any member of the group, and then discovered by other
> members of the group later. At the end of the parallel operation, the
> leader must discover any combo CIDs added by others to that table
> before destroying it, even if it has no immediate use for the
> information. We can't allow a situation where the group members have
> inconsistent notions of which combo CIDs exist or what their mappings
> are, and if KEY SHARE locks are being taken, new combo CIDs could be
> created.

AFAIU, we don't generate combo CIDs for this case. See the below code
in heap_lock_tuple():

/*
 * Store transaction information of xact locking the tuple.
 *
 * Note: Cmax is meaningless in this context, so don't set it; this avoids
 * possibly generating a useless combo CID.  Moreover, if we're locking a
 * previously updated tuple, it's important to preserve the Cmax.
 *
 * Also reset the HOT UPDATE bit, but only if there's no update; otherwise
 * we would break the HOT chain.
 */
tuple->t_data->t_infomask &= ~HEAP_XMAX_BITS;
tuple->t_data->t_infomask2 &= ~HEAP_KEYS_UPDATED;
tuple->t_data->t_infomask |= new_infomask;
tuple->t_data->t_infomask2 |= new_infomask2;

I don't understand why we need to do something special for combo CIDs
if they are not generated during this operation.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, May 12, 2020 at 1:01 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > I don't understand why we need to do something special for combo CIDs > if they are not generated during this operation? Hmm. Well I guess if they're not being generated then we don't need to do anything about them, but I still think we should try to work around having to disable parallelism for a table which is referenced by foreign keys. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, May 14, 2020 at 12:39 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, May 12, 2020 at 1:01 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > I don't understand why we need to do something special for combo CIDs
> > if they are not generated during this operation.
>
> Hmm. Well I guess if they're not being generated then we don't need to
> do anything about them, but I still think we should try to work around
> having to disable parallelism for a table which is referenced by
> foreign keys.

Okay, just to be clear, we want to allow parallelism for a table that
has foreign keys; basically, a parallel copy should work while loading
data into tables having FK references. To support that, we need to
consider a few things.

a. Currently, we increment the command counter each time we take a KEY
SHARE lock on a tuple during trigger execution. I am really not sure
whether this is required during copy command execution, or whether we
can just increment it once for the whole copy. If we need to increment
the command counter just once for the copy command, then for parallel
copy we can ensure that we do it just once at the end of the parallel
copy; if not, we might need some special handling.

b. Another point is that after inserting rows we record the CTIDs of
the tuples in the event queue, and then once all tuples are processed
we call the FK trigger for each CTID. Now, with parallelism, the FK
checks will be processed as soon as a worker has processed one chunk.
I don't see any problem with it, but still, this will be a bit
different from what we do in the serial case. Do you see any problem
with this?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, May 14, 2020 at 11:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Okay, just to be clear, we want to allow parallelism for a table that
> has foreign keys; basically, a parallel copy should work while loading
> data into tables having FK references. To support that, we need to
> consider a few things.
>
> a. Currently, we increment the command counter each time we take a KEY
> SHARE lock on a tuple during trigger execution. I am really not sure
> whether this is required during copy command execution, or whether we
> can just increment it once for the whole copy. If we need to increment
> the command counter just once for the copy command, then for parallel
> copy we can ensure that we do it just once at the end of the parallel
> copy; if not, we might need some special handling.
>
> b. Another point is that after inserting rows we record the CTIDs of
> the tuples in the event queue, and then once all tuples are processed
> we call the FK trigger for each CTID. Now, with parallelism, the FK
> checks will be processed as soon as a worker has processed one chunk.
> I don't see any problem with it, but still, this will be a bit
> different from what we do in the serial case. Do you see any problem
> with this?

IMHO, it should not be a problem, because without parallelism, too, we
trigger the foreign key check when we detect EOF / end of data from
STDIN. And with parallel workers, a worker will assume that it has
completed all its work and can go for the foreign key check only after
the leader receives EOF / end of data from STDIN. The only difference
is that each worker is not waiting for all the data (from all the
workers) to get inserted before checking the constraint. Moreover, we
are not supporting external triggers with the parallel copy;
otherwise, we might have to worry that those triggers could do
something on the primary table before we check the constraint. I am
not sure if there are any other factors that I am missing.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Thu, May 14, 2020 at 2:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > To support that, we need to consider a few things. > a. Currently, we increment the command counter each time we take a key > share lock on a tuple during trigger execution. I am really not sure > if this is required during Copy command execution or we can just > increment it once for the copy. If we need to increment the command > counter just once for copy command then for the parallel copy we can > ensure that we do it just once at the end of the parallel copy but if > not then we might need some special handling. My sense is that it would be a lot more sensible to do it at the *beginning* of the parallel operation. Once we do it once, we shouldn't ever do it again; that's how it works now. Deferring it until later seems much more likely to break things. > b. Another point is that after inserting rows we record CTIDs of the > tuples in the event queue and then once all tuples are processed we > call FK trigger for each CTID. Now, with parallelism, the FK checks > will be processed once the worker processed one chunk. I don't see > any problem with it but still, this will be a bit different from what > we do in serial case. Do you see any problem with this? I think there could be some problems here. For instance, suppose that there are two entries for different workers for the same CTID. If the leader were trying to do all the work, they'd be handled consecutively. If they were from completely unrelated processes, locking would serialize them. But group locking won't, so there you have an issue, I think. Also, it's not ideal from a work-distribution perspective: one worker could finish early and be unable to help the others. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, May 15, 2020 at 1:51 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, May 14, 2020 at 2:18 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > To support that, we need to consider a few things.
> > a. Currently, we increment the command counter each time we take a
> > KEY SHARE lock on a tuple during trigger execution. I am really not
> > sure whether this is required during copy command execution, or
> > whether we can just increment it once for the whole copy. If we need
> > to increment the command counter just once for the copy command, then
> > for parallel copy we can ensure that we do it just once at the end of
> > the parallel copy; if not, we might need some special handling.
>
> My sense is that it would be a lot more sensible to do it at the
> *beginning* of the parallel operation. Once we do it once, we
> shouldn't ever do it again; that's how it works now. Deferring it
> until later seems much more likely to break things.

AFAIU, we always increment the command counter after executing the
command. Why do we want to do it differently here?

> > b. Another point is that after inserting rows we record the CTIDs of
> > the tuples in the event queue, and then once all tuples are processed
> > we call the FK trigger for each CTID. Now, with parallelism, the FK
> > checks will be processed as soon as a worker has processed one chunk.
> > I don't see any problem with it, but still, this will be a bit
> > different from what we do in the serial case. Do you see any problem
> > with this?
>
> I think there could be some problems here. For instance, suppose that
> there are two entries for different workers for the same CTID.

First, let me clarify: the CTIDs I have used in my email are for the
table into which the insertion is happening, which means the FK table.
So, in such a case, we can't have the same CTIDs queued for different
workers. Basically, we use the CTID to fetch the row from the FK table
later and form a query to lock (in KEY SHARE mode) the corresponding
tuple in the PK table. Now, it is possible that two different workers
try to lock the same row of the PK table. I am not clear what problem
group locking can have in this case, because these are non-conflicting
locks. Can you please elaborate a bit more?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, May 15, 2020 at 12:19 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > My sense is that it would be a lot more sensible to do it at the > > *beginning* of the parallel operation. Once we do it once, we > > shouldn't ever do it again; that's how it works now. Deferring it > > until later seems much more likely to break things. > > AFAIU, we always increment the command counter after executing the > command. Why do we want to do it differently here? Hmm, now I'm starting to think that I'm confused about what is under discussion here. Which CommandCounterIncrement() are we talking about here? > First, let me clarify the CTID I have used in my email are for the > table in which insertion is happening which means FK table. So, in > such a case, we can't have the same CTIDs queued for different > workers. Basically, we use CTID to fetch the row from FK table later > and form a query to lock (in KEY SHARE mode) the corresponding tuple > in PK table. Now, it is possible that two different workers try to > lock the same row of PK table. I am not clear what problem group > locking can have in this case because these are non-conflicting locks. > Can you please elaborate a bit more? I'm concerned about two workers trying to take the same lock at the same time. If that's prevented by the buffer locking then I think it's OK, but if it's prevented by a heavyweight lock then it's not going to work in this case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, May 15, 2020 at 6:49 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, May 15, 2020 at 12:19 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > My sense is that it would be a lot more sensible to do it at the
> > > *beginning* of the parallel operation. Once we do it once, we
> > > shouldn't ever do it again; that's how it works now. Deferring it
> > > until later seems much more likely to break things.
> >
> > AFAIU, we always increment the command counter after executing the
> > command. Why do we want to do it differently here?
>
> Hmm, now I'm starting to think that I'm confused about what is under
> discussion here. Which CommandCounterIncrement() are we talking about
> here?

The one we do after executing a non-readonly command. Let me try to
explain by example:

CREATE TABLE tab_fk_referenced_chk(refindex INTEGER PRIMARY KEY,
    height real, weight real);
insert into tab_fk_referenced_chk values( 1, 1.1, 100);
CREATE TABLE tab_fk_referencing_chk(index INTEGER REFERENCES
    tab_fk_referenced_chk(refindex), height real, weight real);

COPY tab_fk_referencing_chk(index,height,weight) FROM stdin WITH(
    DELIMITER ',');
1,1.1,100
1,2.1,200
1,3.1,300
\.

In the above case, even though we are executing a single command from
the user's perspective, the currentCommandId will be four after the
command. One increment will be for the copy command, and the other
three increments are for locking the tuple in the PK table
(tab_fk_referenced_chk) for the three tuples in the FK table
(tab_fk_referencing_chk). Now, for parallel workers, it is
(theoretically) possible that the three tuples are processed by three
different workers, which don't get synced as of now. The question was:
do we see any kind of problem with this, and if so, can we just sync
it up at the end of parallelism?

> > First, let me clarify: the CTIDs I have used in my email are for the
> > table into which the insertion is happening, which means the FK
> > table. So, in such a case, we can't have the same CTIDs queued for
> > different workers. Basically, we use the CTID to fetch the row from
> > the FK table later and form a query to lock (in KEY SHARE mode) the
> > corresponding tuple in the PK table. Now, it is possible that two
> > different workers try to lock the same row of the PK table. I am not
> > clear what problem group locking can have in this case, because these
> > are non-conflicting locks. Can you please elaborate a bit more?
>
> I'm concerned about two workers trying to take the same lock at the
> same time. If that's prevented by the buffer locking then I think it's
> OK, but if it's prevented by a heavyweight lock then it's not going to
> work in this case.

We do take the buffer lock in exclusive mode before trying to acquire
a KEY SHARE lock on the tuple, so two workers shouldn't be able to try
to acquire it at the same time. I think you are trying to see whether
there is any case where two workers try to acquire a heavyweight lock,
like a tuple lock or something similar, to perform the operation; that
would create a problem because, due to group locking, an operation
would be allowed that should not have been. But I don't think anything
of that sort is feasible in the COPY operation, and if it is, then we
probably need to carefully block it or find some solution for it.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hi.
We have made a patch along the lines discussed in the previous mails. We could achieve up to a 9.87X performance improvement; the improvement varies from case to case.
Workers / exec time (seconds) | copy from file, 2 indexes on integer columns + 1 index on text column | copy from stdin, 2 indexes on integer columns + 1 index on text column | copy from file, 1 gist index on text column | copy from file, 3 indexes on integer columns | copy from stdin, 3 indexes on integer columns |
0 | 1162.772(1X) | 1176.035(1X) | 827.669(1X) | 216.171(1X) | 217.376(1X) |
1 | 1110.288(1.05X) | 1120.556(1.05X) | 747.384(1.11X) | 174.242(1.24X) | 163.492(1.33X) |
2 | 635.249(1.83X) | 668.18(1.76X) | 435.673(1.9X) | 133.829(1.61X) | 126.516(1.72X) |
4 | 336.835(3.45X) | 346.768(3.39X) | 236.406(3.5X) | 105.767(2.04X) | 107.382(2.02X) |
8 | 188.577(6.17X) | 194.491(6.04X) | 148.962(5.56X) | 100.708(2.15X) | 107.72(2.01X) |
16 | 126.819(9.17X) | 146.402(8.03X) | 119.923(6.9X) | 97.996(2.2X) | 106.531(2.04X) |
20 | 117.845(9.87X) | 149.203(7.88X) | 138.741(5.96X) | 97.94(2.21X) | 107.5(2.02X) |
30 | 127.554(9.11X) | 161.218(7.29X) | 172.443(4.8X) | 98.232(2.2X) | 108.778(1.99X) |
Posting the initial patch to get feedback.
Design of the Parallel Copy: The backend to which the "COPY FROM" query is submitted acts as the leader, with the responsibility of reading data from the file/stdin and launching at most n workers, as specified with the PARALLEL 'n' option in the "COPY FROM" query. The leader populates the common data required for the workers' execution in the DSM and shares it with the workers. The leader then executes any before statement triggers, if they exist.

The leader populates the DSM chunk entries, which include the start offset and chunk size; while populating the chunks, it reads as many blocks as required from the file into the DSM data blocks. Each block is 64K in size. The leader parses the data to identify a chunk; the existing logic from CopyReadLineText, which identifies the chunks, was used for this with some changes. The leader checks whether a free chunk entry is available to copy the information into; if there is no free entry, it waits till the required entry is freed up by a worker, and then copies the identified chunk's information (offset & chunk size) into the DSM chunk entries. This process is repeated till the complete file is processed.

Simultaneously, the workers cache the chunks (50 at a time) into local memory and release the entries to the leader for further populating. Each worker processes the chunks that it cached and inserts them into the table. The leader waits till all the populated chunks are processed by the workers, and then exits.
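[By way of illustration, an invocation would presumably look like the following; the exact option spelling is an assumption based on the description above, and the table/file names are made up:]

    -- load a file using up to 4 parallel workers (illustrative names)
    COPY orders FROM '/data/orders.csv' WITH (FORMAT csv, PARALLEL '4');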
In the future, we would like to add support for parallel copy on tables with referential integrity constraints, and for parallelizing copy from binary format files.
The above-mentioned tests were run with CSV format, a file size of 5.1GB, and 10 million records in the table. The postgres configuration and the system configuration used are attached in config.txt.
This patch was developed by me and one of my colleagues, Bharath. We would like to thank Amit, Dilip, Robert, Andres, Ants, Kuntal, Alastair, Tomas, David, Thomas, Andrew & Kyotaro for their thoughts/discussions/suggestions.
Thoughts?
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Mon, May 18, 2020 at 12:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> In the above case, even though we are executing a single command from
> the user's perspective, the currentCommandId will be four after the
> command. One increment will be for the copy command, and the other
> three increments are for locking the tuple in the PK table
> (tab_fk_referenced_chk) for the three tuples in the FK table
> (tab_fk_referencing_chk). Now, for parallel workers, it is
> (theoretically) possible that the three tuples are processed by three
> different workers, which don't get synced as of now. The question was:
> do we see any kind of problem with this, and if so, can we just sync
> it up at the end of parallelism?

I strongly disagree with the idea of "just sync(ing) it up at the end
of parallelism". That seems like a completely unprincipled approach to
the problem. Either the command counter increment is important or it's
not. If it's not important, maybe we can arrange to skip it in the
first place. If it is important, then it's probably not OK for each
backend to be doing it separately.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2020-06-03 12:13:14 -0400, Robert Haas wrote:
> On Mon, May 18, 2020 at 12:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > In the above case, even though we are executing a single command from
> > the user's perspective, the currentCommandId will be four after the
> > command. One increment will be for the copy command, and the other
> > three increments are for locking the tuple in the PK table
> > (tab_fk_referenced_chk) for the three tuples in the FK table
> > (tab_fk_referencing_chk). Now, for parallel workers, it is
> > (theoretically) possible that the three tuples are processed by three
> > different workers, which don't get synced as of now. The question was:
> > do we see any kind of problem with this, and if so, can we just sync
> > it up at the end of parallelism?

> I strongly disagree with the idea of "just sync(ing) it up at the end
> of parallelism". That seems like a completely unprincipled approach to
> the problem. Either the command counter increment is important or it's
> not. If it's not important, maybe we can arrange to skip it in the
> first place. If it is important, then it's probably not OK for each
> backend to be doing it separately.

That scares me too. These command counter increments definitely aren't
unnecessary in the general case. Even in the example you share above,
aren't we potentially going to actually lock rows multiple times from
within the same transaction, instead of once?

If the command counter increments from within ri_triggers.c aren't
visible to the other parallel workers / the leader, we'll not
correctly understand that a locked row is invisible to
heap_lock_tuple, because we're not using a new enough snapshot (by
dint of not having a new enough cid). I've not dug through everything
that'd potentially break, but it seems pretty clearly a no-go from
here.

Greetings,

Andres Freund
Hi,

On 2020-06-03 15:53:24 +0530, vignesh C wrote:
> Workers / exec time (seconds) | copy from file, 2 indexes on integer
> columns + 1 index on text column | copy from stdin, 2 indexes on
> integer columns + 1 index on text column | copy from file, 1 gist
> index on text column | copy from file, 3 indexes on integer columns |
> copy from stdin, 3 indexes on integer columns |
> 0 | 1162.772(1X) | 1176.035(1X) | 827.669(1X) | 216.171(1X) | 217.376(1X) |
> 1 | 1110.288(1.05X) | 1120.556(1.05X) | 747.384(1.11X) | 174.242(1.24X) | 163.492(1.33X) |
> 2 | 635.249(1.83X) | 668.18(1.76X) | 435.673(1.9X) | 133.829(1.61X) | 126.516(1.72X) |
> 4 | 336.835(3.45X) | 346.768(3.39X) | 236.406(3.5X) | 105.767(2.04X) | 107.382(2.02X) |
> 8 | 188.577(6.17X) | 194.491(6.04X) | 148.962(5.56X) | 100.708(2.15X) | 107.72(2.01X) |
> 16 | 126.819(9.17X) | 146.402(8.03X) | 119.923(6.9X) | 97.996(2.2X) | 106.531(2.04X) |
> 20 | 117.845(9.87X) | 149.203(7.88X) | 138.741(5.96X) | 97.94(2.21X) | 107.5(2.02X) |
> 30 | 127.554(9.11X) | 161.218(7.29X) | 172.443(4.8X) | 98.232(2.2X) | 108.778(1.99X) |

Hm. You don't explicitly mention it in your design, but given how
small the benefit of going from 0 to 1 workers is, I assume the leader
doesn't do any "chunk processing" on its own?

> Design of the Parallel Copy: The backend to which the "COPY FROM"
> query is submitted acts as the leader, with the responsibility of
> reading data from the file/stdin and launching at most n workers, as
> specified with the PARALLEL 'n' option in the "COPY FROM" query. The
> leader populates the common data required for the workers' execution
> in the DSM and shares it with the workers. The leader then executes
> any before statement triggers, if they exist.
>
> The leader populates the DSM chunk entries, which include the start
> offset and chunk size; while populating the chunks, it reads as many
> blocks as required from the file into the DSM data blocks. Each block
> is 64K in size. The leader parses the data to identify a chunk; the
> existing logic from CopyReadLineText, which identifies the chunks,
> was used for this with some changes. The leader checks whether a free
> chunk entry is available to copy the information into; if there is no
> free entry, it waits till the required entry is freed up by a worker,
> and then copies the identified chunk's information (offset & chunk
> size) into the DSM chunk entries. This process is repeated till the
> complete file is processed.
>
> Simultaneously, the workers cache the chunks (50 at a time) into
> local memory and release the entries to the leader for further
> populating. Each worker processes the chunks that it cached and
> inserts them into the table. The leader waits till all the populated
> chunks are processed by the workers, and then exits.

Why do we need the local copy of 50 chunks? Copying memory around is
far from free. I don't see why it'd be better to add per-process
caching, rather than making the DSM bigger? I can see some benefit in
marking multiple chunks as being processed with one lock acquisition,
but I don't think adding a memory copy is a good idea.

This patch *desperately* needs to be split up. It imo is close to
unreviewable, due to a large amount of changes that just move code
around without other functional changes being mixed in with the actual
new stuff.

> /*
> + * State of the chunk.
> + */ > +typedef enum ChunkState > +{ > + CHUNK_INIT, /* initial state of chunk */ > + CHUNK_LEADER_POPULATING, /* leader processing chunk */ > + CHUNK_LEADER_POPULATED, /* leader completed populating chunk */ > + CHUNK_WORKER_PROCESSING, /* worker processing chunk */ > + CHUNK_WORKER_PROCESSED /* worker completed processing chunk */ > +}ChunkState; > + > +#define RAW_BUF_SIZE 65536 /* we palloc RAW_BUF_SIZE+1 bytes */ > + > +#define DATA_BLOCK_SIZE RAW_BUF_SIZE > +#define RINGSIZE (10 * 1000) > +#define MAX_BLOCKS_COUNT 1000 > +#define WORKER_CHUNK_COUNT 50 /* should be mod of RINGSIZE */ > + > +#define IsParallelCopy() (cstate->is_parallel) > +#define IsLeader() (cstate->pcdata->is_leader) > +#define IsHeaderLine() (cstate->header_line && cstate->cur_lineno == 1) > + > +/* > + * Copy data block information. > + */ > +typedef struct CopyDataBlock > +{ > + /* The number of unprocessed chunks in the current block. */ > + pg_atomic_uint32 unprocessed_chunk_parts; > + > + /* > + * If the current chunk data is continued into another block, > + * following_block will have the position where the remaining data need to > + * be read. > + */ > + uint32 following_block; > + > + /* > + * This flag will be set, when the leader finds out this block can be read > + * safely by the worker. This helps the worker to start processing the chunk > + * early where the chunk will be spread across many blocks and the worker > + * need not wait for the complete chunk to be processed. > + */ > + bool curr_blk_completed; > + char data[DATA_BLOCK_SIZE + 1]; /* data read from file */ > +}CopyDataBlock; What's the + 1 here about? > +/* > + * Parallel copy line buffer information. > + */ > +typedef struct ParallelCopyLineBuf > +{ > + StringInfoData line_buf; > + uint64 cur_lineno; /* line number for error messages */ > +}ParallelCopyLineBuf; Why do we need separate infrastructure for this? We shouldn't duplicate infrastructure unnecessarily. > +/* > + * Common information that need to be copied to shared memory. > + */ > +typedef struct CopyWorkerCommonData > +{ Why is parallel specific stuff here suddenly not named ParallelCopy* anymore? If you introduce a naming like that it imo should be used consistently. > + /* low-level state data */ > + CopyDest copy_dest; /* type of copy source/destination */ > + int file_encoding; /* file or remote side's character encoding */ > + bool need_transcoding; /* file encoding diff from server? */ > + bool encoding_embeds_ascii; /* ASCII can be non-first byte? */ > + > + /* parameters from the COPY command */ > + bool csv_mode; /* Comma Separated Value format? */ > + bool header_line; /* CSV header line? */ > + int null_print_len; /* length of same */ > + bool force_quote_all; /* FORCE_QUOTE *? */ > + bool convert_selectively; /* do selective binary conversion? */ > + > + /* Working state for COPY FROM */ > + AttrNumber num_defaults; > + Oid relid; > +}CopyWorkerCommonData; But I actually think we shouldn't have this information in two different structs. This should exist once, independent of using parallel / non-parallel copy. > +/* List information */ > +typedef struct ListInfo > +{ > + int count; /* count of attributes */ > + > + /* string info in the form info followed by info1, info2... infon */ > + char info[1]; > +} ListInfo; Based on these comments I have no idea what this could be for. > /* > - * This keeps the character read at the top of the loop in the buffer > - * even if there is more than one read-ahead. 
> + * This keeps the character read at the top of the loop in the buffer > + * even if there is more than one read-ahead. > + */ > +#define IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(extralen) \ > +if (1) \ > +{ \ > + if (copy_buff_state.raw_buf_ptr + (extralen) >= copy_buff_state.copy_buf_len && !hit_eof) \ > + { \ > + if (IsParallelCopy()) \ > + { \ > + copy_buff_state.chunk_size = prev_chunk_size; /* update previous chunk size */ \ > + if (copy_buff_state.block_switched) \ > + { \ > + pg_atomic_sub_fetch_u32(©_buff_state.data_blk_ptr->unprocessed_chunk_parts, 1); \ > + copy_buff_state.copy_buf_len = prev_copy_buf_len; \ > + } \ > + } \ > + copy_buff_state.raw_buf_ptr = prev_raw_ptr; /* undo fetch */ \ > + need_data = true; \ > + continue; \ > + } \ > +} else ((void) 0) I think it's an absolutely clear no-go to add new branches to these. They're *really* hot already, and this is going to sprinkle a significant amount of new instructions over a lot of places. > +/* > + * SET_RAWBUF_FOR_LOAD - Set raw_buf to the shared memory where the file data must > + * be read. > + */ > +#define SET_RAWBUF_FOR_LOAD() \ > +{ \ > + ShmCopyInfo *pcshared_info = cstate->pcdata->pcshared_info; \ > + uint32 cur_block_pos; \ > + /* \ > + * Mark the previous block as completed, worker can start copying this data. \ > + */ \ > + if (copy_buff_state.data_blk_ptr != copy_buff_state.curr_data_blk_ptr && \ > + copy_buff_state.data_blk_ptr->curr_blk_completed == false) \ > + copy_buff_state.data_blk_ptr->curr_blk_completed = true; \ > + \ > + copy_buff_state.data_blk_ptr = copy_buff_state.curr_data_blk_ptr; \ > + cur_block_pos = WaitGetFreeCopyBlock(pcshared_info); \ > + copy_buff_state.curr_data_blk_ptr = &pcshared_info->data_blocks[cur_block_pos]; \ > + \ > + if (!copy_buff_state.data_blk_ptr) \ > + { \ > + copy_buff_state.data_blk_ptr = copy_buff_state.curr_data_blk_ptr; \ > + chunk_first_block = cur_block_pos; \ > + } \ > + else if (need_data == false) \ > + copy_buff_state.data_blk_ptr->following_block = cur_block_pos; \ > + \ > + cstate->raw_buf = copy_buff_state.curr_data_blk_ptr->data; \ > + copy_buff_state.copy_raw_buf = cstate->raw_buf; \ > +} > + > +/* > + * END_CHUNK_PARALLEL_COPY - Update the chunk information in shared memory. > + */ > +#define END_CHUNK_PARALLEL_COPY() \ > +{ \ > + if (!IsHeaderLine()) \ > + { \ > + ShmCopyInfo *pcshared_info = cstate->pcdata->pcshared_info; \ > + ChunkBoundaries *chunkBoundaryPtr = &pcshared_info->chunk_boundaries; \ > + if (copy_buff_state.chunk_size) \ > + { \ > + ChunkBoundary *chunkInfo = &chunkBoundaryPtr->ring[chunk_pos]; \ > + /* \ > + * If raw_buf_ptr is zero, unprocessed_chunk_parts would have been \ > + * incremented in SEEK_COPY_BUFF_POS. This will happen if the whole \ > + * chunk finishes at the end of the current block. If the \ > + * new_line_size > raw_buf_ptr, then the new block has only new line \ > + * char content. The unprocessed count should not be increased in \ > + * this case. \ > + */ \ > + if (copy_buff_state.raw_buf_ptr != 0 && \ > + copy_buff_state.raw_buf_ptr > new_line_size) \ > + pg_atomic_add_fetch_u32(©_buff_state.curr_data_blk_ptr->unprocessed_chunk_parts, 1); \ > + \ > + /* Update chunk size. 
*/ \ > + pg_atomic_write_u32(&chunkInfo->chunk_size, copy_buff_state.chunk_size); \ > + pg_atomic_write_u32(&chunkInfo->chunk_state, CHUNK_LEADER_POPULATED); \ > + elog(DEBUG1, "[Leader] After adding - chunk position:%d, chunk_size:%d", \ > + chunk_pos, copy_buff_state.chunk_size); \ > + pcshared_info->populated++; \ > + } \ > + else if (new_line_size) \ > + { \ > + /* \ > + * This means only new line char, empty record should be \ > + * inserted. \ > + */ \ > + ChunkBoundary *chunkInfo; \ > + chunk_pos = UpdateBlockInChunkInfo(cstate, -1, -1, 0, \ > + CHUNK_LEADER_POPULATED); \ > + chunkInfo = &chunkBoundaryPtr->ring[chunk_pos]; \ > + elog(DEBUG1, "[Leader] Added empty chunk with offset:%d, chunk position:%d, chunk size:%d", \ > + chunkInfo->start_offset, chunk_pos, \ > + pg_atomic_read_u32(&chunkInfo->chunk_size)); \ > + pcshared_info->populated++; \ > + } \ > + }\ > + \ > + /*\ > + * All of the read data is processed, reset index & len. In the\ > + * subsequent read, we will get a new block and copy data in to the\ > + * new block.\ > + */\ > + if (copy_buff_state.raw_buf_ptr == copy_buff_state.copy_buf_len)\ > + {\ > + cstate->raw_buf_index = 0;\ > + cstate->raw_buf_len = 0;\ > + }\ > + else\ > + cstate->raw_buf_len = copy_buff_state.copy_buf_len;\ > +} Why are these macros? They are way way way above a length where that makes any sort of sense. Greetings, Andres Freund
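(The reply further down converts these macros to functions. As a hedged illustration of what reviewers ask for here, the general shape of such a conversion is sketched below; BufState is a hypothetical stand-in for the patch's copy_buff_state, not the actual type.)

    #include <stdbool.h>

    /* Hypothetical stand-in for the patch's copy buffer state. */
    typedef struct
    {
        int raw_buf_ptr;
        int copy_buf_len;
    } BufState;

    /*
     * A static inline function instead of a statement macro: arguments
     * are evaluated once, the compiler type-checks them, and the
     * debugger has a symbol to step into, all at no runtime cost.
     */
    static inline bool
    need_refill(const BufState *state, int extralen, bool hit_eof)
    {
        return state->raw_buf_ptr + extralen >= state->copy_buf_len && !hit_eof;
    }

The `continue` in the original macro has to stay at the call site, which is one reason such code often starts life as a macro; the predicate itself still extracts cleanly.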
On Thu, Jun 4, 2020 at 12:09 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-06-03 12:13:14 -0400, Robert Haas wrote:
> > On Mon, May 18, 2020 at 12:48 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > In the above case, even though we are executing a single command from
> > > the user perspective, but the currentCommandId will be four after the
> > > command. One increment will be for the copy command and the other
> > > three increments are for locking tuple in PK table
> > > (tab_fk_referenced_chk) for three tuples in FK table
> > > (tab_fk_referencing_chk). Now, for parallel workers, it is
> > > (theoretically) possible that the three tuples are processed by three
> > > different workers which don't get synced as of now. The question was
> > > do we see any kind of problem with this and if so can we just sync it
> > > up at the end of parallelism.
>
> > I strongly disagree with the idea of "just sync(ing) it up at the end
> > of parallelism". That seems like a completely unprincipled approach to
> > the problem. Either the command counter increment is important or it's
> > not. If it's not important, maybe we can arrange to skip it in the
> > first place. If it is important, then it's probably not OK for each
> > backend to be doing it separately.
>
> That scares me too. These command counter increments definitely aren't
> unnecessary in the general case.
>

Yeah, this is what we want to understand. Can you explain how they are
useful here? AFAIU, heap_lock_tuple doesn't use commandid while storing
the transaction information of xact while locking the tuple.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hi,

On 2020-06-04 08:10:07 +0530, Amit Kapila wrote:
> On Thu, Jun 4, 2020 at 12:09 AM Andres Freund <andres@anarazel.de> wrote:
> > > I strongly disagree with the idea of "just sync(ing) it up at the end
> > > of parallelism". That seems like a completely unprincipled approach to
> > > the problem. Either the command counter increment is important or it's
> > > not. If it's not important, maybe we can arrange to skip it in the
> > > first place. If it is important, then it's probably not OK for each
> > > backend to be doing it separately.
> >
> > That scares me too. These command counter increments definitely aren't
> > unnecessary in the general case.
>
> Yeah, this is what we want to understand. Can you explain how they are
> useful here? AFAIU, heap_lock_tuple doesn't use commandid while storing
> the transaction information of xact while locking the tuple.

But the HeapTupleSatisfiesUpdate() call does use it? And even if that
weren't an issue, I don't see how it's defensible to just randomly
break the commandid coherency for parallel copy.

Greetings,

Andres Freund
On Thu, Jun 4, 2020 at 9:10 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-06-04 08:10:07 +0530, Amit Kapila wrote:
> > On Thu, Jun 4, 2020 at 12:09 AM Andres Freund <andres@anarazel.de> wrote:
> > > > I strongly disagree with the idea of "just sync(ing) it up at the end
> > > > of parallelism". That seems like a completely unprincipled approach to
> > > > the problem. Either the command counter increment is important or it's
> > > > not. If it's not important, maybe we can arrange to skip it in the
> > > > first place. If it is important, then it's probably not OK for each
> > > > backend to be doing it separately.
> > >
> > > That scares me too. These command counter increments definitely aren't
> > > unnecessary in the general case.
> >
> > Yeah, this is what we want to understand. Can you explain how they are
> > useful here? AFAIU, heap_lock_tuple doesn't use commandid while storing
> > the transaction information of xact while locking the tuple.
>
> But the HeapTupleSatisfiesUpdate() call does use it?
>

It won't use 'cid' for the lockers or multi-lockers case (AFAICS, there
is special case handling for lockers/multi-lockers). I think it is used
for updates/deletes.

> And even if that weren't an issue, I don't see how it's defensible to
> just randomly break the commandid coherency for parallel copy.
>

At this stage, we are evaluating whether there is any need to increment
the command counter for foreign key checks or whether it is just
happening because we are using some common code to execute the "SELECT
... FOR KEY SHARE" statement during these checks.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Hi All,
I've spent a little bit of time going through the project discussion that has happened in this email thread, and to start with I have a few questions which I would like to put here:
Q1) Are we also planning to read the input data in parallel, or is it only about performing the multi-insert operation in parallel? AFAIU, the data reading part will be done by the leader process alone, so no parallelism is involved there.
Q2) How are we going to deal with partitioned tables? I mean, will there be some worker process dedicated to each partition, or how is it? Further, the challenge that I see in case of partitioned tables is that we would have a single input file containing data to be inserted into multiple tables (aka partitions), unlike the normal case where all the tuples in the input file would belong to the same table.
Q3) In case of toast tables, there is a possibility of having a single tuple in the input file which could be of a very big size (probably in GB), eventually resulting in a bigger file size. So, in this case, how are we going to decide the number of worker processes to be launched? I mean, although the file size is big, the number of tuples to be processed is just one or a few of them, so can we decide the number of worker processes to be launched based on the file size?
Q4) Who is going to process constraints (particularly deferred constraints) that are supposed to be executed at COMMIT time? I mean, is it the leader process or the worker process, or in such cases won't we be choosing parallelism at all?
Q5) Do we have any risk of table bloating when the data is loaded in parallel? I am just asking this because in case of parallelism there would be multiple processes performing bulk inserts into a table. There is a chance that the table file might get extended even if there is some space in the file being written into, but that space is locked by some other worker process, and hence that might result in the creation of a new block for that table. Sorry if I am missing something here.
Please note that I haven't gone through all the emails in this thread so there is a possibility that I might have repeated the question that has already been raised and answered here. If that is the case, I am sorry for that, but it would be very helpful if someone could point out that thread so that I can go through it. Thank you.
On Fri, Jun 12, 2020 at 11:01 AM vignesh C <vignesh21@gmail.com> wrote:
On Thu, Jun 4, 2020 at 12:44 AM Andres Freund <andres@anarazel.de> wrote
>
>
> Hm. you don't explicitly mention that in your design, but given how
> small the benefits going from 0-1 workers is, I assume the leader
> doesn't do any "chunk processing" on its own?
>
Yes, you are right: the leader does not do any processing. The leader's
work is mainly to populate the shared memory with the offset
information for each record.
>
>
> > Design of the Parallel Copy: The backend, to which the "COPY FROM" query is
> > submitted acts as leader with the responsibility of reading data from the
> > file/stdin, launching at most n number of workers as specified with
> > PARALLEL 'n' option in the "COPY FROM" query. The leader populates the
> > common data required for the workers execution in the DSM and shares it
> > with the workers. The leader then executes before statement triggers if
> > there exists any. Leader populates DSM chunks which includes the start
> > offset and chunk size, while populating the chunks it reads as many blocks
> > as required into the DSM data blocks from the file. Each block is of 64K
> > size. The leader parses the data to identify a chunk, the existing logic
> > from CopyReadLineText which identifies the chunks with some changes was
> > used for this. Leader checks if a free chunk is available to copy the
> > information, if there is no free chunk it waits till the required chunk is
> > freed up by the worker and then copies the identified chunks information
> > (offset & chunk size) into the DSM chunks. This process is repeated till
> > the complete file is processed. Simultaneously, the workers cache the
> > chunks(50) locally into the local memory and release the chunks to the
> > leader for further populating. Each worker processes the chunk that it
> > cached and inserts it into the table. The leader waits till all the chunks
> > populated are processed by the workers and exits.
>
> Why do we need the local copy of 50 chunks? Copying memory around is far
> from free. I don't see why it'd be better to add per-process caching,
> rather than making the DSM bigger? I can see some benefit in marking
> multiple chunks as being processed with one lock acquisition, but I
> don't think adding a memory copy is a good idea.
We ran performance tests with a csv data file (5.1GB, 10 million
tuples, 2 indexes on integer columns); results for the same are given
below. We noticed in some cases the performance is better if we copy
the 50 records locally and release the shared memory. We will get
better benefits as the workers increase. Thoughts?
---------------------------------------------------------------------------
 Workers | Exec time (with local copying of | Exec time (without copying,
         | 50 records & releasing the       | processing record by record)
         | shared memory)                   |
---------------------------------------------------------------------------
 0       | 1162.772 (1X)                    | 1152.684 (1X)
 2       | 635.249 (1.83X)                  | 647.894 (1.78X)
 4       | 336.835 (3.45X)                  | 335.534 (3.43X)
 8       | 188.577 (6.17X)                  | 189.461 (6.08X)
 16      | 126.819 (9.17X)                  | 142.730 (8.07X)
 20      | 117.845 (9.87X)                  | 146.533 (7.87X)
 30      | 127.554 (9.11X)                  | 160.307 (7.19X)
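(One concrete shape of the alternative suggested above, claiming a batch of ring slots with a single atomic operation and processing them in place with no local memcpy, is sketched here. The Ring layout is hypothetical, and the sketch elides the wraparound and leader-race handling a real implementation needs.)

    #include <stdatomic.h>
    #include <stdint.h>

    #define RINGSIZE   (10 * 1000)
    #define BATCH_SIZE 50

    typedef struct
    {
        uint32_t block_id;      /* which 64K data block the chunk lives in */
        uint32_t start_offset;  /* chunk start within that block */
        uint32_t size;          /* chunk size in bytes */
    } ChunkEntry;

    typedef struct
    {
        atomic_uint next_unclaimed;  /* next slot no worker owns yet */
        uint32_t    populated;       /* slots the leader has filled
                                      * (simplified: needs synchronization too) */
        ChunkEntry  entries[RINGSIZE];
    } Ring;

    /*
     * Reserve up to BATCH_SIZE consecutive slots with one atomic
     * fetch-add, instead of memcpy'ing chunk data into worker-local
     * memory. The worker then processes entries[(first + i) % RINGSIZE]
     * in place, straight out of shared memory.
     */
    static uint32_t
    claim_batch(Ring *ring, uint32_t *nclaimed)
    {
        uint32_t first = atomic_fetch_add(&ring->next_unclaimed, BATCH_SIZE);
        uint32_t avail = (first < ring->populated) ? ring->populated - first : 0;

        *nclaimed = (avail < BATCH_SIZE) ? avail : BATCH_SIZE;
        /* Over-claimed slots (avail < BATCH_SIZE) would have to be
         * re-offered in a real implementation; that is elided here. */
        return first;
    }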
> This patch *desperately* needs to be split up. It imo is close to
> unreviewable, due to a large amount of changes that just move code
> around without other functional changes being mixed in with the actual
> new stuff.
I have split the patch, the new split patches are attached.
>
>
>
> > /*
> > + * State of the chunk.
> > + */
> > +typedef enum ChunkState
> > +{
> > + CHUNK_INIT, /* initial state of chunk */
> > + CHUNK_LEADER_POPULATING, /* leader processing chunk */
> > + CHUNK_LEADER_POPULATED, /* leader completed populating chunk */
> > + CHUNK_WORKER_PROCESSING, /* worker processing chunk */
> > + CHUNK_WORKER_PROCESSED /* worker completed processing chunk */
> > +}ChunkState;
> > +
> > +#define RAW_BUF_SIZE 65536 /* we palloc RAW_BUF_SIZE+1 bytes */
> > +
> > +#define DATA_BLOCK_SIZE RAW_BUF_SIZE
> > +#define RINGSIZE (10 * 1000)
> > +#define MAX_BLOCKS_COUNT 1000
> > +#define WORKER_CHUNK_COUNT 50 /* should be mod of RINGSIZE */
> > +
> > +#define IsParallelCopy() (cstate->is_parallel)
> > +#define IsLeader() (cstate->pcdata->is_leader)
> > +#define IsHeaderLine() (cstate->header_line && cstate->cur_lineno == 1)
> > +
> > +/*
> > + * Copy data block information.
> > + */
> > +typedef struct CopyDataBlock
> > +{
> > + /* The number of unprocessed chunks in the current block. */
> > + pg_atomic_uint32 unprocessed_chunk_parts;
> > +
> > + /*
> > + * If the current chunk data is continued into another block,
> > + * following_block will have the position where the remaining data need to
> > + * be read.
> > + */
> > + uint32 following_block;
> > +
> > + /*
> > + * This flag will be set, when the leader finds out this block can be read
> > + * safely by the worker. This helps the worker to start processing the chunk
> > + * early where the chunk will be spread across many blocks and the worker
> > + * need not wait for the complete chunk to be processed.
> > + */
> > + bool curr_blk_completed;
> > + char data[DATA_BLOCK_SIZE + 1]; /* data read from file */
> > +}CopyDataBlock;
>
> What's the + 1 here about?
Fixed this, removed +1. That is not needed.
>
>
> > +/*
> > + * Parallel copy line buffer information.
> > + */
> > +typedef struct ParallelCopyLineBuf
> > +{
> > + StringInfoData line_buf;
> > + uint64 cur_lineno; /* line number for error messages */
> > +}ParallelCopyLineBuf;
>
> Why do we need separate infrastructure for this? We shouldn't duplicate
> infrastructure unnecessarily.
>
This was required for copying the multiple records locally and
releasing the shared memory. I have not changed this, will decide on
this based on the decision taken for one of the previous comments.
>
>
>
> > +/*
> > + * Common information that need to be copied to shared memory.
> > + */
> > +typedef struct CopyWorkerCommonData
> > +{
>
> Why is parallel specific stuff here suddenly not named ParallelCopy*
> anymore? If you introduce a naming like that it imo should be used
> consistently.
Fixed, changed to maintain ParallelCopy in all structs.
>
> > + /* low-level state data */
> > + CopyDest copy_dest; /* type of copy source/destination */
> > + int file_encoding; /* file or remote side's character encoding */
> > + bool need_transcoding; /* file encoding diff from server? */
> > + bool encoding_embeds_ascii; /* ASCII can be non-first byte? */
> > +
> > + /* parameters from the COPY command */
> > + bool csv_mode; /* Comma Separated Value format? */
> > + bool header_line; /* CSV header line? */
> > + int null_print_len; /* length of same */
> > + bool force_quote_all; /* FORCE_QUOTE *? */
> > + bool convert_selectively; /* do selective binary conversion? */
> > +
> > + /* Working state for COPY FROM */
> > + AttrNumber num_defaults;
> > + Oid relid;
> > +}CopyWorkerCommonData;
>
> But I actually think we shouldn't have this information in two different
> structs. This should exist once, independent of using parallel /
> non-parallel copy.
>
This structure helps in storing the common data from CopyStateData
that are required by the workers. This information will then be
allocated and stored into the DSM for the worker to retrieve and copy
it to CopyStateData.
>
> > +/* List information */
> > +typedef struct ListInfo
> > +{
> > + int count; /* count of attributes */
> > +
> > + /* string info in the form info followed by info1, info2... infon */
> > + char info[1];
> > +} ListInfo;
>
> Based on these comments I have no idea what this could be for.
>
Have added better comments for this. The following is added: This
structure will help in converting a List data type into the below
structure format with the count having the number of elements in the
list and the info having the List elements appended contiguously. This
converted structure will be allocated in shared memory and stored in
DSM for the worker to retrieve and later convert it back to List data
type.
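(A small, self-contained sketch of the flattening that comment describes: count first, then the list elements appended contiguously, the way a List would be placed in the DSM. Names and layout here are illustrative, not the patch's exact definitions.)

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Illustrative flattened layout: count, then NUL-terminated strings. */
    typedef struct
    {
        int  count;     /* number of elements */
        char info[];    /* info1\0info2\0 ... infoN\0, contiguous */
    } ListInfo;

    static ListInfo *
    flatten(const char **items, int count)
    {
        size_t len = 0;
        for (int i = 0; i < count; i++)
            len += strlen(items[i]) + 1;

        ListInfo *li = malloc(sizeof(ListInfo) + len);
        li->count = count;

        char *p = li->info;
        for (int i = 0; i < count; i++)
        {
            size_t n = strlen(items[i]) + 1;
            memcpy(p, items[i], n);     /* element plus its terminator */
            p += n;
        }
        return li;
    }

    int main(void)
    {
        const char *cols[] = { "id", "name", "price" };
        ListInfo *li = flatten(cols, 3);

        /* A worker walks the buffer to rebuild the list. */
        const char *p = li->info;
        for (int i = 0; i < li->count; i++, p += strlen(p) + 1)
            printf("element %d: %s\n", i, p);

        free(li);
        return 0;
    }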
>
> > /*
> > - * This keeps the character read at the top of the loop in the buffer
> > - * even if there is more than one read-ahead.
> > + * This keeps the character read at the top of the loop in the buffer
> > + * even if there is more than one read-ahead.
> > + */
> > +#define IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(extralen) \
> > +if (1) \
> > +{ \
> > + if (copy_buff_state.raw_buf_ptr + (extralen) >= copy_buff_state.copy_buf_len && !hit_eof) \
> > + { \
> > + if (IsParallelCopy()) \
> > + { \
> > + copy_buff_state.chunk_size = prev_chunk_size; /* update previous chunk size */ \
> > + if (copy_buff_state.block_switched) \
> > + { \
> > + pg_atomic_sub_fetch_u32(&copy_buff_state.data_blk_ptr->unprocessed_chunk_parts, 1); \
> > + copy_buff_state.copy_buf_len = prev_copy_buf_len; \
> > + } \
> > + } \
> > + copy_buff_state.raw_buf_ptr = prev_raw_ptr; /* undo fetch */ \
> > + need_data = true; \
> > + continue; \
> > + } \
> > +} else ((void) 0)
>
> I think it's an absolutely clear no-go to add new branches to
> these. They're *really* hot already, and this is going to sprinkle a
> significant amount of new instructions over a lot of places.
>
Fixed, removed this.
>
>
> > +/*
> > + * SET_RAWBUF_FOR_LOAD - Set raw_buf to the shared memory where the file data must
> > + * be read.
> > + */
> > +#define SET_RAWBUF_FOR_LOAD() \
> > +{ \
> > + ShmCopyInfo *pcshared_info = cstate->pcdata->pcshared_info; \
> > + uint32 cur_block_pos; \
> > + /* \
> > + * Mark the previous block as completed, worker can start copying this data. \
> > + */ \
> > + if (copy_buff_state.data_blk_ptr != copy_buff_state.curr_data_blk_ptr && \
> > + copy_buff_state.data_blk_ptr->curr_blk_completed == false) \
> > + copy_buff_state.data_blk_ptr->curr_blk_completed = true; \
> > + \
> > + copy_buff_state.data_blk_ptr = copy_buff_state.curr_data_blk_ptr; \
> > + cur_block_pos = WaitGetFreeCopyBlock(pcshared_info); \
> > + copy_buff_state.curr_data_blk_ptr = &pcshared_info->data_blocks[cur_block_pos]; \
> > + \
> > + if (!copy_buff_state.data_blk_ptr) \
> > + { \
> > + copy_buff_state.data_blk_ptr = copy_buff_state.curr_data_blk_ptr; \
> > + chunk_first_block = cur_block_pos; \
> > + } \
> > + else if (need_data == false) \
> > + copy_buff_state.data_blk_ptr->following_block = cur_block_pos; \
> > + \
> > + cstate->raw_buf = copy_buff_state.curr_data_blk_ptr->data; \
> > + copy_buff_state.copy_raw_buf = cstate->raw_buf; \
> > +}
> > +
> > +/*
> > + * END_CHUNK_PARALLEL_COPY - Update the chunk information in shared memory.
> > + */
> > +#define END_CHUNK_PARALLEL_COPY() \
> > +{ \
> > + if (!IsHeaderLine()) \
> > + { \
> > + ShmCopyInfo *pcshared_info = cstate->pcdata->pcshared_info; \
> > + ChunkBoundaries *chunkBoundaryPtr = &pcshared_info->chunk_boundaries; \
> > + if (copy_buff_state.chunk_size) \
> > + { \
> > + ChunkBoundary *chunkInfo = &chunkBoundaryPtr->ring[chunk_pos]; \
> > + /* \
> > + * If raw_buf_ptr is zero, unprocessed_chunk_parts would have been \
> > + * incremented in SEEK_COPY_BUFF_POS. This will happen if the whole \
> > + * chunk finishes at the end of the current block. If the \
> > + * new_line_size > raw_buf_ptr, then the new block has only new line \
> > + * char content. The unprocessed count should not be increased in \
> > + * this case. \
> > + */ \
> > + if (copy_buff_state.raw_buf_ptr != 0 && \
> > + copy_buff_state.raw_buf_ptr > new_line_size) \
> > + pg_atomic_add_fetch_u32(&copy_buff_state.curr_data_blk_ptr->unprocessed_chunk_parts, 1); \
> > + \
> > + /* Update chunk size. */ \
> > + pg_atomic_write_u32(&chunkInfo->chunk_size, copy_buff_state.chunk_size); \
> > + pg_atomic_write_u32(&chunkInfo->chunk_state, CHUNK_LEADER_POPULATED); \
> > + elog(DEBUG1, "[Leader] After adding - chunk position:%d, chunk_size:%d", \
> > + chunk_pos, copy_buff_state.chunk_size); \
> > + pcshared_info->populated++; \
> > + } \
> > + else if (new_line_size) \
> > + { \
> > + /* \
> > + * This means only new line char, empty record should be \
> > + * inserted. \
> > + */ \
> > + ChunkBoundary *chunkInfo; \
> > + chunk_pos = UpdateBlockInChunkInfo(cstate, -1, -1, 0, \
> > + CHUNK_LEADER_POPULATED); \
> > + chunkInfo = &chunkBoundaryPtr->ring[chunk_pos]; \
> > + elog(DEBUG1, "[Leader] Added empty chunk with offset:%d, chunk position:%d, chunk size:%d", \
> > + chunkInfo->start_offset, chunk_pos, \
> > + pg_atomic_read_u32(&chunkInfo->chunk_size)); \
> > + pcshared_info->populated++; \
> > + } \
> > + }\
> > + \
> > + /*\
> > + * All of the read data is processed, reset index & len. In the\
> > + * subsequent read, we will get a new block and copy data in to the\
> > + * new block.\
> > + */\
> > + if (copy_buff_state.raw_buf_ptr == copy_buff_state.copy_buf_len)\
> > + {\
> > + cstate->raw_buf_index = 0;\
> > + cstate->raw_buf_len = 0;\
> > + }\
> > + else\
> > + cstate->raw_buf_len = copy_buff_state.copy_buf_len;\
> > +}
>
> Why are these macros? They are way way way above a length where that
> makes any sort of sense.
>
Converted these macros to functions.
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Hi,
Attached is the patch supporting parallel copy for binary format files.
The performance improvement achieved with different numbers of workers is shown below. The dataset used has 10 million tuples and is 5.3GB in size. Execution times are in seconds.

 Workers | Test 1: copy from binary file, | Test 2: copy from binary  | Test 3: copy from binary
         | 2 indexes on integer columns   | file, 1 gist index on     | file, 3 indexes on
         | and 1 index on text column     | text column               | integer columns
 --------+--------------------------------+---------------------------+--------------------------
 0       | 1106.899 (1X)                  | 772.758 (1X)              | 171.338 (1X)
 1       | 1094.165 (1.01X)               | 757.365 (1.02X)           | 163.018 (1.05X)
 2       | 618.397 (1.79X)                | 428.304 (1.8X)            | 117.508 (1.46X)
 4       | 320.511 (3.45X)                | 231.938 (3.33X)           | 80.297 (2.13X)
 8       | 172.462 (6.42X)                | 150.212 (5.14X)           | 71.518 (2.39X)
 16      | 110.460 (10.02X)               | 124.929 (6.18X)           | 91.308 (1.88X)
 20      | 98.470 (11.24X)                | 137.313 (5.63X)           | 95.289 (1.79X)
 30      | 109.229 (10.13X)               | 173.54 (4.45X)            | 95.799 (1.78X)
Design followed for developing this patch:
The leader reads data from the file into the DSM data blocks, each of 64K size. It also identifies, for each tuple, the data block id, start offset, end offset and tuple size, and updates this information in the ring data structure. Workers read the tuple information from the ring data structure and the actual tuple data from the data blocks in parallel, and insert the tuples into the table in parallel, as sketched below.
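(To visualize the design, here is a self-contained sketch of the shared area just described: 64K data blocks plus a ring of per-tuple entries carrying block id, offsets and size. Field names are illustrative stand-ins, not the patch's exact structures.)

    #include <stdint.h>

    #define DATA_BLOCK_SIZE 65536
    #define RINGSIZE        (10 * 1000)
    #define MAX_BLOCKS      1000

    /* One shared 64K block of raw file data, filled by the leader. */
    typedef struct
    {
        char data[DATA_BLOCK_SIZE];
    } DataBlock;

    /* One ring entry: where a single binary tuple's bytes live. */
    typedef struct
    {
        uint32_t block_id;      /* block holding the tuple's first byte */
        uint32_t start_offset;  /* tuple's first byte within that block */
        uint32_t end_offset;    /* its last byte; for a tuple spanning
                                 * blocks this lies in a later block */
        uint32_t size;          /* total tuple size in bytes */
    } TupleEntry;

    /* The leader advances a producer position over ring[]; each worker
     * claims entries and reads tuple bytes straight out of blocks[], so
     * tuple data is never copied out of shared memory before insertion. */
    typedef struct
    {
        DataBlock  blocks[MAX_BLOCKS];
        TupleEntry ring[RINGSIZE];
    } SharedCopyArea;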
Please note that this patch can be applied on the series of patches that were posted previously[1] for parallel copy for csv/text files.
The correct order to apply all the patches is -
0003-Allow-copy-from-command-to-process-data-from-file-ST.patch and
0005-Parallel-Copy-For-Binary-Format-Files.patch
The above tests were run with the attached configuration (config.txt), which is the same one used for the performance tests of csv/text files posted earlier in this mail chain.
Request the community to take this patch up for review along with the parallel copy for csv/text file patches and provide feedback.
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Thanks Amit for the clarifications. Regarding partitioned tables, one of the questions was: if we are loading data into a partitioned table using the COPY command, then the input file would contain tuples for different tables (partitions), unlike the normal table case where all the tuples in the input file would belong to the same table. So, in such a case, how are we going to accumulate tuples into the DSM? I mean, will the leader process check which tuple needs to be routed to which partition and accordingly accumulate them into the DSM? For e.g., let's say in the input data file we have 10 tuples where the 1st tuple belongs to partition1, the 2nd belongs to partition2 and likewise. So, in such cases, will the leader process accumulate all the tuples belonging to partition1 into one DSM and tuples belonging to partition2 into some other DSM and assign them to the worker process, or have we taken some other approach to handle this scenario?
Further, I haven't got much time to look into the links that you have shared in your previous response. Will have a look into those and will also slowly start looking into the patches as and when I get some time. Thank you.
On Sat, Jun 13, 2020 at 9:42 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
On Fri, Jun 12, 2020 at 4:57 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hi All,
>
> I've spent a little bit of time going through the project discussion that has happened in this email thread, and to start with I have a few questions which I would like to put here:
>
> Q1) Are we also planning to read the input data in parallel or is it only about performing the multi-insert operation in parallel? AFAIU, the data reading part will be done by the leader process alone so no parallelism is involved there.
>
Yes, your understanding is correct.
> Q2) How are we going to deal with the partitioned tables?
>
I haven't studied the patch but my understanding is that we will
support parallel copy for partitioned tables with a few restrictions
as explained in my earlier email [1]. See, Case-2 (b) in the email.
> I mean will there be some worker process dedicated for each partition or how is it?
No, the split is just based on the input; otherwise, each worker
should insert as we would have done without any workers.
> Q3) In case of toast tables, there is a possibility of having a single tuple in the input file which could be of a very big size (probably in GB), eventually resulting in a bigger file size. So, in this case, how are we going to decide the number of worker processes to be launched? I mean, although the file size is big, the number of tuples to be processed is just one or a few of them, so can we decide the number of worker processes to be launched based on the file size?
>
Yeah, such situations would be tricky, so we should have an option for
the user to specify the number of workers.
> Q4) Who is going to process constraints (preferably the deferred constraint) that is supposed to be executed at the COMMIT time? I mean is it the leader process or the worker process or in such cases we won't be choosing the parallelism at all?
>
In the first version, we won't do parallelism for this. Again, see
one of my earlier emails [1] where I have explained this and other
cases where we won't be supporting parallel copy.
> Q5) Do we have any risk of table bloating when the data is loaded in parallel? I am just asking this because in case of parallelism there would be multiple processes performing bulk inserts into a table. There is a chance that the table file might get extended even if there is some space in the file being written into, but that space is locked by some other worker process, and hence that might result in the creation of a new block for that table. Sorry if I am missing something here.
>
Hmm, each worker will operate at the page level; after the first
insertion, the same worker will try to insert in the same page in
which it inserted last, so there shouldn't be such a problem.
> Please note that I haven't gone through all the emails in this thread so there is a possibility that I might have repeated the question that has already been raised and answered here. If that is the case, I am sorry for that, but it would be very helpful if someone could point out that thread so that I can go through it. Thank you.
>
No problem, I understand sometimes it is difficult to go through each
and every email, especially when the discussion is long. Anyway,
thanks for showing interest in the patch.
[1] - https://www.postgresql.org/message-id/CAA4eK1%2BANNEaMJCCXm4naweP5PLY6LhJMvGo_V7-Pnfbh6GsOA%40mail.gmail.com
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
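(A toy model of the bulk-insert behaviour described in the answer to Q5 above: each worker remembers the page it last wrote and extends the relation only when that page is full, so concurrent workers need not leave half-empty pages behind. This is a simplified simulation, not heapam code; the names are hypothetical.)

    #include <stdint.h>

    #define PAGE_CAPACITY 100          /* tuples per page; toy number */

    typedef struct
    {
        uint32_t last_page;            /* page this worker last inserted into */
        uint32_t used_on_page;         /* tuples already placed on that page,
                                        * initialized to PAGE_CAPACITY so the
                                        * first insert claims a fresh page */
    } WorkerInsertState;

    /* Returns the page the next tuple goes to; *npages (the relation
     * length) grows only when this worker's own current page is full. */
    static uint32_t
    target_page(WorkerInsertState *ws, uint32_t *npages)
    {
        if (ws->used_on_page == PAGE_CAPACITY)
        {
            ws->last_page = (*npages)++;   /* extend: claim a fresh page */
            ws->used_on_page = 0;
        }
        ws->used_on_page++;
        return ws->last_page;
    }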
On Mon, Jun 15, 2020 at 7:41 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Thanks Amit for the clarifications. Regarding partitioned tables, one
> of the questions was: if we are loading data into a partitioned table
> using the COPY command, then the input file would contain tuples for
> different tables (partitions), unlike the normal table case where all
> the tuples in the input file would belong to the same table. So, in
> such a case, how are we going to accumulate tuples into the DSM? I
> mean, will the leader process check which tuple needs to be routed to
> which partition and accordingly accumulate them into the DSM? For
> e.g., let's say in the input data file we have 10 tuples where the 1st
> tuple belongs to partition1, the 2nd belongs to partition2 and
> likewise. So, in such cases, will the leader process accumulate all
> the tuples belonging to partition1 into one DSM and tuples belonging
> to partition2 into some other DSM and assign them to the worker
> process, or have we taken some other approach to handle this scenario?
>

No, all the tuples (for all partitions) will be accumulated in a
single DSM and the workers/leader will route the tuple to an
appropriate partition.

> Further, I haven't got much time to look into the links that you have
> shared in your previous response. Will have a look into those and will
> also slowly start looking into the patches as and when I get some
> time. Thank you.
>

Yeah, it will be good if you go through all the emails once because
most of the decisions (and design) in the patch are based on the
discussion in this thread.

Note - Please don't top post, try to give inline replies.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
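(A sketch of the routing just described: one shared queue of tuples for all partitions, with each consumer computing the target partition per tuple. partition_for() is a hypothetical stand-in for PostgreSQL's tuple-routing machinery, and the queue is a plain array for illustration.)

    #include <stdio.h>

    #define NPARTS 4

    /* Hypothetical stand-in for the executor's tuple-routing lookup. */
    static int
    partition_for(int key)
    {
        return key % NPARTS;    /* a hash/range partition lookup in reality */
    }

    int main(void)
    {
        /* A single shared queue holds tuples destined for all partitions. */
        int queue[] = { 10, 7, 42, 13, 8, 3 };
        int n = sizeof(queue) / sizeof(queue[0]);

        /* Each worker runs this loop over the entries it claims: route
         * the tuple, then insert into that partition itself. */
        for (int i = 0; i < n; i++)
            printf("tuple %d -> partition %d\n",
                   queue[i], partition_for(queue[i]));
        return 0;
    }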
Hi,

I have included tests for the parallel copy feature; a few bugs that
were identified during testing have been fixed. Attached patches for
the same. Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

On Tue, Jun 16, 2020 at 3:21 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> No, all the tuples (for all partitions) will be accumulated in a
> single DSM and the workers/leader will route the tuple to an
> appropriate partition.
>
> Yeah, it will be good if you go through all the emails once because
> most of the decisions (and design) in the patch are based on the
> discussion in this thread.
>
> Note - Please don't top post, try to give inline replies.
Attachment
- 0005-Tests-for-parallel-copy.patch
- 0004-Documentation-for-parallel-copy.patch
- 0001-Copy-code-readjustment-to-support-parallel-copy.patch
- 0002-Framework-for-leader-worker-in-parallel-copy.patch
- 0003-Allow-copy-from-command-to-process-data-from-file-ST.patch
- 0006-Parallel-Copy-For-Binary-Format-Files.patch
On Mon, Jun 15, 2020 at 4:39 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> The above tests were run with the configuration attached config.txt, which is the same used for performance tests of csv/text files posted earlier in this mail chain.
>
> Request the community to take this patch up for review along with the parallel copy for csv/text file patches and provide feedback.
>

I had reviewed the patch, a few comments:

+
+   /*
+    * Parallel copy for binary formatted files
+    */
+   ParallelCopyDataBlock *curr_data_block;
+   ParallelCopyDataBlock *prev_data_block;
+   uint32 curr_data_offset;
+   uint32 curr_block_pos;
+   ParallelCopyTupleInfo curr_tuple_start_info;
+   ParallelCopyTupleInfo curr_tuple_end_info;
 } CopyStateData;

The new members added should be present in ParallelCopyData.

+ if (cstate->curr_tuple_start_info.block_id == cstate->curr_tuple_end_info.block_id)
+ {
+     elog(DEBUG1,"LEADER - tuple lies in a single data block");
+
+     line_size = cstate->curr_tuple_end_info.offset - cstate->curr_tuple_start_info.offset + 1;
+     pg_atomic_add_fetch_u32(&pcshared_info->data_blocks[cstate->curr_tuple_start_info.block_id].unprocessed_line_parts, 1);
+ }
+ else
+ {
+     uint32 following_block_id = pcshared_info->data_blocks[cstate->curr_tuple_start_info.block_id].following_block;
+
+     elog(DEBUG1,"LEADER - tuple is spread across data blocks");
+
+     line_size = DATA_BLOCK_SIZE - cstate->curr_tuple_start_info.offset -
+                 pcshared_info->data_blocks[cstate->curr_tuple_start_info.block_id].skip_bytes;
+
+     pg_atomic_add_fetch_u32(&pcshared_info->data_blocks[cstate->curr_tuple_start_info.block_id].unprocessed_line_parts, 1);
+
+     while (following_block_id != cstate->curr_tuple_end_info.block_id)
+     {
+         line_size = line_size + DATA_BLOCK_SIZE - pcshared_info->data_blocks[following_block_id].skip_bytes;
+
+         pg_atomic_add_fetch_u32(&pcshared_info->data_blocks[following_block_id].unprocessed_line_parts, 1);
+
+         following_block_id = pcshared_info->data_blocks[following_block_id].following_block;
+
+         if (following_block_id == -1)
+             break;
+     }
+
+     if (following_block_id != -1)
+         pg_atomic_add_fetch_u32(&pcshared_info->data_blocks[following_block_id].unprocessed_line_parts, 1);
+
+     line_size = line_size + cstate->curr_tuple_end_info.offset + 1;
+ }

line_size can be set as and when we process the tuple from
CopyReadBinaryTupleLeader, and this can be set at the end. That way the
above code can be removed.

+
+   /*
+    * Parallel copy for binary formatted files
+    */
+   ParallelCopyDataBlock *curr_data_block;
+   ParallelCopyDataBlock *prev_data_block;
+   uint32 curr_data_offset;
+   uint32 curr_block_pos;
+   ParallelCopyTupleInfo curr_tuple_start_info;
+   ParallelCopyTupleInfo curr_tuple_end_info;
 } CopyStateData;

curr_block_pos is present in ParallelCopyShmInfo; we could use it and
remove it from here. For curr_data_offset, a similar variable
raw_buf_index is present in CopyStateData; we could use that and remove
this one.

+ if (cstate->curr_data_offset + sizeof(fld_count) >= (DATA_BLOCK_SIZE - 1))
+ {
+     ParallelCopyDataBlock *data_block = NULL;
+     uint8 movebytes = 0;
+
+     block_pos = WaitGetFreeCopyBlock(pcshared_info);
+
+     movebytes = DATA_BLOCK_SIZE - cstate->curr_data_offset;
+
+     cstate->curr_data_block->skip_bytes = movebytes;
+
+     data_block = &pcshared_info->data_blocks[block_pos];
+
+     if (movebytes > 0)
+         memmove(&data_block->data[0], &cstate->curr_data_block->data[cstate->curr_data_offset],
+                 movebytes);
+
+     elog(DEBUG1, "LEADER - field count is spread across data blocks - moved %d bytes from current block %u to %u block",
+          movebytes, cstate->curr_block_pos, block_pos);
+
+     readbytes = CopyGetData(cstate, &data_block->data[movebytes], 1, (DATA_BLOCK_SIZE - movebytes));
+
+     elog(DEBUG1, "LEADER - bytes read from file after field count is moved to next data block %d", readbytes);
+
+     if (cstate->reached_eof)
+         ereport(ERROR,
+                 (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
+                  errmsg("unexpected EOF in COPY data")));
+
+     cstate->curr_data_block = data_block;
+     cstate->curr_data_offset = 0;
+     cstate->curr_block_pos = block_pos;
+ }

This code is duplicated in CopyReadBinaryTupleLeader &
CopyReadBinaryAttributeLeader. We could make a function and re-use it.

+/*
+ * CopyReadBinaryAttributeWorker - leader identifies boundaries/offsets
+ * for each attribute/column, it moves on to next data block if the
+ * attribute/column is spread across data blocks.
+ */
+static pg_attribute_always_inline Datum
+CopyReadBinaryAttributeWorker(CopyState cstate, int column_no,
+    FmgrInfo *flinfo, Oid typioparam, int32 typmod, bool *isnull)
+{
+    int32 fld_size;
+    Datum result;

column_no is not used; it can be removed.

+ if (fld_count == -1)
+ {
+     /*
+      * Received EOF marker. In a V3-protocol copy, wait for the
+      * protocol-level EOF, and complain if it doesn't come
+      * immediately. This ensures that we correctly handle CopyFail,
+      * if client chooses to send that now.
+      *
+      * Note that we MUST NOT try to read more data in an old-protocol
+      * copy, since there is no protocol-level EOF marker then. We
+      * could go either way for copy from file, but choose to throw
+      * error if there's data after the EOF marker, for consistency
+      * with the new-protocol case.
+      */
+     char dummy;
+
+     if (cstate->copy_dest != COPY_OLD_FE &&
+         CopyGetData(cstate, &dummy, 1, 1) > 0)
+         ereport(ERROR,
+                 (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
+                  errmsg("received copy data after EOF marker")));
+     return true;
+ }
+
+ if (fld_count != attr_count)
+     ereport(ERROR,
+             (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
+              errmsg("row field count is %d, expected %d",
+                     (int) fld_count, attr_count)));
+
+ cstate->curr_tuple_start_info.block_id = cstate->curr_block_pos;
+ cstate->curr_tuple_start_info.offset = cstate->curr_data_offset;
+ cstate->curr_data_offset = cstate->curr_data_offset + sizeof(fld_count);
+ new_block_pos = cstate->curr_block_pos;
+
+ foreach(cur, cstate->attnumlist)
+ {
+     int attnum = lfirst_int(cur);
+     int m = attnum - 1;
+     Form_pg_attribute att = TupleDescAttr(tupDesc, m);

The above code is present in NextCopyFrom & CopyReadBinaryTupleLeader;
check if we can make a common function, or we could use NextCopyFrom as
it is.

+ memcpy(&fld_count, &cstate->curr_data_block->data[cstate->curr_data_offset], sizeof(fld_count));
+ fld_count = (int16) pg_ntoh16(fld_count);
+
+ if (fld_count == -1)
+ {
+     return true;
+ }

Should this be an assert in the CopyReadBinaryTupleWorker function, as
this check is already done in the leader?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Hi,
1) Can you please add some comments atop the new function PopulateAttributes() describing its functionality in detail? Further, this new function contains the code from BeginCopy() to set attribute-level options used with COPY FROM such as FORCE_QUOTE, FORCE_NOT_NULL, FORCE_NULL etc. in cstate, and along with that it also copies the code from BeginCopy() to set other info such as the client encoding type, encoding conversion etc. Hence, I think it would be good to give it a better name, basically something that matches what it actually does.
2) Again, the name for the new function CheckCopyFromValidity() doesn't look good to me. From the function name it appears as if it does the sanity check of the entire COPY FROM command, but actually it is just doing the sanity check for the target relation specified with COPY FROM. So, probably something like CheckTargetRelValidity would look more sensible, I think? TBH, I am not good at naming the functions so you can always ignore my suggestions about function and variable names :)
3) Any reason for not making CheckCopyFromValidity as a macro instead of a new function. It is just doing the sanity check for the target relation.
4) Earlier, in the CopyReadLine() function, while trying to clear the EOL marker from cstate->line_buf.data (copied data), we were not checking whether the line read by the CopyReadLineText() function is a header line, but I can see that your patch checks that before clearing the EOL marker. Any reason for this extra check?
5) I noticed the below spurious line removal in the patch.
@@ -3839,7 +3953,6 @@ static bool
CopyReadLine(CopyState cstate)
{
bool result;
-
Please note that I haven't got a chance to look into other patches as of now. I will do that whenever possible. Thank you.
On Fri, Jun 12, 2020 at 11:01 AM vignesh C <vignesh21@gmail.com> wrote:
On Thu, Jun 4, 2020 at 12:44 AM Andres Freund <andres@anarazel.de> wrote:
>
>
> Hm. you don't explicitly mention that in your design, but given how
> small the benefits going from 0-1 workers is, I assume the leader
> doesn't do any "chunk processing" on its own?
>
Yes, you are right; the leader does not do any chunk processing. The
leader's work is mainly to populate the shared memory with the offset
information for each record.
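As a rough illustration of that division of labour, a simplified sketch of the leader loop follows; every helper name here is a placeholder, not the patch's code, and only the ring-entry fields quoted elsewhere in this thread are assumed:

    /*
     * Simplified leader loop: read file data into shared 64K blocks,
     * find record boundaries, and publish (offset, size) ring entries
     * for the workers to consume.
     */
    while (!cstate->reached_eof)
    {
        uint32 start_offset;
        uint32 size;
        int    pos;

        ReadIntoSharedBlocks(cstate);       /* fill DSM data blocks */
        while (FindNextRecordBoundary(cstate, &start_offset, &size))
        {
            pos = WaitGetFreeChunkEntry(pcshared_info); /* may wait on workers */
            pcshared_info->chunk_boundaries.ring[pos].start_offset = start_offset;
            pg_atomic_write_u32(&pcshared_info->chunk_boundaries.ring[pos].chunk_size,
                                size);
            pg_atomic_write_u32(&pcshared_info->chunk_boundaries.ring[pos].chunk_state,
                                CHUNK_LEADER_POPULATED);
            pcshared_info->populated++;
        }
    }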
>
>
> > Design of the Parallel Copy: The backend, to which the "COPY FROM" query is
> > submitted acts as leader with the responsibility of reading data from the
> > file/stdin, launching at most n number of workers as specified with
> > PARALLEL 'n' option in the "COPY FROM" query. The leader populates the
> > common data required for the workers execution in the DSM and shares it
> > with the workers. The leader then executes before statement triggers if
> > there exists any. Leader populates DSM chunks which includes the start
> > offset and chunk size, while populating the chunks it reads as many blocks
> > as required into the DSM data blocks from the file. Each block is of 64K
> > size. The leader parses the data to identify a chunk, the existing logic
> > from CopyReadLineText which identifies the chunks with some changes was
> > used for this. Leader checks if a free chunk is available to copy the
> > information, if there is no free chunk it waits till the required chunk is
> > freed up by the worker and then copies the identified chunks information
> > (offset & chunk size) into the DSM chunks. This process is repeated till
> > the complete file is processed. Simultaneously, the workers cache the
> > chunks(50) locally into the local memory and release the chunks to the
> > leader for further populating. Each worker processes the chunk that it
> > cached and inserts it into the table. The leader waits till all the chunks
> > populated are processed by the workers and exits.
>
> Why do we need the local copy of 50 chunks? Copying memory around is far
> from free. I don't see why it'd be better to add per-process caching,
> rather than making the DSM bigger? I can see some benefit in marking
> multiple chunks as being processed with one lock acquisition, but I
> don't think adding a memory copy is a good idea.
We ran performance tests with a 5.1GB csv data file of 10 million
tuples, with 2 indexes on integer columns; the results are given below.
We noticed that in some cases the performance is better if we copy the
50 records locally and release the shared memory. The benefit grows as
the number of workers increases. Thoughts?
-------------------------------------------------------------------------
Workers | Exec time (with local copying | Exec time (without copying,
        | 50 records & releasing the    | processing record by record)
        | shared memory)                |
-------------------------------------------------------------------------
 0      | 1162.772 (1X)                 | 1152.684 (1X)
 2      |  635.249 (1.83X)              |  647.894 (1.78X)
 4      |  336.835 (3.45X)              |  335.534 (3.43X)
 8      |  188.577 (6.17X)              |  189.461 (6.08X)
16      |  126.819 (9.17X)              |  142.730 (8.07X)
20      |  117.845 (9.87X)              |  146.533 (7.87X)
30      |  127.554 (9.11X)              |  160.307 (7.19X)
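For clarity, here is a hypothetical sketch of the worker-side caching that these numbers compare; the locking scheme and helper names are assumptions, not the patch's code:

    /*
     * A worker grabs up to WORKER_CHUNK_COUNT ring entries under one
     * lock acquisition, copies them into backend-local memory, and
     * releases the shared entries so the leader can reuse them while
     * the worker parses the cached chunks.
     */
    static int
    CacheChunksLocally(CopyState cstate)
    {
        ParallelCopyShmInfo *shm = cstate->pcdata->pcshared_info;
        int nc = 0;

        SpinLockAcquire(&shm->ring_lock);           /* assumed lock */
        while (nc < WORKER_CHUNK_COUNT && ChunkAvailable(shm))
        {
            cstate->pcdata->local_chunks[nc++] = *NextPopulatedChunk(shm);
            MarkChunkEntryReusable(shm);
        }
        SpinLockRelease(&shm->ring_lock);
        return nc;      /* number of chunks to parse locally */
    }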
> This patch *desperately* needs to be split up. It imo is close to
> unreviewable, due to a large amount of changes that just move code
> around without other functional changes being mixed in with the actual
> new stuff.
I have split the patch, the new split patches are attached.
>
>
>
> > /*
> > + * State of the chunk.
> > + */
> > +typedef enum ChunkState
> > +{
> > + CHUNK_INIT, /* initial state of chunk */
> > + CHUNK_LEADER_POPULATING, /* leader processing chunk */
> > + CHUNK_LEADER_POPULATED, /* leader completed populating chunk */
> > + CHUNK_WORKER_PROCESSING, /* worker processing chunk */
> > + CHUNK_WORKER_PROCESSED /* worker completed processing chunk */
> > +}ChunkState;
> > +
> > +#define RAW_BUF_SIZE 65536 /* we palloc RAW_BUF_SIZE+1 bytes */
> > +
> > +#define DATA_BLOCK_SIZE RAW_BUF_SIZE
> > +#define RINGSIZE (10 * 1000)
> > +#define MAX_BLOCKS_COUNT 1000
> > +#define WORKER_CHUNK_COUNT 50 /* should be mod of RINGSIZE */
> > +
> > +#define IsParallelCopy() (cstate->is_parallel)
> > +#define IsLeader() (cstate->pcdata->is_leader)
> > +#define IsHeaderLine() (cstate->header_line && cstate->cur_lineno == 1)
> > +
> > +/*
> > + * Copy data block information.
> > + */
> > +typedef struct CopyDataBlock
> > +{
> > + /* The number of unprocessed chunks in the current block. */
> > + pg_atomic_uint32 unprocessed_chunk_parts;
> > +
> > + /*
> > + * If the current chunk data is continued into another block,
> > + * following_block will have the position where the remaining data need to
> > + * be read.
> > + */
> > + uint32 following_block;
> > +
> > + /*
> > + * This flag will be set, when the leader finds out this block can be read
> > + * safely by the worker. This helps the worker to start processing the chunk
> > + * early where the chunk will be spread across many blocks and the worker
> > + * need not wait for the complete chunk to be processed.
> > + */
> > + bool curr_blk_completed;
> > + char data[DATA_BLOCK_SIZE + 1]; /* data read from file */
> > +}CopyDataBlock;
>
> What's the + 1 here about?
Fixed this, removed +1. That is not needed.
>
>
> > +/*
> > + * Parallel copy line buffer information.
> > + */
> > +typedef struct ParallelCopyLineBuf
> > +{
> > + StringInfoData line_buf;
> > + uint64 cur_lineno; /* line number for error messages */
> > +}ParallelCopyLineBuf;
>
> Why do we need separate infrastructure for this? We shouldn't duplicate
> infrastructure unnecessarily.
>
This was required for copying multiple records locally and releasing
the shared memory. I have not changed this; I will decide on it based
on the conclusion reached for one of the previous comments.
>
>
>
> > +/*
> > + * Common information that need to be copied to shared memory.
> > + */
> > +typedef struct CopyWorkerCommonData
> > +{
>
> Why is parallel specific stuff here suddenly not named ParallelCopy*
> anymore? If you introduce a naming like that it imo should be used
> consistently.
Fixed, changed to maintain ParallelCopy in all structs.
>
> > + /* low-level state data */
> > + CopyDest copy_dest; /* type of copy source/destination */
> > + int file_encoding; /* file or remote side's character encoding */
> > + bool need_transcoding; /* file encoding diff from server? */
> > + bool encoding_embeds_ascii; /* ASCII can be non-first byte? */
> > +
> > + /* parameters from the COPY command */
> > + bool csv_mode; /* Comma Separated Value format? */
> > + bool header_line; /* CSV header line? */
> > + int null_print_len; /* length of same */
> > + bool force_quote_all; /* FORCE_QUOTE *? */
> > + bool convert_selectively; /* do selective binary conversion? */
> > +
> > + /* Working state for COPY FROM */
> > + AttrNumber num_defaults;
> > + Oid relid;
> > +}CopyWorkerCommonData;
>
> But I actually think we shouldn't have this information in two different
> structs. This should exist once, independent of using parallel /
> non-parallel copy.
>
This structure helps in storing the common data from CopyStateData
that is required by the workers. This information is then allocated
and stored in the DSM, for each worker to retrieve and copy into its
CopyStateData.
>
> > +/* List information */
> > +typedef struct ListInfo
> > +{
> > + int count; /* count of attributes */
> > +
> > + /* string info in the form info followed by info1, info2... infon */
> > + char info[1];
> > +} ListInfo;
>
> Based on these comments I have no idea what this could be for.
>
I have added better comments for this. The following was added: this
structure helps in converting a List data type into the above
structure format, with count holding the number of elements in the
list and info holding the List elements appended contiguously. The
converted structure is allocated in shared memory and stored in the
DSM, for the worker to retrieve and later convert back into a List
data type.
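As a concrete illustration of that layout, here is a minimal sketch of flattening an integer List into the ListInfo format; it assumes int-sized elements and is illustrative only, not the patch's code:

    /*
     * Flatten an integer List into the ListInfo layout: count, followed
     * by the elements stored contiguously.  The result can be copied
     * into the DSM and rebuilt into a List by a worker.
     */
    static ListInfo *
    SerializeIntList(List *list)
    {
        Size      size = offsetof(ListInfo, info) +
                         list_length(list) * sizeof(int);
        ListInfo *linfo = (ListInfo *) palloc0(size);
        ListCell *lc;
        int       i = 0;

        linfo->count = list_length(list);
        foreach(lc, list)
            ((int *) linfo->info)[i++] = lfirst_int(lc);
        return linfo;
    }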
>
> > /*
> > - * This keeps the character read at the top of the loop in the buffer
> > - * even if there is more than one read-ahead.
> > + * This keeps the character read at the top of the loop in the buffer
> > + * even if there is more than one read-ahead.
> > + */
> > +#define IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(extralen) \
> > +if (1) \
> > +{ \
> > + if (copy_buff_state.raw_buf_ptr + (extralen) >= copy_buff_state.copy_buf_len && !hit_eof) \
> > + { \
> > + if (IsParallelCopy()) \
> > + { \
> > + copy_buff_state.chunk_size = prev_chunk_size; /* update previous chunk size */ \
> > + if (copy_buff_state.block_switched) \
> > + { \
> > + pg_atomic_sub_fetch_u32(&copy_buff_state.data_blk_ptr->unprocessed_chunk_parts, 1); \
> > + copy_buff_state.copy_buf_len = prev_copy_buf_len; \
> > + } \
> > + } \
> > + copy_buff_state.raw_buf_ptr = prev_raw_ptr; /* undo fetch */ \
> > + need_data = true; \
> > + continue; \
> > + } \
> > +} else ((void) 0)
>
> I think it's an absolutely clear no-go to add new branches to
> these. They're *really* hot already, and this is going to sprinkle a
> significant amount of new instructions over a lot of places.
>
Fixed, removed this.
>
>
> > +/*
> > + * SET_RAWBUF_FOR_LOAD - Set raw_buf to the shared memory where the file data must
> > + * be read.
> > + */
> > +#define SET_RAWBUF_FOR_LOAD() \
> > +{ \
> > + ShmCopyInfo *pcshared_info = cstate->pcdata->pcshared_info; \
> > + uint32 cur_block_pos; \
> > + /* \
> > + * Mark the previous block as completed, worker can start copying this data. \
> > + */ \
> > + if (copy_buff_state.data_blk_ptr != copy_buff_state.curr_data_blk_ptr && \
> > + copy_buff_state.data_blk_ptr->curr_blk_completed == false) \
> > + copy_buff_state.data_blk_ptr->curr_blk_completed = true; \
> > + \
> > + copy_buff_state.data_blk_ptr = copy_buff_state.curr_data_blk_ptr; \
> > + cur_block_pos = WaitGetFreeCopyBlock(pcshared_info); \
> > + copy_buff_state.curr_data_blk_ptr = &pcshared_info->data_blocks[cur_block_pos]; \
> > + \
> > + if (!copy_buff_state.data_blk_ptr) \
> > + { \
> > + copy_buff_state.data_blk_ptr = copy_buff_state.curr_data_blk_ptr; \
> > + chunk_first_block = cur_block_pos; \
> > + } \
> > + else if (need_data == false) \
> > + copy_buff_state.data_blk_ptr->following_block = cur_block_pos; \
> > + \
> > + cstate->raw_buf = copy_buff_state.curr_data_blk_ptr->data; \
> > + copy_buff_state.copy_raw_buf = cstate->raw_buf; \
> > +}
> > +
> > +/*
> > + * END_CHUNK_PARALLEL_COPY - Update the chunk information in shared memory.
> > + */
> > +#define END_CHUNK_PARALLEL_COPY() \
> > +{ \
> > + if (!IsHeaderLine()) \
> > + { \
> > + ShmCopyInfo *pcshared_info = cstate->pcdata->pcshared_info; \
> > + ChunkBoundaries *chunkBoundaryPtr = &pcshared_info->chunk_boundaries; \
> > + if (copy_buff_state.chunk_size) \
> > + { \
> > + ChunkBoundary *chunkInfo = &chunkBoundaryPtr->ring[chunk_pos]; \
> > + /* \
> > + * If raw_buf_ptr is zero, unprocessed_chunk_parts would have been \
> > + * incremented in SEEK_COPY_BUFF_POS. This will happen if the whole \
> > + * chunk finishes at the end of the current block. If the \
> > + * new_line_size > raw_buf_ptr, then the new block has only new line \
> > + * char content. The unprocessed count should not be increased in \
> > + * this case. \
> > + */ \
> > + if (copy_buff_state.raw_buf_ptr != 0 && \
> > + copy_buff_state.raw_buf_ptr > new_line_size) \
> > + pg_atomic_add_fetch_u32(&copy_buff_state.curr_data_blk_ptr->unprocessed_chunk_parts, 1); \
> > + \
> > + /* Update chunk size. */ \
> > + pg_atomic_write_u32(&chunkInfo->chunk_size, copy_buff_state.chunk_size); \
> > + pg_atomic_write_u32(&chunkInfo->chunk_state, CHUNK_LEADER_POPULATED); \
> > + elog(DEBUG1, "[Leader] After adding - chunk position:%d, chunk_size:%d", \
> > + chunk_pos, copy_buff_state.chunk_size); \
> > + pcshared_info->populated++; \
> > + } \
> > + else if (new_line_size) \
> > + { \
> > + /* \
> > + * This means only new line char, empty record should be \
> > + * inserted. \
> > + */ \
> > + ChunkBoundary *chunkInfo; \
> > + chunk_pos = UpdateBlockInChunkInfo(cstate, -1, -1, 0, \
> > + CHUNK_LEADER_POPULATED); \
> > + chunkInfo = &chunkBoundaryPtr->ring[chunk_pos]; \
> > + elog(DEBUG1, "[Leader] Added empty chunk with offset:%d, chunk position:%d, chunk size:%d", \
> > + chunkInfo->start_offset, chunk_pos, \
> > + pg_atomic_read_u32(&chunkInfo->chunk_size)); \
> > + pcshared_info->populated++; \
> > + } \
> > + }\
> > + \
> > + /*\
> > + * All of the read data is processed, reset index & len. In the\
> > + * subsequent read, we will get a new block and copy data in to the\
> > + * new block.\
> > + */\
> > + if (copy_buff_state.raw_buf_ptr == copy_buff_state.copy_buf_len)\
> > + {\
> > + cstate->raw_buf_index = 0;\
> > + cstate->raw_buf_len = 0;\
> > + }\
> > + else\
> > + cstate->raw_buf_len = copy_buff_state.copy_buf_len;\
> > +}
>
> Why are these macros? They are way way way above a length where that
> makes any sort of sense.
>
Converted these macros to functions.
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Thanks Ashutosh for your review, my comments are inline.

On Fri, Jun 19, 2020 at 5:41 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hi,
>
> I just got some time to review the first patch in the list i.e. 0001-Copy-code-readjustment-to-support-parallel-copy.patch. As the patch name suggests, it is just trying to reshuffle the existing code for the COPY command here and there. There are no extra changes added in the patch as such, but I still do have some review comments, please have a look:
>
> 1) Can you please add some comments atop the new function PopulateAttributes() describing its functionality in detail? Further, this new function contains the code from BeginCopy() to set attribute-level options used with COPY FROM such as FORCE_QUOTE, FORCE_NOT_NULL, FORCE_NULL etc. in cstate, and along with that it also copies the code from BeginCopy() to set other info such as the client encoding type, encoding conversion etc. Hence, I think it would be good to give it a better name, basically something that matches what it actually does.
>

There is no new code added in this function; some part of the code from
BeginCopy was made into a new function, as this part of the code will
also be required for the parallel copy workers before they start the
actual copy operation. This code was made into a function to avoid
duplication. Changed the function name to PopulateGlobalsForCopyFrom and
added a few comments.

> 2) Again, the name for the new function CheckCopyFromValidity() doesn't look good to me. From the function name it appears as if it does the sanity check of the entire COPY FROM command, but actually it is just doing the sanity check for the target relation specified with COPY FROM. So, probably something like CheckTargetRelValidity would look more sensible, I think? TBH, I am not good at naming functions so you can always ignore my suggestions about function and variable names :)
>

Changed as suggested.

> 3) Any reason for not making CheckCopyFromValidity a macro instead of a new function? It is just doing the sanity check for the target relation.
>

I felt there is a reasonable number of lines in the function and it is
not in a performance-intensive path, so I preferred a function over a
macro. Your thoughts?

> 4) Earlier, in the CopyReadLine() function, while trying to clear the EOL marker from cstate->line_buf.data (copied data), we were not checking whether the line read by the CopyReadLineText() function is a header line, but I can see that your patch checks that before clearing the EOL marker. Any reason for this extra check?
>

If you see the caller of CopyReadLine, i.e. NextCopyFromRawFields, it
does nothing for the header line; the server basically calls
CopyReadLine again. It is a kind of small optimization: the server is
not going to do anything with the header line anyway, so I felt there is
no need to clear the EOL marker for header lines.

/* on input just throw the header line away */
if (cstate->cur_lineno == 0 && cstate->header_line)
{
    cstate->cur_lineno++;
    if (CopyReadLine(cstate))
        return false;   /* done */
}

cstate->cur_lineno++;

/* Actually read the line into memory here */
done = CopyReadLine(cstate);

I think there is no need to make a fix for this. Your thoughts?

> 5) I noticed the below spurious line removal in the patch.
>
> @@ -3839,7 +3953,6 @@ static bool
> CopyReadLine(CopyState cstate)
> {
> bool result;
> -
>

Fixed.

I have attached the patch for the same with the fixes.
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Tue, Jun 23, 2020 at 8:07 AM vignesh C <vignesh21@gmail.com> wrote:
> I have attached the patch for the same with the fixes.
The patches were not applying on head; attached are the patches that can be applied on head.
I have added a commitfest entry[1] for this feature.

[1] - https://commitfest.postgresql.org/28/2610/
On Tue, Jun 23, 2020 at 8:07 AM vignesh C <vignesh21@gmail.com> wrote:
Thanks Ashutosh for your review, my comments are inline.
On Fri, Jun 19, 2020 at 5:41 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hi,
>
> I just got some time to review the first patch in the list i.e. 0001-Copy-code-readjustment-to-support-parallel-copy.patch. As the patch name suggests, it is just trying to reshuffle the existing code for COPY command here and there. There is no extra changes added in the patch as such, but still I do have some review comments, please have a look:
>
> 1) Can you please add some comments atop the new function PopulateAttributes() describing its functionality in detail. Further, this new function contains the code from BeginCopy() to set attribute level options used with COPY FROM such as FORCE_QUOTE, FORCE_NOT_NULL, FORCE_NULL etc. in cstate and along with that it also copies the code from BeginCopy() to set other infos such as client encoding type, encoding conversion etc. Hence, I think it would be good to give it some better name, basically something that matches with what actually it is doing.
>
There is no new code added in this function, some part of code from
BeginCopy was made in to a new function as this part of code will also
be required for the parallel copy workers before the workers start the
actual copy operation. This code was made into a function to avoid
duplication. Changed the function name to PopulateGlobalsForCopyFrom &
added few comments.
> 2) Again, the name for the new function CheckCopyFromValidity() doesn't look good to me. From the function name it appears as if it does the sanity check of the entire COPY FROM command, but actually it is just doing the sanity check for the target relation specified with COPY FROM. So, probably something like CheckTargetRelValidity would look more sensible, I think? TBH, I am not good at naming the functions so you can always ignore my suggestions about function and variable names :)
>
Changed as suggested.
> 3) Any reason for not making CheckCopyFromValidity as a macro instead of a new function. It is just doing the sanity check for the target relation.
>
I felt there is reasonable number of lines in the function & it is not
in performance intensive path, so I preferred function over macro.
Your thoughts?
> 4) Earlier in CopyReadLine() function while trying to clear the EOL marker from cstate->line_buf.data (copied data), we were not checking if the line read by CopyReadLineText() function is a header line or not, but I can see that your patch checks that before clearing the EOL marker. Any reason for this extra check?
>
If you see the caller of CopyReadLine, i.e. NextCopyFromRawFields does
nothing for the header line, server basically calls CopyReadLine
again, it is a kind of small optimization. Anyway server is not going
to do anything with header line, I felt no need to clear EOL marker
for header lines.
/* on input just throw the header line away */
if (cstate->cur_lineno == 0 && cstate->header_line)
{
cstate->cur_lineno++;
if (CopyReadLine(cstate))
return false; /* done */
}
cstate->cur_lineno++;
/* Actually read the line into memory here */
done = CopyReadLine(cstate);
I think no need to make a fix for this. Your thoughts?
> 5) I noticed the below spurious line removal in the patch.
>
> @@ -3839,7 +3953,6 @@ static bool
> CopyReadLine(CopyState cstate)
> {
> bool result;
> -
>
Fixed.
I have attached the patch for the same with the fixes.
Thoughts?
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Attachment
- 0001-Copy-code-readjustment-to-support-parallel-copy.patch
- 0004-Documentation-for-parallel-copy.patch
- 0003-Allow-copy-from-command-to-process-data-from-file-ST.patch
- 0002-Framework-for-leader-worker-in-parallel-copy.patch
- 0005-Tests-for-parallel-copy.patch
- 0006-Parallel-Copy-For-Binary-Format-Files.patch
Hi,

Thanks Vignesh for reviewing the parallel copy for binary format files patch. I tried to address the comments in the attached patch (0006-Parallel-Copy-For-Binary-Format-Files.patch).

On Thu, Jun 18, 2020 at 6:42 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Mon, Jun 15, 2020 at 4:39 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > The above tests were run with the configuration attached config.txt, which is the same used for performance tests of csv/text files posted earlier in this mail chain.
> >
> > Request the community to take this patch up for review along with the parallel copy for csv/text file patches and provide feedback.
> >
>
> I had reviewed the patch, a few comments:
>
> The new members added should be present in ParallelCopyData

Added to ParallelCopyData.

> line_size can be set as and when we process the tuple from
> CopyReadBinaryTupleLeader and this can be set at the end. That way the
> above code can be removed.

curr_tuple_start_info and curr_tuple_end_info are now local variables in
CopyReadBinaryTupleLeader, and the line size calculation code has moved
to CopyReadBinaryAttributeLeader.

> curr_block_pos variable is present in ParallelCopyShmInfo, we could
> use it and remove from here.
> curr_data_offset, similar variable raw_buf_index is present in
> CopyStateData, we could use it and remove from here.

Yes, making use of them now.

> This code is duplicate in CopyReadBinaryTupleLeader &
> CopyReadBinaryAttributeLeader. We could make a function and re-use.

Added a new function AdjustFieldInfo.

> column_no is not used, it can be removed

Removed.

> The above code is present in NextCopyFrom & CopyReadBinaryTupleLeader,
> check if we can make a common function or we could use NextCopyFrom as
> it is.

Added a macro CHECK_FIELD_COUNT.

> + if (fld_count == -1)
> + {
> +     return true;
> + }
>
> Should this be an assert in CopyReadBinaryTupleWorker function as this
> check is already done in the leader.

This check in the leader signifies the end of the file. For the workers,
eof is when GetLinePosition() returns -1:

line_pos = GetLinePosition(cstate);
if (line_pos == -1)
    return true;

In case if (fld_count == -1) is encountered in the worker, the worker
should just return true from CopyReadBinaryTupleWorker, marking eof.
Having this as an assert doesn't serve the purpose, I feel.

Along with the review-comments-addressed patch
(0006-Parallel-Copy-For-Binary-Format-Files.patch), also attaching all
other latest series of patches (0001 to 0005) from [1]; the order of
applying patches is from 0001 to 0006.

[1] https://www.postgresql.org/message-id/CALDaNm0H3N9gK7CMheoaXkO99g%3DuAPA93nSZXu0xDarPyPY6sg%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Attachment
- 0006-Parallel-Copy-For-Binary-Format-Files.patch
- 0001-Copy-code-readjustment-to-support-parallel-copy.patch
- 0002-Framework-for-leader-worker-in-parallel-copy.patch
- 0003-Allow-copy-from-command-to-process-data-from-file-ST.patch
- 0004-Documentation-for-parallel-copy.patch
- 0005-Tests-for-parallel-copy.patch
Hi,

It looks like the parsing of the newly introduced "PARALLEL" option for the COPY FROM command has an issue (in 0002-Framework-for-leader-worker-in-parallel-copy.patch): mentioning ....PARALLEL '4ar2eteid'); would pass with 4 workers, since atoi() is being used for converting the string to an integer, and it just returns 4, ignoring the trailing characters.

I used strtol(), added error checks, and introduced the error "improper use of argument to option "parallel"" for the above cases.

parallel '4ar2eteid');
ERROR:  improper use of argument to option "parallel"
LINE 5: parallel '1\');

Along with the updated patch 0002-Framework-for-leader-worker-in-parallel-copy.patch, also attaching all the latest patches from [1].

[1] - https://www.postgresql.org/message-id/CALj2ACW94icER3WrWapon7JkcX8j0TGRue5ycWMTEvgA3X7fOg%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

On Tue, Jun 23, 2020 at 12:22 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Tue, Jun 23, 2020 at 8:07 AM vignesh C <vignesh21@gmail.com> wrote:
> > I have attached the patch for the same with the fixes.
>
> The patches were not applying on head; attached are the patches that can be applied on head.
> I have added a commitfest entry[1] for this feature.
>
> [1] - https://commitfest.postgresql.org/28/2610/
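For reference, a hedged sketch of the strict parsing described above; the option plumbing and the upper bound are assumptions, not the exact patch code:

    /* Parse the PARALLEL option strictly, rejecting trailing garbage. */
    char   *str = defGetString(defel);          /* e.g. "4ar2eteid" */
    char   *endptr;
    long    nworkers;

    errno = 0;
    nworkers = strtol(str, &endptr, 10);
    if (errno != 0 || endptr == str || *endptr != '\0' ||
        nworkers <= 0 || nworkers > 1024)       /* limit assumed */
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("improper use of argument to option \"parallel\"")));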
On Wed, Jun 24, 2020 at 2:16 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Hi,
>
> It looks like the parsing of the newly introduced "PARALLEL" option for
> the COPY FROM command has an issue (in
> 0002-Framework-for-leader-worker-in-parallel-copy.patch): mentioning
> ....PARALLEL '4ar2eteid'); would pass with 4 workers, since atoi() is
> being used for converting the string to an integer, and it just returns
> 4, ignoring the trailing characters.
>
> I used strtol(), added error checks, and introduced the error "improper
> use of argument to option "parallel"" for the above cases.
>

I'm sorry, I forgot to attach the patches. Here is the latest series of patches.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Attachment
- 0001-Copy-code-readjustment-to-support-parallel-copy.patch
- 0002-Framework-for-leader-worker-in-parallel-copy.patch
- 0003-Allow-copy-from-command-to-process-data-from-file-ST.patch
- 0004-Documentation-for-parallel-copy.patch
- 0005-Tests-for-parallel-copy.patch
- 0006-Parallel-Copy-For-Binary-Format-Files.patch
Hi,

0006 patch has some code clean-up and issue fixes found during internal testing.

Attaching the latest patches herewith. The order of applying the patches remains the same, i.e. from 0001 to 0006.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Attachment
- 0001-Copy-code-readjustment-to-support-parallel-copy.patch
- 0002-Framework-for-leader-worker-in-parallel-copy.patch
- 0003-Allow-copy-from-command-to-process-data-from-file-ST.patch
- 0004-Documentation-for-parallel-copy.patch
- 0005-Tests-for-parallel-copy.patch
- 0006-Parallel-Copy-For-Binary-Format-Files.patch
Hi,

I have made a few changes in the 0003 & 0005 patches; there were a couple of bugs in the 0003 patch and some random test failures in the 0005 patch. Attached are new patches which include the fixes for the same.

Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com

On Fri, Jun 26, 2020 at 2:34 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Hi,
>
> 0006 patch has some code clean up and issue fixes found during internal testing.
>
> Attaching the latest patches herewith.
>
> The order of applying the patches remains the same i.e. from 0001 to 0006.
>
> With Regards,
> Bharath Rupireddy.
> EnterpriseDB: http://www.enterprisedb.com
Attachment
- 0001-Copy-code-readjustment-to-support-parallel-copy.patch
- 0002-Framework-for-leader-worker-in-parallel-copy.patch
- 0003-Allow-copy-from-command-to-process-data-from-file-ST.patch
- 0004-Documentation-for-parallel-copy.patch
- 0005-Tests-for-parallel-copy.patch
- 0006-Parallel-Copy-For-Binary-Format-Files.patch
On Wed, Jul 1, 2020 at 2:46 PM vignesh C <vignesh21@gmail.com> wrote:
>
> Hi,
>
> I have made few changes in 0003 & 0005 patch, there were a couple of
> bugs in 0003 patch & some random test failures in 0005 patch.
> Attached new patches which include the fixes for the same.
I have made changes in the 0003 patch to remove the changes made in pqmq.c for the parallel worker error handling hang issue. This is being discussed separately in email [1], as it is a bug in head. The rest of the patches have no changes.
Attachment
- 0001-Copy-code-readjustment-to-support-parallel-copy.patch
- 0002-Framework-for-leader-worker-in-parallel-copy.patch
- 0003-Allow-copy-from-command-to-process-data-from-file-ST.patch
- 0004-Documentation-for-parallel-copy.patch
- 0005-Tests-for-parallel-copy.patch
- 0006-Parallel-Copy-For-Binary-Format-Files.patch
On Wed, Jun 24, 2020 at 1:41 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Along with the review comments addressed
> patch(0006-Parallel-Copy-For-Binary-Format-Files.patch) also attaching
> all other latest series of patches(0001 to 0005) from [1], the order
> of applying patches is from 0001 to 0006.
>
> [1] https://www.postgresql.org/message-id/CALDaNm0H3N9gK7CMheoaXkO99g%3DuAPA93nSZXu0xDarPyPY6sg%40mail.gmail.com
>

Some comments:

+ movebytes = DATA_BLOCK_SIZE - cstate->raw_buf_index;
+
+ cstate->pcdata->curr_data_block->skip_bytes = movebytes;
+
+ data_block = &pcshared_info->data_blocks[block_pos];
+
+ if (movebytes > 0)
+     memmove(&data_block->data[0], &cstate->pcdata->curr_data_block->data[cstate->raw_buf_index],
+             movebytes);

We can create a local variable and use it in place of
cstate->pcdata->curr_data_block.

+ if (cstate->raw_buf_index + sizeof(fld_count) >= (DATA_BLOCK_SIZE - 1))
+     AdjustFieldInfo(cstate, 1);
+
+ memcpy(&fld_count, &cstate->pcdata->curr_data_block->data[cstate->raw_buf_index], sizeof(fld_count));

Should this be like below, as the remaining size can fit in the current block?
if (cstate->raw_buf_index + sizeof(fld_count) >= DATA_BLOCK_SIZE)

+ if ((cstate->raw_buf_index + sizeof(fld_size)) >= (DATA_BLOCK_SIZE - 1))
+ {
+     AdjustFieldInfo(cstate, 2);
+     *new_block_pos = pcshared_info->cur_block_pos;
+ }

Same as above.

+ movebytes = DATA_BLOCK_SIZE - cstate->raw_buf_index;
+
+ cstate->pcdata->curr_data_block->skip_bytes = movebytes;
+
+ data_block = &pcshared_info->data_blocks[block_pos];
+
+ if (movebytes > 0)

Instead of the above check, we can have an assert check for movebytes.

+ if (mode == 1)
+ {
+     cstate->pcdata->curr_data_block = data_block;
+     cstate->raw_buf_index = 0;
+ }
+ else if(mode == 2)
+ {
+     ParallelCopyDataBlock *prev_data_block = NULL;
+     prev_data_block = cstate->pcdata->curr_data_block;
+     prev_data_block->following_block = block_pos;
+     cstate->pcdata->curr_data_block = data_block;
+
+     if (prev_data_block->curr_blk_completed == false)
+         prev_data_block->curr_blk_completed = true;
+
+     cstate->raw_buf_index = 0;
+ }

This code is common to both branches; keep it in the common flow and
remove the if (mode == 1):
cstate->pcdata->curr_data_block = data_block;
cstate->raw_buf_index = 0;

+#define CHECK_FIELD_COUNT \
+{\
+    if (fld_count == -1) \
+    { \
+        if (IsParallelCopy() && \
+            !IsLeader()) \
+            return true; \
+        else if (IsParallelCopy() && \
+            IsLeader()) \
+        { \
+            if (cstate->pcdata->curr_data_block->data[cstate->raw_buf_index + sizeof(fld_count)] != 0) \
+                ereport(ERROR, \
+                        (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), \
+                         errmsg("received copy data after EOF marker"))); \
+            return true; \
+        } \

We only copy sizeof(fld_count). Shouldn't we check fld_count !=
cstate->max_fields? Am I missing something here?

+ if ((cstate->raw_buf_index + sizeof(fld_size)) >= (DATA_BLOCK_SIZE - 1))
+ {
+     AdjustFieldInfo(cstate, 2);
+     *new_block_pos = pcshared_info->cur_block_pos;
+ }
+
+ memcpy(&fld_size, &cstate->pcdata->curr_data_block->data[cstate->raw_buf_index], sizeof(fld_size));
+
+ cstate->raw_buf_index = cstate->raw_buf_index + sizeof(fld_size);
+
+ fld_size = (int32) pg_ntoh32(fld_size);
+
+ if (fld_size == 0)
+     ereport(ERROR,
+             (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
+              errmsg("unexpected EOF in COPY data")));
+
+ if (fld_size < -1)
+     ereport(ERROR,
+             (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
+              errmsg("invalid field size")));
+
+ if ((DATA_BLOCK_SIZE - cstate->raw_buf_index) >= fld_size)
+ {
+     cstate->raw_buf_index = cstate->raw_buf_index + fld_size;
+ }

We can keep the check like cstate->raw_buf_index + fld_size < ..., for
better readability and consistency.

+static pg_attribute_always_inline void
+CopyReadBinaryAttributeLeader(CopyState cstate, FmgrInfo *flinfo,
+    Oid typioparam, int32 typmod, uint32 *new_block_pos,
+    int m, ParallelCopyTupleInfo *tuple_start_info_ptr,
+    ParallelCopyTupleInfo *tuple_end_info_ptr, uint32 *line_size)

flinfo, typioparam & typmod are not used; we can remove these parameters.

+static pg_attribute_always_inline void
+CopyReadBinaryAttributeLeader(CopyState cstate, FmgrInfo *flinfo,
+    Oid typioparam, int32 typmod, uint32 *new_block_pos,
+    int m, ParallelCopyTupleInfo *tuple_start_info_ptr,
+    ParallelCopyTupleInfo *tuple_end_info_ptr, uint32 *line_size)

I felt this function need not be an inline function.

+ /* binary format */
+ /* for paralle copy leader, fill in the error

There are some typos; run a spell check.

+ /* raw_buf_index should never cross data block size,
+  * as the required number of data blocks would have
+  * been obtained in the above while loop.
+  */

In a few places the commenting style should be changed to the postgres
style.

+ if (cstate->pcdata->curr_data_block == NULL)
+ {
+     block_pos = WaitGetFreeCopyBlock(pcshared_info);
+
+     cstate->pcdata->curr_data_block = &pcshared_info->data_blocks[block_pos];
+
+     cstate->raw_buf_index = 0;
+
+     readbytes = CopyGetData(cstate, &cstate->pcdata->curr_data_block->data, 1, DATA_BLOCK_SIZE);
+
+     elog(DEBUG1, "LEADER - bytes read from file %d", readbytes);
+
+     if (cstate->reached_eof)
+         return true;
+ }

There are many empty lines; these are not required.

+ if (cstate->raw_buf_index + sizeof(fld_count) >= (DATA_BLOCK_SIZE - 1))
+     AdjustFieldInfo(cstate, 1);
+
+ memcpy(&fld_count, &cstate->pcdata->curr_data_block->data[cstate->raw_buf_index], sizeof(fld_count));
+
+ fld_count = (int16) pg_ntoh16(fld_count);
+
+ CHECK_FIELD_COUNT;
+
+ cstate->raw_buf_index = cstate->raw_buf_index + sizeof(fld_count);
+ new_block_pos = pcshared_info->cur_block_pos;

You can run pg_indent once for the changes.

+ if (mode == 1)
+ {
+     cstate->pcdata->curr_data_block = data_block;
+     cstate->raw_buf_index = 0;
+ }
+ else if(mode == 2)
+ {

Could use macros for 1 & 2, for better readability.

+ if (tuple_start_info_ptr->block_id == tuple_end_info_ptr->block_id)
+ {
+     elog(DEBUG1,"LEADER - tuple lies in a single data block");
+
+     *line_size = tuple_end_info_ptr->offset - tuple_start_info_ptr->offset + 1;
+     pg_atomic_add_fetch_u32(&pcshared_info->data_blocks[tuple_start_info_ptr->block_id].unprocessed_line_parts, 1);
+ }
+ else
+ {
+     uint32 following_block_id = pcshared_info->data_blocks[tuple_start_info_ptr->block_id].following_block;
+
+     elog(DEBUG1,"LEADER - tuple is spread across data blocks");
+
+     *line_size = DATA_BLOCK_SIZE - tuple_start_info_ptr->offset -
+                  pcshared_info->data_blocks[tuple_start_info_ptr->block_id].skip_bytes;
+
+     pg_atomic_add_fetch_u32(&pcshared_info->data_blocks[tuple_start_info_ptr->block_id].unprocessed_line_parts, 1);
+
+     while (following_block_id != tuple_end_info_ptr->block_id)
+     {
+         *line_size = *line_size + DATA_BLOCK_SIZE - pcshared_info->data_blocks[following_block_id].skip_bytes;
+
+         pg_atomic_add_fetch_u32(&pcshared_info->data_blocks[following_block_id].unprocessed_line_parts, 1);
+
+         following_block_id = pcshared_info->data_blocks[following_block_id].following_block;
+
+         if (following_block_id == -1)
+             break;
+     }
+
+     if (following_block_id != -1)
+         pg_atomic_add_fetch_u32(&pcshared_info->data_blocks[following_block_id].unprocessed_line_parts, 1);
+
+     *line_size = *line_size + tuple_end_info_ptr->offset + 1;
+ }

We could calculate the size as we parse and identify one record; if we
do it that way, this can be removed.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Thanks Vignesh for the review. Addressed the comments in the 0006 patch.

>
> we can create a local variable and use in place of
> cstate->pcdata->curr_data_block.

Done.

> + if (cstate->raw_buf_index + sizeof(fld_count) >= (DATA_BLOCK_SIZE - 1))
> +     AdjustFieldInfo(cstate, 1);
> +
> + memcpy(&fld_count, &cstate->pcdata->curr_data_block->data[cstate->raw_buf_index], sizeof(fld_count));
>
> Should this be like below, as the remaining size can fit in the current block?
> if (cstate->raw_buf_index + sizeof(fld_count) >= DATA_BLOCK_SIZE)
>
> + if ((cstate->raw_buf_index + sizeof(fld_size)) >= (DATA_BLOCK_SIZE - 1))
> + {
> +     AdjustFieldInfo(cstate, 2);
> +     *new_block_pos = pcshared_info->cur_block_pos;
> + }
>
> Same as above.

Yes, you are right. Changed.

> + movebytes = DATA_BLOCK_SIZE - cstate->raw_buf_index;
> +
> + cstate->pcdata->curr_data_block->skip_bytes = movebytes;
> +
> + data_block = &pcshared_info->data_blocks[block_pos];
> +
> + if (movebytes > 0)
>
> Instead of the above check, we can have an assert check for movebytes.

No, we can't use an assert here. For the edge case where the current
data block is filled to the full DATA_BLOCK_SIZE, movebytes will be 0,
but we still need to get a new data block. The movebytes > 0 check lets
us avoid the memmove in that case.

> + if (mode == 1)
> + {
> +     cstate->pcdata->curr_data_block = data_block;
> +     cstate->raw_buf_index = 0;
> + }
> + else if(mode == 2)
> + {
> +     ParallelCopyDataBlock *prev_data_block = NULL;
> +     prev_data_block = cstate->pcdata->curr_data_block;
> +     prev_data_block->following_block = block_pos;
> +     cstate->pcdata->curr_data_block = data_block;
> +
> +     if (prev_data_block->curr_blk_completed == false)
> +         prev_data_block->curr_blk_completed = true;
> +
> +     cstate->raw_buf_index = 0;
> + }
>
> This code is common to both branches; keep it in the common flow and
> remove the if (mode == 1):
> cstate->pcdata->curr_data_block = data_block;
> cstate->raw_buf_index = 0;

Done.

> +#define CHECK_FIELD_COUNT \
> +{\
> +    if (fld_count == -1) \
> +    { \
> +        if (IsParallelCopy() && \
> +            !IsLeader()) \
> +            return true; \
> +        else if (IsParallelCopy() && \
> +            IsLeader()) \
> +        { \
> +            if (cstate->pcdata->curr_data_block->data[cstate->raw_buf_index + sizeof(fld_count)] != 0) \
> +                ereport(ERROR, \
> +                        (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), \
> +                         errmsg("received copy data after EOF marker"))); \
> +            return true; \
> +        } \
>
> We only copy sizeof(fld_count). Shouldn't we check fld_count !=
> cstate->max_fields? Am I missing something here?

The fld_count != cstate->max_fields check is done after the above checks.

> + if ((DATA_BLOCK_SIZE - cstate->raw_buf_index) >= fld_size)
> + {
> +     cstate->raw_buf_index = cstate->raw_buf_index + fld_size;
> + }
>
> We can keep the check like cstate->raw_buf_index + fld_size < ..., for
> better readability and consistency.

I think this is okay as it is: it conveys that if the bytes available in
the current data block are greater than or equal to fld_size, then the
tuple lies in the current data block.

> +static pg_attribute_always_inline void
> +CopyReadBinaryAttributeLeader(CopyState cstate, FmgrInfo *flinfo,
> +    Oid typioparam, int32 typmod, uint32 *new_block_pos,
> +    int m, ParallelCopyTupleInfo *tuple_start_info_ptr,
> +    ParallelCopyTupleInfo *tuple_end_info_ptr, uint32 *line_size)
>
> flinfo, typioparam & typmod are not used; we can remove these parameters.
>

Done.

> +static pg_attribute_always_inline void
> +CopyReadBinaryAttributeLeader(CopyState cstate, FmgrInfo *flinfo,
> +    Oid typioparam, int32 typmod, uint32 *new_block_pos,
> +    int m, ParallelCopyTupleInfo *tuple_start_info_ptr,
> +    ParallelCopyTupleInfo *tuple_end_info_ptr, uint32 *line_size)
>
> I felt this function need not be an inline function.

Yes. Changed.

> + /* binary format */
> + /* for paralle copy leader, fill in the error
>
> There are some typos; run a spell check.

Done.

> + /* raw_buf_index should never cross data block size,
> +  * as the required number of data blocks would have
> +  * been obtained in the above while loop.
> +  */
>
> In a few places the commenting style should be changed to the postgres style.

Changed.

> + if (cstate->pcdata->curr_data_block == NULL)
> + {
> +     block_pos = WaitGetFreeCopyBlock(pcshared_info);
> +
> +     cstate->pcdata->curr_data_block = &pcshared_info->data_blocks[block_pos];
> +
> +     cstate->raw_buf_index = 0;
> +
> +     readbytes = CopyGetData(cstate, &cstate->pcdata->curr_data_block->data, 1, DATA_BLOCK_SIZE);
> +
> +     elog(DEBUG1, "LEADER - bytes read from file %d", readbytes);
> +
> +     if (cstate->reached_eof)
> +         return true;
> + }
>
> There are many empty lines; these are not required.
>

Removed.

> +
> + fld_count = (int16) pg_ntoh16(fld_count);
> +
> + CHECK_FIELD_COUNT;
> +
> + cstate->raw_buf_index = cstate->raw_buf_index + sizeof(fld_count);
> + new_block_pos = pcshared_info->cur_block_pos;
>
> You can run pg_indent once for the changes.
>

I ran pg_indent and observed that there are many places getting modified
by pg_indent. If we need to run pg_indent on copy.c for parallel copy
alone, then first we need to run it on plain copy.c and take those
changes, and then run it for all the parallel copy files. I think we had
better run pg_indent for all the parallel copy patches once and for all,
maybe just before we finish up all the code reviews.

> + if (mode == 1)
> + {
> +     cstate->pcdata->curr_data_block = data_block;
> +     cstate->raw_buf_index = 0;
> + }
> + else if(mode == 2)
> + {
>
> Could use macros for 1 & 2, for better readability.

Done.

> +
> + if (following_block_id == -1)
> +     break;
> + }
> +
> + if (following_block_id != -1)
> +     pg_atomic_add_fetch_u32(&pcshared_info->data_blocks[following_block_id].unprocessed_line_parts, 1);
> +
> + *line_size = *line_size + tuple_end_info_ptr->offset + 1;
> + }
>
> We could calculate the size as we parse and identify one record; if we
> do it that way, this can be removed.
>

Done.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
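To make the movebytes edge case above concrete, here is a short annotated excerpt based on the hunk quoted in this review; the declarations are assumed from the surrounding patch:

    uint32 movebytes;

    /*
     * Carry any partial field header over to the head of the next
     * shared data block.  If the current block ended exactly on the
     * boundary, movebytes is 0: a new block is still needed, but there
     * is nothing to carry over, so the memmove is skipped.
     */
    movebytes = DATA_BLOCK_SIZE - cstate->raw_buf_index;
    curr_data_block->skip_bytes = movebytes;
    data_block = &pcshared_info->data_blocks[block_pos];
    if (movebytes > 0)
        memmove(&data_block->data[0],
                &curr_data_block->data[cstate->raw_buf_index],
                movebytes);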
Attachment
- 0001-Copy-code-readjustment-to-support-parallel-copy.patch
- 0002-Framework-for-leader-worker-in-parallel-copy.patch
- 0003-Allow-copy-from-command-to-process-data-from-file-ST.patch
- 0004-Documentation-for-parallel-copy.patch
- 0005-Tests-for-parallel-copy.patch
- 0006-Parallel-Copy-For-Binary-Format-Files.patch
On Sat, 11 Jul 2020 at 08:55, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > Thanks Vignesh for the review. Addressed the comments in 0006 patch. > > > > > we can create a local variable and use in place of > > cstate->pcdata->curr_data_block. > > Done. > > > + if (cstate->raw_buf_index + sizeof(fld_count) >= (DATA_BLOCK_SIZE - 1)) > > + AdjustFieldInfo(cstate, 1); > > + > > + memcpy(&fld_count, > > &cstate->pcdata->curr_data_block->data[cstate->raw_buf_index], > > sizeof(fld_count)); > > Should this be like below, as the remaining size can fit in current block: > > if (cstate->raw_buf_index + sizeof(fld_count) >= DATA_BLOCK_SIZE) > > > > + if ((cstate->raw_buf_index + sizeof(fld_size)) >= (DATA_BLOCK_SIZE - 1)) > > + { > > + AdjustFieldInfo(cstate, 2); > > + *new_block_pos = pcshared_info->cur_block_pos; > > + } > > Same like above. > > Yes you are right. Changed. > > > > > + movebytes = DATA_BLOCK_SIZE - cstate->raw_buf_index; > > + > > + cstate->pcdata->curr_data_block->skip_bytes = movebytes; > > + > > + data_block = &pcshared_info->data_blocks[block_pos]; > > + > > + if (movebytes > 0) > > Instead of the above check, we can have an assert check for movebytes. > > No, we can't use assert here. For the edge case where the current data > block is full to the size DATA_BLOCK_SIZE, then movebytes will be 0, > but we need to get a new data block. We avoid memmove by having > movebytes>0 check. > > > + if (mode == 1) > > + { > > + cstate->pcdata->curr_data_block = data_block; > > + cstate->raw_buf_index = 0; > > + } > > + else if(mode == 2) > > + { > > + ParallelCopyDataBlock *prev_data_block = NULL; > > + prev_data_block = cstate->pcdata->curr_data_block; > > + prev_data_block->following_block = block_pos; > > + cstate->pcdata->curr_data_block = data_block; > > + > > + if (prev_data_block->curr_blk_completed == false) > > + prev_data_block->curr_blk_completed = true; > > + > > + cstate->raw_buf_index = 0; > > + } > > > > This code is common for both, keep in common flow and remove if (mode == 1) > > cstate->pcdata->curr_data_block = data_block; > > cstate->raw_buf_index = 0; > > > > Done. > > > +#define CHECK_FIELD_COUNT \ > > +{\ > > + if (fld_count == -1) \ > > + { \ > > + if (IsParallelCopy() && \ > > + !IsLeader()) \ > > + return true; \ > > + else if (IsParallelCopy() && \ > > + IsLeader()) \ > > + { \ > > + if > > (cstate->pcdata->curr_data_block->data[cstate->raw_buf_index + > > sizeof(fld_count)] != 0) \ > > + ereport(ERROR, \ > > + > > (errcode(ERRCODE_BAD_COPY_FILE_FORMAT), \ > > + errmsg("received copy > > data after EOF marker"))); \ > > + return true; \ > > + } \ > > We only copy sizeof(fld_count), Shouldn't we check fld_count != > > cstate->max_fields? Am I missing something here? > > fld_count != cstate->max_fields check is done after the above checks. > > > + if ((DATA_BLOCK_SIZE - cstate->raw_buf_index) >= fld_size) > > + { > > + cstate->raw_buf_index = cstate->raw_buf_index + fld_size; > > + } > > We can keep the check like cstate->raw_buf_index + fld_size < ..., for > > better readability and consistency. > > > > I think this is okay. It gives a good meaning that available bytes in > the current data block is greater or equal to fld_size then, the tuple > lies in the current data block. 
> > +static pg_attribute_always_inline void
> > +CopyReadBinaryAttributeLeader(CopyState cstate, FmgrInfo *flinfo,
> > + Oid typioparam, int32 typmod, uint32 *new_block_pos,
> > + int m, ParallelCopyTupleInfo *tuple_start_info_ptr,
> > + ParallelCopyTupleInfo *tuple_end_info_ptr, uint32 *line_size)
> > flinfo, typioparam & typmod are not used, we can remove the parameters.
>
> Done.
>
> > +static pg_attribute_always_inline void
> > +CopyReadBinaryAttributeLeader(CopyState cstate, FmgrInfo *flinfo,
> > + Oid typioparam, int32 typmod, uint32 *new_block_pos,
> > + int m, ParallelCopyTupleInfo *tuple_start_info_ptr,
> > + ParallelCopyTupleInfo *tuple_end_info_ptr, uint32 *line_size)
> > I felt this function need not be an inline function.
>
> Yes. Changed.
>
> > + /* binary format */
> > + /* for paralle copy leader, fill in the error
> > There are some typos, run spell check
>
> Done.
>
> > + /* raw_buf_index should never cross data block size,
> > + * as the required number of data blocks would have
> > + * been obtained in the above while loop.
> > + */
> > There are a few places where the commenting style should be changed to
> > the postgres style.
>
> Changed.
>
> > + if (cstate->pcdata->curr_data_block == NULL)
> > + {
> > + block_pos = WaitGetFreeCopyBlock(pcshared_info);
> > +
> > + cstate->pcdata->curr_data_block = &pcshared_info->data_blocks[block_pos];
> > +
> > + cstate->raw_buf_index = 0;
> > +
> > + readbytes = CopyGetData(cstate, &cstate->pcdata->curr_data_block->data, 1, DATA_BLOCK_SIZE);
> > +
> > + elog(DEBUG1, "LEADER - bytes read from file %d", readbytes);
> > +
> > + if (cstate->reached_eof)
> > + return true;
> > + }
> > There are many empty lines, these are not required.
>
> Removed.
>
> > +
> > + fld_count = (int16) pg_ntoh16(fld_count);
> > +
> > + CHECK_FIELD_COUNT;
> > +
> > + cstate->raw_buf_index = cstate->raw_buf_index + sizeof(fld_count);
> > + new_block_pos = pcshared_info->cur_block_pos;
> > You can run pg_indent once for the changes.
>
> I ran pg_indent and observed that there are many places getting
> modified by pg_indent. If we need to run pg_indent on copy.c for
> parallel copy alone, then first, we need to run it on plain copy.c,
> take those changes, and then run it for all parallel copy files. I
> think we had better run pg_indent for all the parallel copy patches
> once and for all, maybe just before we finish up all the code reviews.
>
> > + if (mode == 1)
> > + {
> > + cstate->pcdata->curr_data_block = data_block;
> > + cstate->raw_buf_index = 0;
> > + }
> > + else if(mode == 2)
> > + {
> > Could use macros for 1 & 2 for better readability.
>
> Done.
>
> > +
> > + if (following_block_id == -1)
> > + break;
> > + }
> > +
> > + if (following_block_id != -1)
> > + pg_atomic_add_fetch_u32(&pcshared_info->data_blocks[following_block_id].unprocessed_line_parts, 1);
> > +
> > + *line_size = *line_size + tuple_end_info_ptr->offset + 1;
> > + }
> > We could calculate the size as we parse and identify one record; if we
> > do it that way, this can be removed.
>
> Done.

Hi Bharath,

I was looking forward to review this patch-set but unfortunately it is
showing a reject in copy.c, and might need a rebase.
I was applying on master over the commit
cd22d3cdb9bd9963c694c01a8c0232bbae3ddcfb.

--
Regards,
Rafia Sabih
> Hi Bharath,
>
> I was looking forward to review this patch-set but unfortunately it is
> showing a reject in copy.c, and might need a rebase.
> I was applying on master over the commit
> cd22d3cdb9bd9963c694c01a8c0232bbae3ddcfb.
>

Thanks for showing interest. Please find the patch set rebased to
latest commit b1e48bbe64a411666bb1928b9741e112e267836d.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Attachment
- 0001-Copy-code-readjustment-to-support-parallel-copy.patch
- 0002-Framework-for-leader-worker-in-parallel-copy.patch
- 0003-Allow-copy-from-command-to-process-data-from-file.patch
- 0004-Documentation-for-parallel-copy.patch
- 0005-Tests-for-parallel-copy.patch
- 0006-Parallel-Copy-For-Binary-Format-Files.patch
On Sun, Jul 12, 2020 at 5:48 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> > Hi Bharath,
> >
> > I was looking forward to review this patch-set but unfortunately it is
> > showing a reject in copy.c, and might need a rebase.
> > I was applying on master over the commit
> > cd22d3cdb9bd9963c694c01a8c0232bbae3ddcfb.
>
> Thanks for showing interest. Please find the patch set rebased to
> latest commit b1e48bbe64a411666bb1928b9741e112e267836d.
>

Few comments:
====================
0001-Copy-code-readjustment-to-support-parallel-copy

I am not sure converting the code to macros is a good idea; it makes
this code harder to read. Also, there are a few changes which I am
not sure are necessary.

1.
+/*
+ * CLEAR_EOL_FROM_COPIED_DATA - Clear EOL from the copied data.
+ */
+#define CLEAR_EOL_FROM_COPIED_DATA(copy_line_data, copy_line_pos, copy_line_size) \
+{ \
+ /* \
+ * If we didn't hit EOF, then we must have transferred the EOL marker \
+ * to line_buf along with the data. Get rid of it. \
+ */ \
+ switch (cstate->eol_type) \
+ { \
+ case EOL_NL: \
+ Assert(copy_line_size >= 1); \
+ Assert(copy_line_data[copy_line_pos - 1] == '\n'); \
+ copy_line_data[copy_line_pos - 1] = '\0'; \
+ copy_line_size--; \
+ break; \
+ case EOL_CR: \
+ Assert(copy_line_size >= 1); \
+ Assert(copy_line_data[copy_line_pos - 1] == '\r'); \
+ copy_line_data[copy_line_pos - 1] = '\0'; \
+ copy_line_size--; \
+ break; \
+ case EOL_CRNL: \
+ Assert(copy_line_size >= 2); \
+ Assert(copy_line_data[copy_line_pos - 2] == '\r'); \
+ Assert(copy_line_data[copy_line_pos - 1] == '\n'); \
+ copy_line_data[copy_line_pos - 2] = '\0'; \
+ copy_line_size -= 2; \
+ break; \
+ case EOL_UNKNOWN: \
+ /* shouldn't get here */ \
+ Assert(false); \
+ break; \
+ } \
+}

In the original code, we are using only len and buffer; here we are
using position, length/size and buffer. Is it really required, or can
we do with just len and buffer?

2.
+/*
+ * INCREMENTPROCESSED - Increment the lines processed.
+ */
+#define INCREMENTPROCESSED(processed) \
+processed++;
+
+/*
+ * GETPROCESSED - Get the lines processed.
+ */
+#define GETPROCESSED(processed) \
+return processed;
+

I don't like converting the above to macros. I don't think converting
such things to macros will buy us much.

0002-Framework-for-leader-worker-in-parallel-copy
3.
/*
+ * Copy data block information.
+ */
+typedef struct ParallelCopyDataBlock

It is better to add a few comments atop this data structure to explain
how it is used.

4.
+ * ParallelCopyLineBoundary is common data structure between leader & worker,
+ * this is protected by the following sequence in the leader & worker.
+ * Leader should operate in the following order:
+ * 1) update first_block, start_offset & cur_lineno in any order.
+ * 2) update line_size.
+ * 3) update line_state.
+ * Worker should operate in the following order:
+ * 1) read line_size.
+ * 2) only one worker should choose one line for processing, this is handled by
+ * using pg_atomic_compare_exchange_u32, worker will change the sate to
+ * LINE_WORKER_PROCESSING only if line_state is LINE_LEADER_POPULATED.
+ * 3) read first_block, start_offset & cur_lineno in any order.
+ */
+typedef struct ParallelCopyLineBoundary

Here, you have mentioned how workers and leader should operate to make
sure access to the data is sane. However, you have not explained what
the problem is if they don't do so, and it is not apparent to me.
Also, it is not very clear from the comments what the purpose of this
data structure is.

5.
+/*
+ * Circular queue used to store the line information.
+ */
+typedef struct ParallelCopyLineBoundaries
+{
+ /* Position for the leader to populate a line. */
+ uint32 leader_pos;

I don't think the variable needs to be named as leader_pos; it is okay
to name it as 'pos', as the comment above it explains its usage.

7.
+#define DATA_BLOCK_SIZE RAW_BUF_SIZE
+#define RINGSIZE (10 * 1000)
+#define MAX_BLOCKS_COUNT 1000
+#define WORKER_CHUNK_COUNT 50 /* should be mod of RINGSIZE */

It would be good if you can write a few comments to explain why you
have chosen these default values.

8.
ParallelCopyCommonKeyData, shall we name this as
SerializedParallelCopyState or something like that? For example, see
SerializedSnapshotData, which has been used to pass snapshot
information to workers.

9.
+CopyCommonInfoForWorker(CopyState cstate, ParallelCopyCommonKeyData *shared_cstate)

If you agree with point 8, then let's name this as
SerializeParallelCopyState. See if there is more usage of similar
types in the patch; then let's change those as well.

10.
+ * in the DSM. The specified number of workers will then be launched.
+ *
+ */
+static ParallelContext*
+BeginParallelCopy(int nworkers, CopyState cstate, List *attnamelist, Oid relid)

No need of an extra line with only '*' in the above multi-line comment.

11.
BeginParallelCopy(..)
{
..
+ EstimateLineKeysStr(pcxt, cstate->null_print);
+ EstimateLineKeysStr(pcxt, cstate->null_print_client);
+ EstimateLineKeysStr(pcxt, cstate->delim);
+ EstimateLineKeysStr(pcxt, cstate->quote);
+ EstimateLineKeysStr(pcxt, cstate->escape);
..
}

Why do we need to do this separately for each variable of cstate?
Can't we serialize it along with other members of
SerializeParallelCopyState (a new name for ParallelCopyCommonKeyData)?

12.
BeginParallelCopy(..)
{
..
+ LaunchParallelWorkers(pcxt);
+ if (pcxt->nworkers_launched == 0)
+ {
+ EndParallelCopy(pcxt);
+ elog(WARNING,
+ "No workers available, copy will be run in non-parallel mode");
..
}

I don't see the need to issue a WARNING if we are not able to launch
workers. We don't do that for other cases where we fail to launch
workers.

13.
+}
+/*
+ * ParallelCopyMain -
..

+}
+/*
+ * ParallelCopyLeader

One line space is required before starting a new function.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Thanks for the comments, Amit. I have worked on them; my thoughts are
mentioned below.

On Wed, Jul 15, 2020 at 10:34 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Few comments:
> ====================
> 0001-Copy-code-readjustment-to-support-parallel-copy
>
> I am not sure converting the code to macros is a good idea; it makes
> this code harder to read. Also, there are a few changes which I am
> not sure are necessary.
> 1.
> +/*
> + * CLEAR_EOL_FROM_COPIED_DATA - Clear EOL from the copied data.
> + */
> +#define CLEAR_EOL_FROM_COPIED_DATA(copy_line_data, copy_line_pos, copy_line_size) \
> +{ \
> + /* \
> + * If we didn't hit EOF, then we must have transferred the EOL marker \
> + * to line_buf along with the data. Get rid of it. \
> + */ \
> + switch (cstate->eol_type) \
> + { \
> + case EOL_NL: \
> + Assert(copy_line_size >= 1); \
> + Assert(copy_line_data[copy_line_pos - 1] == '\n'); \
> + copy_line_data[copy_line_pos - 1] = '\0'; \
> + copy_line_size--; \
> + break; \
> + case EOL_CR: \
> + Assert(copy_line_size >= 1); \
> + Assert(copy_line_data[copy_line_pos - 1] == '\r'); \
> + copy_line_data[copy_line_pos - 1] = '\0'; \
> + copy_line_size--; \
> + break; \
> + case EOL_CRNL: \
> + Assert(copy_line_size >= 2); \
> + Assert(copy_line_data[copy_line_pos - 2] == '\r'); \
> + Assert(copy_line_data[copy_line_pos - 1] == '\n'); \
> + copy_line_data[copy_line_pos - 2] = '\0'; \
> + copy_line_size -= 2; \
> + break; \
> + case EOL_UNKNOWN: \
> + /* shouldn't get here */ \
> + Assert(false); \
> + break; \
> + } \
> +}
>
> In the original code, we are using only len and buffer; here we are
> using position, length/size and buffer. Is it really required, or can
> we do with just len and buffer?
>

Position is required so that we can have common code for parallel &
non-parallel copy; in the case of parallel copy, position & length will
differ as a line can spread across multiple data blocks. Retained the
variables as is. Changed the macro to a function.

> 2.
> +/*
> + * INCREMENTPROCESSED - Increment the lines processed.
> + */
> +#define INCREMENTPROCESSED(processed) \
> +processed++;
> +
> +/*
> + * GETPROCESSED - Get the lines processed.
> + */
> +#define GETPROCESSED(processed) \
> +return processed;
> +
>
> I don't like converting the above to macros. I don't think converting
> such things to macros will buy us much.
>

This macro will be extended in
0003-Allow-copy-from-command-to-process-data-from-file.patch:

+#define INCREMENTPROCESSED(processed) \
+{ \
+ if (!IsParallelCopy()) \
+ processed++; \
+ else \
+ pg_atomic_add_fetch_u64(&cstate->pcdata->pcshared_info->processed, 1); \
+}

This needs to be a macro so that it can handle both parallel copy and
non-parallel copy. I am retaining this as a macro; if you insist, I can
move the change to the
0003-Allow-copy-from-command-to-process-data-from-file.patch patch.

> 0002-Framework-for-leader-worker-in-parallel-copy
> 3.
> /*
> + * Copy data block information.
> + */
> +typedef struct ParallelCopyDataBlock
>
> It is better to add a few comments atop this data structure to explain
> how it is used.
>

Fixed.

> 4.
> + * ParallelCopyLineBoundary is common data structure between leader & worker,
> + * this is protected by the following sequence in the leader & worker.
> + * Leader should operate in the following order:
> + * 1) update first_block, start_offset & cur_lineno in any order.
> + * 2) update line_size.
> + * 3) update line_state.
> + * Worker should operate in the following order:
> + * 1) read line_size.
> + * 2) only one worker should choose one line for processing, this is handled by
> + * using pg_atomic_compare_exchange_u32, worker will change the sate to
> + * LINE_WORKER_PROCESSING only if line_state is LINE_LEADER_POPULATED.
> + * 3) read first_block, start_offset & cur_lineno in any order.
> + */
> +typedef struct ParallelCopyLineBoundary
>
> Here, you have mentioned how workers and leader should operate to make
> sure access to the data is sane. However, you have not explained what
> the problem is if they don't do so, and it is not apparent to me.
> Also, it is not very clear from the comments what the purpose of this
> data structure is.
>

Fixed.

> 5.
> +/*
> + * Circular queue used to store the line information.
> + */
> +typedef struct ParallelCopyLineBoundaries
> +{
> + /* Position for the leader to populate a line. */
> + uint32 leader_pos;
>
> I don't think the variable needs to be named as leader_pos; it is okay
> to name it as 'pos', as the comment above it explains its usage.
>

Fixed.

> 7.
> +#define DATA_BLOCK_SIZE RAW_BUF_SIZE
> +#define RINGSIZE (10 * 1000)
> +#define MAX_BLOCKS_COUNT 1000
> +#define WORKER_CHUNK_COUNT 50 /* should be mod of RINGSIZE */
>
> It would be good if you can write a few comments to explain why you
> have chosen these default values.
>

Fixed.

> 8.
> ParallelCopyCommonKeyData, shall we name this as
> SerializedParallelCopyState or something like that? For example, see
> SerializedSnapshotData, which has been used to pass snapshot
> information to workers.
>

Renamed as suggested.

> 9.
> +CopyCommonInfoForWorker(CopyState cstate, ParallelCopyCommonKeyData *shared_cstate)
>
> If you agree with point 8, then let's name this as
> SerializeParallelCopyState. See if there is more usage of similar
> types in the patch; then let's change those as well.
>

Fixed.

> 10.
> + * in the DSM. The specified number of workers will then be launched.
> + *
> + */
> +static ParallelContext*
> +BeginParallelCopy(int nworkers, CopyState cstate, List *attnamelist, Oid relid)
>
> No need of an extra line with only '*' in the above multi-line comment.
>

Fixed.

> 11.
> BeginParallelCopy(..)
> {
> ..
> + EstimateLineKeysStr(pcxt, cstate->null_print);
> + EstimateLineKeysStr(pcxt, cstate->null_print_client);
> + EstimateLineKeysStr(pcxt, cstate->delim);
> + EstimateLineKeysStr(pcxt, cstate->quote);
> + EstimateLineKeysStr(pcxt, cstate->escape);
> ..
> }
>
> Why do we need to do this separately for each variable of cstate?
> Can't we serialize it along with other members of
> SerializeParallelCopyState (a new name for ParallelCopyCommonKeyData)?
>

These are variable-length string variables; I felt we will not be able
to serialize them along with the other members, and they need to be
serialized separately.

> 12.
> BeginParallelCopy(..)
> {
> ..
> + LaunchParallelWorkers(pcxt);
> + if (pcxt->nworkers_launched == 0)
> + {
> + EndParallelCopy(pcxt);
> + elog(WARNING,
> + "No workers available, copy will be run in non-parallel mode");
> ..
> }
>
> I don't see the need to issue a WARNING if we are not able to launch
> workers. We don't do that for other cases where we fail to launch
> workers.
>

Fixed.

> 13.
> +}
> +/*
> + * ParallelCopyMain -
> ..
>
> +}
> +/*
> + * ParallelCopyLeader
>
> One line space is required before starting a new function.
>

Fixed.

Please find the updated patch with the fixes included.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Attachment
- 0001-Copy-code-readjustment-to-support-parallel-copy.patch
- 0002-Framework-for-leader-worker-in-parallel-copy.patch
- 0003-Allow-copy-from-command-to-process-data-from-file-ST.patch
- 0004-Documentation-for-parallel-copy.patch
- 0005-Tests-for-parallel-copy.patch
- 0006-Parallel-Copy-For-Binary-Format-Files.patch
>
> Please find the updated patch with the fixes included.
>

Patch 0003-Allow-copy-from-command-to-process-data-from-file-ST.patch
had few indentation issues, I have fixed and attached the patch for
the same.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Attachment
- 0001-Copy-code-readjustment-to-support-parallel-copy.patch
- 0002-Framework-for-leader-worker-in-parallel-copy.patch
- 0003-Allow-copy-from-command-to-process-data-from-file-ST.patch
- 0004-Documentation-for-parallel-copy.patch
- 0005-Tests-for-parallel-copy.patch
- 0006-Parallel-Copy-For-Binary-Format-Files.patch
Some review comments (mostly) from the leader side code changes:
1) Do we need a DSM key for the FORCE_QUOTE option? I think FORCE_QUOTE option is only used with COPY TO and not COPY FROM so not sure why you have added it.
PARALLEL_COPY_KEY_FORCE_QUOTE_LIST
2) Should we be allocating the parallel copy data structure only when it is confirmed that the parallel copy is allowed?
pcdata = (ParallelCopyData *) palloc0(sizeof(ParallelCopyData));
cstate->pcdata = pcdata;
Or, if you want it to be allocated before confirming if Parallel copy is allowed or not, then I think it would be good to allocate it in the *cstate->copycontext* memory context so that when EndCopy is called towards the end of the COPY FROM operation, the entire context itself gets deleted, thereby freeing the memory space allocated for pcdata. In fact it would be good to ensure that all the local memory allocated inside the cstate structure gets allocated in the *cstate->copycontext* memory context.
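For illustration, a minimal sketch of that suggestion, assuming the patch's ParallelCopyData type (the names come from the patch under review, not committed code):

    /*
     * Sketch only: allocate pcdata inside cstate->copycontext so that
     * EndCopy()'s deletion of that context reclaims it automatically.
     */
    MemoryContext oldcontext;

    oldcontext = MemoryContextSwitchTo(cstate->copycontext);
    cstate->pcdata = (ParallelCopyData *) palloc0(sizeof(ParallelCopyData));
    MemoryContextSwitchTo(oldcontext);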
3) Should we allow Parallel Copy when the insert method is CIM_MULTI_CONDITIONAL?
+ /* Check if the insertion mode is single. */
+ if (FindInsertMethod(cstate) == CIM_SINGLE)
+ return false;
I know we have added checks in CopyFrom() to ensure that if any trigger (before row or instead of) is found on any of partition being loaded with data, then COPY FROM operation would fail, but does it mean that we are okay to perform parallel copy on partitioned table. Have we done some performance testing with the partitioned table where the data in the input file needs to be routed to the different partitions?
4) There are lot of if-checks in IsParallelCopyAllowed function that are checked in CopyFrom function as well which means in case of Parallel Copy those checks will get executed multiple times (first by the leader and from second time onwards by each worker process). Is that required?
5) Should the worker process be calling this function when the leader has already called it once in ExecBeforeStmtTrigger()?
/* Verify the named relation is a valid target for INSERT */
CheckValidResultRel(resultRelInfo, CMD_INSERT);
6) I think it would be good to re-write the comments atop ParallelCopyLeader(). From the present comments it appears as if you were trying to put the information pointwise but somehow you ended up putting in a paragraph. The comments also have some typos like *line beaks* which possibly means line breaks. This is applicable for other comments as well where you
7) Is the following checking equivalent to IsWorker()? If so, it would be good to replace it with an IsWorker like macro to increase the readability.
(IsParallelCopy() && !IsLeader())
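For instance, the suggested macro could be as simple as the following sketch (IsParallelCopy() and IsLeader() are the patch's existing helpers; IsWorker is the hypothetical name):

    /* Hypothetical helper mirroring the patch's IsLeader() macro. */
    #define IsWorker() (IsParallelCopy() && !IsLeader())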
On Fri, Jul 17, 2020 at 2:09 PM vignesh C <vignesh21@gmail.com> wrote:
>
> Please find the updated patch with the fixes included.
>
> Patch 0003-Allow-copy-from-command-to-process-data-from-file-ST.patch
> had few indentation issues, I have fixed and attached the patch for
> the same.
>
> Regards,
> Vignesh
> EnterpriseDB: http://www.enterprisedb.com
On Fri, Jul 17, 2020 at 2:09 PM vignesh C <vignesh21@gmail.com> wrote:
>
> >
> > Please find the updated patch with the fixes included.
> >
>
> Patch 0003-Allow-copy-from-command-to-process-data-from-file-ST.patch
> had few indentation issues, I have fixed and attached the patch for
> the same.
>

Ensure to use the version with each patch-series as that makes it
easier for the reviewer to verify the changes done in the latest
version of the patch. One way is to use commands like "git
format-patch -6 -v <version_of_patch_series>" or you can add the
version number manually.

Review comments:
===================

0001-Copy-code-readjustment-to-support-parallel-copy
1.
@@ -807,8 +835,11 @@ CopyLoadRawBuf(CopyState cstate)
 else
 nbytes = 0; /* no data need be saved */

+ if (cstate->copy_dest == COPY_NEW_FE)
+ minread = RAW_BUF_SIZE - nbytes;
+
 inbytes = CopyGetData(cstate, cstate->raw_buf + nbytes,
- 1, RAW_BUF_SIZE - nbytes);
+ minread, RAW_BUF_SIZE - nbytes);

No comment to explain why this change is done?

0002-Framework-for-leader-worker-in-parallel-copy
2.
+ * ParallelCopyLineBoundary is common data structure between leader & worker,
+ * Leader process will be populating data block, data block offset & the size of
+ * the record in DSM for the workers to copy the data into the relation.
+ * This is protected by the following sequence in the leader & worker. If they
+ * don't follow this order the worker might process wrong line_size and leader
+ * might populate the information which worker has not yet processed or in the
+ * process of processing.
+ * Leader should operate in the following order:
+ * 1) check if line_size is -1, if not wait, it means worker is still
+ * processing.
+ * 2) set line_state to LINE_LEADER_POPULATING.
+ * 3) update first_block, start_offset & cur_lineno in any order.
+ * 4) update line_size.
+ * 5) update line_state to LINE_LEADER_POPULATED.
+ * Worker should operate in the following order:
+ * 1) check line_state is LINE_LEADER_POPULATED, if not it means leader is still
+ * populating the data.
+ * 2) read line_size.
+ * 3) only one worker should choose one line for processing, this is handled by
+ * using pg_atomic_compare_exchange_u32, worker will change the sate to
+ * LINE_WORKER_PROCESSING only if line_state is LINE_LEADER_POPULATED.
+ * 4) read first_block, start_offset & cur_lineno in any order.
+ * 5) process line_size data.
+ * 6) update line_size to -1.
+ */
+typedef struct ParallelCopyLineBoundary

Are we doing all this state management to avoid using locks while
processing lines? If so, I think we can use either spinlock or LWLock
to keep the main patch simple and then provide a later patch to make
it lock-less. This will allow us to first focus on the main design of
the patch rather than trying to make this datastructure processing
lock-less in the best possible way.

3.
+ /*
+ * Actual lines inserted by worker (some records will be filtered based on
+ * where condition).
+ */
+ pg_atomic_uint64 processed;
+ pg_atomic_uint64 total_worker_processed; /* total processed records by the workers */

The difference between processed and total_worker_processed is not
clear. Can we expand the comments a bit?

4.
+ * SerializeList - Insert a list into shared memory.
+ */
+static void
+SerializeList(ParallelContext *pcxt, int key, List *inputlist,
+ Size est_list_size)
+{
+ if (inputlist != NIL)
+ {
+ ParallelCopyKeyListInfo *sharedlistinfo = (ParallelCopyKeyListInfo *)shm_toc_allocate(pcxt->toc,
+ est_list_size);
+ CopyListSharedMemory(inputlist, est_list_size, sharedlistinfo);
+ shm_toc_insert(pcxt->toc, key, sharedlistinfo);
+ }
+}

Why do we need to write a special mechanism (CopyListSharedMemory) to
serialize a list. Why can't we use nodeToString? It should be able
to take care of List datatype, see outNode which is called from
nodeToString. Once you do that, I think you won't need even
EstimateLineKeysList, strlen should work instead.

Check, if you have any similar special handling for other types that
can be dealt with nodeToString?

5.
+ MemSet(shared_info_ptr, 0, est_shared_info);
+ shared_info_ptr->is_read_in_progress = true;
+ shared_info_ptr->cur_block_pos = -1;
+ shared_info_ptr->full_transaction_id = full_transaction_id;
+ shared_info_ptr->mycid = GetCurrentCommandId(true);
+ for (count = 0; count < RINGSIZE; count++)
+ {
+ ParallelCopyLineBoundary *lineInfo = &shared_info_ptr->line_boundaries.ring[count];
+ pg_atomic_init_u32(&(lineInfo->line_size), -1);
+ }
+

You can move this initialization in a separate function.

6.
In function BeginParallelCopy(), you need to keep a provision to
collect wal_usage and buf_usage stats. See _bt_begin_parallel for
reference. Those will be required for pg_stat_statements.

7.
DeserializeString() -- it is better to name this function as RestoreString.
ParallelWorkerInitialization() -- it is better to name this function
as InitializeParallelCopyInfo or something like that, the current name
is quite confusing.
ParallelCopyLeader() -- how about ParallelCopyFrom? ParallelCopyLeader
doesn't sound good to me. You can suggest something else if you don't
like ParallelCopyFrom.

8.
/*
- * PopulateGlobalsForCopyFrom - Populates the common variables required for copy
- * from operation. This is a helper function for BeginCopy function.
+ * PopulateCatalogInformation - Populates the common variables required for copy
+ * from operation. This is a helper function for BeginCopy &
+ * ParallelWorkerInitialization function.
 */
 static void
 PopulateGlobalsForCopyFrom(CopyState cstate, TupleDesc tupDesc,
- List *attnamelist)
+ List *attnamelist)

The actual function name and the name in the function header don't match.
I also don't like this function name, how about
PopulateCommonCstateInfo? Similarly how about changing
PopulateCatalogInformation to PopulateCstateCatalogInfo?

9.
+static const struct
+{
+ char *fn_name;
+ copy_data_source_cb fn_addr;
+} InternalParallelCopyFuncPtrs[] =
+
+{
+ {
+ "copy_read_data", copy_read_data
+ },
+};

The function copy_read_data is present in
src/backend/replication/logical/tablesync.c and seems to be used
during logical replication. Why do we want to expose this function as
part of this patch?

0003-Allow-copy-from-command-to-process-data-from-file-ST
10.
In the commit message, you have written "The leader does not
participate in the insertion of data, leaders only responsibility will
be to identify the lines as fast as possible for the workers to do the
actual copy operation. The leader waits till all the lines populated
are processed by the workers and exits."
I think you should also mention that we have chosen this design based
on the reason "that everything stalls if the leader doesn't accept
further input data, as well as when there are no available splitted
chunks so it doesn't seem like a good idea to have the leader do other
work. This is backed by the performance data where we have seen that
with 1 worker there is just a 5-10% (or whatever percentage difference
you have seen) performance difference".

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
Thanks for your comments, Amit. I have worked on them; my thoughts on the same are mentioned below.
On Tue, Jul 21, 2020 at 3:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Jul 17, 2020 at 2:09 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > >
> > > Please find the updated patch with the fixes included.
> > >
> >
> > Patch 0003-Allow-copy-from-command-to-process-data-from-file-ST.patch
> > had few indentation issues, I have fixed and attached the patch for
> > the same.
> >
>
> Ensure to use the version with each patch-series as that makes it
> easier for the reviewer to verify the changes done in the latest
> version of the patch. One way is to use commands like "git
> format-patch -6 -v <version_of_patch_series>" or you can add the
> version number manually.
>
Taken care.
> Review comments:
> ===================
>
> 0001-Copy-code-readjustment-to-support-parallel-copy
> 1.
> @@ -807,8 +835,11 @@ CopyLoadRawBuf(CopyState cstate)
> else
> nbytes = 0; /* no data need be saved */
>
> + if (cstate->copy_dest == COPY_NEW_FE)
> + minread = RAW_BUF_SIZE - nbytes;
> +
> inbytes = CopyGetData(cstate, cstate->raw_buf + nbytes,
> - 1, RAW_BUF_SIZE - nbytes);
> + minread, RAW_BUF_SIZE - nbytes);
>
> No comment to explain why this change is done?
>
> 0002-Framework-for-leader-worker-in-parallel-copy
Currently, CopyGetData copies a lesser amount of data into the buffer even though space is available, because minread was passed as 1 to CopyGetData. Because of this, there are frequent calls to CopyGetData for fetching the data. It loads only some of the data due to the below check:
while (maxread > 0 && bytesread < minread && !cstate->reached_eof)
After reading some data, bytesread becomes greater than minread (which is passed as 1) and the function returns with a lesser amount of data, even though there is still space.
This change is required for the parallel copy feature, as each time we get a new DSM data block of 64K size and copy the data into it. If we copy less data into the DSM data blocks, we might end up consuming all of them. I felt this issue can be fixed in HEAD and have posted a separate thread [1] for it. I'm planning to remove that change once it gets committed. Can that go as a separate patch, or should we include it here?
[1] - https://www.postgresql.org/message-id/CALDaNm0v4CjmvSnftYnx_9pOS_dKRG%3DO3NnBgJsQmi0KipvLog%40mail.gmail.com
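To illustrate the effect, here is a sketch of the call site only (block->data is illustrative naming; CopyGetData and DATA_BLOCK_SIZE are from the patch, and the exact upstream code differs):

    int nread;

    /* Before: minread = 1, so CopyGetData() may return after a single
     * small frontend message, leaving most of the 64KB block unused. */
    nread = CopyGetData(cstate, block->data, 1, DATA_BLOCK_SIZE);

    /* After: request the whole block, so the read loop keeps going
     * until the block is full or EOF is reached. */
    nread = CopyGetData(cstate, block->data, DATA_BLOCK_SIZE, DATA_BLOCK_SIZE);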
> 2.
> + * ParallelCopyLineBoundary is common data structure between leader & worker,
> + * Leader process will be populating data block, data block offset &
> the size of
> + * the record in DSM for the workers to copy the data into the relation.
> + * This is protected by the following sequence in the leader & worker. If they
> + * don't follow this order the worker might process wrong line_size and leader
> + * might populate the information which worker has not yet processed or in the
> + * process of processing.
> + * Leader should operate in the following order:
> + * 1) check if line_size is -1, if not wait, it means worker is still
> + * processing.
> + * 2) set line_state to LINE_LEADER_POPULATING.
> + * 3) update first_block, start_offset & cur_lineno in any order.
> + * 4) update line_size.
> + * 5) update line_state to LINE_LEADER_POPULATED.
> + * Worker should operate in the following order:
> + * 1) check line_state is LINE_LEADER_POPULATED, if not it means
> leader is still
> + * populating the data.
> + * 2) read line_size.
> + * 3) only one worker should choose one line for processing, this is handled by
> + * using pg_atomic_compare_exchange_u32, worker will change the sate to
> + * LINE_WORKER_PROCESSING only if line_state is LINE_LEADER_POPULATED.
> + * 4) read first_block, start_offset & cur_lineno in any order.
> + * 5) process line_size data.
> + * 6) update line_size to -1.
> + */
> +typedef struct ParallelCopyLineBoundary
>
> Are we doing all this state management to avoid using locks while
> processing lines? If so, I think we can use either spinlock or LWLock
> to keep the main patch simple and then provide a later patch to make
> it lock-less. This will allow us to first focus on the main design of
> the patch rather than trying to make this datastructure processing
> lock-less in the best possible way.
>
The steps will be more or less the same if we use a spinlock too: steps 1, 3 and 4 will be common, and we would have to use lock & unlock instead of steps 2 and 5. I feel we can retain the current implementation.
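For reference, the worker-side claim described above boils down to something like the following sketch (field and constant names follow the patch, assuming line_state is a pg_atomic_uint32 as the compare-exchange implies; ProcessLine() is a placeholder for the actual work):

    uint32 expected = LINE_LEADER_POPULATED;

    /* Only one worker's compare-and-exchange succeeds, so exactly one
     * worker owns the line; the others move on to the next slot. */
    if (pg_atomic_compare_exchange_u32(&lineInfo->line_state,
                                       &expected,
                                       LINE_WORKER_PROCESSING))
    {
        /* Safe to read first_block, start_offset and cur_lineno now. */
        ProcessLine(lineInfo);                          /* placeholder */
        pg_atomic_write_u32(&lineInfo->line_size, -1);  /* free the slot */
    }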
> 3.
> + /*
> + * Actual lines inserted by worker (some records will be filtered based on
> + * where condition).
> + */
> + pg_atomic_uint64 processed;
> + pg_atomic_uint64 total_worker_processed; /* total processed records
> by the workers */
>
> The difference between processed and total_worker_processed is not
> clear. Can we expand the comments a bit?
>
Fixed
> 4.
> + * SerializeList - Insert a list into shared memory.
> + */
> +static void
> +SerializeList(ParallelContext *pcxt, int key, List *inputlist,
> + Size est_list_size)
> +{
> + if (inputlist != NIL)
> + {
> + ParallelCopyKeyListInfo *sharedlistinfo = (ParallelCopyKeyListInfo
> *)shm_toc_allocate(pcxt->toc,
> + est_list_size);
> + CopyListSharedMemory(inputlist, est_list_size, sharedlistinfo);
> + shm_toc_insert(pcxt->toc, key, sharedlistinfo);
> + }
> +}
>
> Why do we need to write a special mechanism (CopyListSharedMemory) to
> serialize a list. Why can't we use nodeToString? It should be able
> to take care of List datatype, see outNode which is called from
> nodeToString. Once you do that, I think you won't need even
> EstimateLineKeysList, strlen should work instead.
>
> Check, if you have any similar special handling for other types that
> can be dealt with nodeToString?
>
Fixed
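As a rough sketch of the nodeToString() approach adopted here (variable names are illustrative; shm_toc_* are the standard parallel-context APIs):

    /* Leader: serialize any Node, including a List, as its string form. */
    char    *liststr = nodeToString(inputlist);
    char    *shmptr = (char *) shm_toc_allocate(pcxt->toc, strlen(liststr) + 1);

    strcpy(shmptr, liststr);
    shm_toc_insert(pcxt->toc, key, shmptr);

    /* Worker: restore the List from the serialized string. */
    List    *restored = (List *) stringToNode(shm_toc_lookup(toc, key, false));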
> 5.
> + MemSet(shared_info_ptr, 0, est_shared_info);
> + shared_info_ptr->is_read_in_progress = true;
> + shared_info_ptr->cur_block_pos = -1;
> + shared_info_ptr->full_transaction_id = full_transaction_id;
> + shared_info_ptr->mycid = GetCurrentCommandId(true);
> + for (count = 0; count < RINGSIZE; count++)
> + {
> + ParallelCopyLineBoundary *lineInfo =
> &shared_info_ptr->line_boundaries.ring[count];
> + pg_atomic_init_u32(&(lineInfo->line_size), -1);
> + }
> +
>
> You can move this initialization in a separate function.
>
Fixed
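The extracted helper might look like this sketch (the shared-info type and function names are assumed; the loop body is taken from the quoted code):

    static void
    InitializeLineBoundaries(ParallelCopyShmInfo *shared_info_ptr)
    {
        int count;

        /* Mark every ring slot as free (-1) before workers start. */
        for (count = 0; count < RINGSIZE; count++)
        {
            ParallelCopyLineBoundary *lineInfo =
                &shared_info_ptr->line_boundaries.ring[count];

            pg_atomic_init_u32(&(lineInfo->line_size), -1);
        }
    }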
> 6.
> In function BeginParallelCopy(), you need to keep a provision to
> collect wal_usage and buf_usage stats. See _bt_begin_parallel for
> reference. Those will be required for pg_stat_statements.
>
Fixed
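For context, the provision being asked for follows the usual parallel-operation pattern; a sketch modeled on _bt_begin_parallel() (the bufferusage/walusage arrays are assumed to live in the DSM, and the TOC wiring is omitted):

    /* Leader, while sizing the DSM: one usage slot per worker. */
    shm_toc_estimate_chunk(&pcxt->estimator,
                           mul_size(sizeof(BufferUsage), pcxt->nworkers));
    shm_toc_estimate_chunk(&pcxt->estimator,
                           mul_size(sizeof(WalUsage), pcxt->nworkers));

    /* Worker, around the copy work: */
    InstrStartParallelQuery();
    /* ... perform the parallel copy ... */
    InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
                          &walusage[ParallelWorkerNumber]);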
> 7.
> DeserializeString() -- it is better to name this function as RestoreString.
> ParallelWorkerInitialization() -- it is better to name this function
> as InitializeParallelCopyInfo or something like that, the current name
> is quite confusing.
> ParallelCopyLeader() -- how about ParallelCopyFrom? ParallelCopyLeader
> doesn't sound good to me. You can suggest something else if you don't
> like ParallelCopyFrom
>
Fixed
> 8.
> /*
> - * PopulateGlobalsForCopyFrom - Populates the common variables
> required for copy
> - * from operation. This is a helper function for BeginCopy function.
> + * PopulateCatalogInformation - Populates the common variables
> required for copy
> + * from operation. This is a helper function for BeginCopy &
> + * ParallelWorkerInitialization function.
> */
> static void
> PopulateGlobalsForCopyFrom(CopyState cstate, TupleDesc tupDesc,
> - List *attnamelist)
> + List *attnamelist)
>
> The actual function name and the name in function header don't match.
> I also don't like this function name, how about
> PopulateCommonCstateInfo? Similarly how about changing
> PopulateCatalogInformation to PopulateCstateCatalogInfo?
>
Fixed
> 9.
> +static const struct
> +{
> + char *fn_name;
> + copy_data_source_cb fn_addr;
> +} InternalParallelCopyFuncPtrs[] =
> +
> +{
> + {
> + "copy_read_data", copy_read_data
> + },
> +};
>
> The function copy_read_data is present in
> src/backend/replication/logical/tablesync.c and seems to be used
> during logical replication. Why do we want to expose this function as
> part of this patch?
>
I was thinking we could include the framework to support parallelism for logical replication too, and it could be enhanced when needed. I have now removed this in the new patch provided; it can be added back whenever required.
> 0003-Allow-copy-from-command-to-process-data-from-file-ST
> 10.
> In the commit message, you have written "The leader does not
> participate in the insertion of data, leaders only responsibility will
> be to identify the lines as fast as possible for the workers to do the
> actual copy operation. The leader waits till all the lines populated
> are processed by the workers and exits."
>
> I think you should also mention that we have chosen this design based
> on the reason "that everything stalls if the leader doesn't accept
> further input data, as well as when there are no available splitted
> chunks so it doesn't seem like a good idea to have the leader do other
> work. This is backed by the performance data where we have seen that
> with 1 worker there is just a 5-10% (or whatever percentage difference
> you have seen) performance difference)".
Fixed.
Please find the new patch attached with the fixes.
Thoughts?
Attachment
- v2-0001-Copy-code-readjustment-to-support-parallel-copy.patch
- v2-0002-Framework-for-leader-worker-in-parallel-copy.patch
- v2-0003-Allow-copy-from-command-to-process-data-from-file.patch
- v2-0004-Documentation-for-parallel-copy.patch
- v2-0005-Tests-for-parallel-copy.patch
- v2-0006-Parallel-Copy-For-Binary-Format-Files.patch
Thanks for reviewing and providing the comments Ashutosh.
Please find my thoughts below:
On Fri, Jul 17, 2020 at 7:18 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Some review comments (mostly) from the leader side code changes:
>
> 1) Do we need a DSM key for the FORCE_QUOTE option? I think FORCE_QUOTE option is only used with COPY TO and not COPY FROM so not sure why you have added it.
>
> PARALLEL_COPY_KEY_FORCE_QUOTE_LIST
>
Fixed
> 2) Should we be allocating the parallel copy data structure only when it is confirmed that the parallel copy is allowed?
>
> pcdata = (ParallelCopyData *) palloc0(sizeof(ParallelCopyData));
> cstate->pcdata = pcdata;
>
> Or, if you want it to be allocated before confirming if Parallel copy is allowed or not, then I think it would be good to allocate it in *cstate->copycontext* memory context so that when EndCopy is called towards the end of the COPY FROM operation, the entire context itself gets deleted thereby freeing the memory space allocated for pcdata. In fact it would be good to ensure that all the local memory allocated inside the ctstate structure gets allocated in the *cstate->copycontext* memory context.
>
Fixed
> 3) Should we allow Parallel Copy when the insert method is CIM_MULTI_CONDITIONAL?
>
> + /* Check if the insertion mode is single. */
> + if (FindInsertMethod(cstate) == CIM_SINGLE)
> + return false;
>
> I know we have added checks in CopyFrom() to ensure that if any trigger (before row or instead of) is found on any of partition being loaded with data, then COPY FROM operation would fail, but does it mean that we are okay to perform parallel copy on partitioned table. Have we done some performance testing with the partitioned table where the data in the input file needs to be routed to the different partitions?
>
Partition data is handled like what Amit had told in one of earlier mails [1]. My colleague Bharath has run performance test with partition table, he will be sharing the results.
> 4) There are lot of if-checks in IsParallelCopyAllowed function that are checked in CopyFrom function as well which means in case of Parallel Copy those checks will get executed multiple times (first by the leader and from second time onwards by each worker process). Is that required?
>
It is called from BeginParallelCopy, so it will be called only once. This change is ok.
> 5) Should the worker process be calling this function when the leader has already called it once in ExecBeforeStmtTrigger()?
>
> /* Verify the named relation is a valid target for INSERT */
> CheckValidResultRel(resultRelInfo, CMD_INSERT);
>
Fixed.
> 6) I think it would be good to re-write the comments atop ParallelCopyLeader(). From the present comments it appears as if you were trying to put the information pointwise but somehow you ended up putting in a paragraph. The comments also have some typos like *line beaks* which possibly means line breaks. This is applicable for other comments as well where you
>
Fixed.
> 7) Is the following checking equivalent to IsWorker()? If so, it would be good to replace it with an IsWorker like macro to increase the readability.
>
> (IsParallelCopy() && !IsLeader())
>
Fixed.
These have been fixed and the new patch is attached as part of my previous mail.
[1] - https://www.postgresql.org/message-id/CAA4eK1LQPxULxw8JpucX0PwzQQRk%3Dq4jG32cU1us2%2B-mtzZUQg%40mail.gmail.com
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jul 22, 2020 at 7:56 PM vignesh C <vignesh21@gmail.com> wrote:
>
> Thanks for reviewing and providing the comments Ashutosh.
> Please find my thoughts below:
>
> On Fri, Jul 17, 2020 at 7:18 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> >
> > Some review comments (mostly) from the leader side code changes:
> >
> > 3) Should we allow Parallel Copy when the insert method is CIM_MULTI_CONDITIONAL?
> >
> > + /* Check if the insertion mode is single. */
> > + if (FindInsertMethod(cstate) == CIM_SINGLE)
> > + return false;
> >
> > I know we have added checks in CopyFrom() to ensure that if any trigger (before row or instead of) is found on any of partition being loaded with data, then COPY FROM operation would fail, but does it mean that we are okay to perform parallel copy on partitioned table. Have we done some performance testing with the partitioned table where the data in the input file needs to be routed to the different partitions?
> >
>
> Partition data is handled like what Amit had told in one of earlier mails [1]. My colleague Bharath has run performance test with partition table, he will be sharing the results.
I ran tests for partitioned use cases - results are similar to that of non partitioned cases[1].
parallel workers | test case 1(exec time in sec): copy from csv file, 5.1GB, 10million tuples, 4 range partitions, 3 indexes on integer columns unique data | test case 2(exec time in sec): copy from csv file, 5.1GB, 10million tuples, 4 range partitions, unique data |
0 | 205.403(1X) | 135(1X) |
2 | 114.724(1.79X) | 59.388(2.27X) |
4 | 99.017(2.07X) | 56.742(2.34X) |
8 | 99.722(2.06X) | 66.323(2.03X) |
16 | 98.147(2.09X) | 66.054(2.04X) |
20 | 97.723(2.1X) | 66.389(2.03X) |
30 | 97.048(2.11X) | 70.568(1.91X) |
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Jul 23, 2020 at 8:51 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> I ran tests for partitioned use cases - results are similar to those of
> non-partitioned cases [1].
>
I could see the gain up to 10-11 times for non-partitioned cases [1]; can we use a similar test case here as well (with one of the indexes on a text column, or having a gist index) to see its impact?

[1] - https://www.postgresql.org/message-id/CALj2ACVR4WE98Per1H7ajosW8vafN16548O2UV8bG3p4D3XnPg%40mail.gmail.com
I think, when doing the performance testing for a partitioned table, it would be good to also mention the distribution of data in the input file. One possible data distribution could be that we have, let's say, 100 tuples in the input file, and every consecutive tuple belongs to a different partition.
On Thu, Jul 23, 2020 at 9:22 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
>>
>> I ran tests for partitioned use cases - results are similar to that of non partitioned cases[1].
>
>
> I could see the gain up to 10-11 times for non-partitioned cases [1], can we use similar test case here as well (with one of the indexes on text column or having gist index) to see its impact?
>
> [1] - https://www.postgresql.org/message-id/CALj2ACVR4WE98Per1H7ajosW8vafN16548O2UV8bG3p4D3XnPg%40mail.gmail.com
Thanks Amit! Please find the results of detailed testing done for partitioned use cases:
Range Partitions: consecutive rows go into the same partition.
parallel workers | test case 1 (exec time in sec): copy from csv file, 2 indexes on integer columns and 1 index on text column, 4 range partitions | test case 2 (exec time in sec): copy from csv file, 1 gist index on text column, 4 range partitions | test case 3 (exec time in sec): copy from csv file, 3 indexes on integer columns, 4 range partitions |
0 | 1051.924(1X) | 785.052(1X) | 205.403(1X) |
2 | 589.576(1.78X) | 421.974(1.86X) | 114.724(1.79X) |
4 | 321.960(3.27X) | 230.997(3.4X) | 99.017(2.07X) |
8 | 199.245(5.23X) | 156.132(5.02X) | 99.722(2.06X) |
16 | 127.343(8.26X) | 173.696(4.52X) | 98.147(2.09X) |
20 | 122.029(8.62X) | 186.418(4.21X) | 97.723(2.1X) |
30 | 142.876(7.36X) | 214.598(3.66X) | 97.048(2.11X) |
On Thu, Jul 23, 2020 at 10:21 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> I think, when doing the performance testing for a partitioned table, it would be good to also mention the distribution of data in the input file. One possible data distribution could be that we have, let's say, 100 tuples in the input file, and every consecutive tuple belongs to a different partition.
>
To address Ashutosh's point, I used hash partitioning. Hope this helps to clear the doubt.
Hash Partitions: there is a high chance that consecutive rows go into different partitions.
parallel workers | test case 1 (exec time in sec): copy from csv file, 2 indexes on integer columns and 1 index on text column, 4 hash partitions | test case 2 (exec time in sec): copy from csv file, 1 gist index on text column, 4 hash partitions | test case 3 (exec time in sec): copy from csv file, 3 indexes on integer columns, 4 hash partitions |
0 | 1060.884(1X) | 812.283(1X) | 207.745(1X) |
2 | 572.542(1.85X) | 418.454(1.94X) | 107.850(1.93X) |
4 | 298.132(3.56X) | 227.367(3.57X) | 83.895(2.48X) |
8 | 169.449(6.26X) | 137.993(5.89X) | 85.411(2.43X) |
16 | 112.297(9.45X) | 95.167(8.53X) | 96.136(2.16X) |
20 | 101.546(10.45X) | 90.552(8.97X) | 97.066(2.14X) |
30 | 113.877(9.32X) | 127.17(6.38X) | 96.819(2.14X) |
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
The patches were not applying because of the recent commits.
I have rebased the patch over head & attached.
Attachment
On Sat, Aug 1, 2020 at 9:55 AM vignesh C <vignesh21@gmail.com> wrote:
>
> The patches were not applying because of the recent commits.
> I have rebased the patch over head & attached.
>

I rebased v2-0006-Parallel-Copy-For-Binary-Format-Files.patch.

Putting together all the patches, rebased onto the latest commit b8fdee7d0ca8bd2165d46fb1468f75571b706a01: patches 0001 to 0005 are rebased by Vignesh and are from the previous mail, and patch 0006 is rebased by me.

Please consider this patch set for further review.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Attachment
- v2-0001-Copy-code-readjustment-to-support-parallel-copy.patch
- v2-0002-Framework-for-leader-worker-in-parallel-copy.patch
- v2-0003-Allow-copy-from-command-to-process-data-from-file.patch
- v2-0004-Documentation-for-parallel-copy.patch
- v2-0005-Tests-for-parallel-copy.patch
- v2-0006-Parallel-Copy-For-Binary-Format-Files.patch
On Mon, Aug 03, 2020 at 12:33:48PM +0530, Bharath Rupireddy wrote:
> I rebased v2-0006-Parallel-Copy-For-Binary-Format-Files.patch.
>
> Putting together all the patches, rebased onto the latest commit
> b8fdee7d0ca8bd2165d46fb1468f75571b706a01.
>
> Please consider this patch set for further review.
>

I'd suggest incrementing the version every time an updated version is submitted, even if it's just a rebased version. It makes it clearer which version of the code is being discussed, etc.

regards

--
Tomas Vondra
http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Aug 4, 2020 at 9:51 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> I'd suggest incrementing the version every time an updated version is
> submitted, even if it's just a rebased version. It makes it clearer
> which version of the code is being discussed, etc.
>

Sure, we will take care of this when we are sending the next set of patches.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
The following review has been posted through the commitfest application:

make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: tested, failed

Hi,

I don't claim to yet understand all of the Postgres internals that this patch is updating and interacting with, so I'm still testing and debugging portions of this patch, but would like to give feedback on what I've noticed so far.

I have done some ad-hoc testing of the patch using parallel copies from text/csv/binary files and have not yet struck any execution problems, other than some option validation and associated error messages on boundary cases.

One general question that I have: is there a user benefit (over the normal non-parallel COPY) to allowing "COPY ... FROM ... WITH (PARALLEL 1)"?

My following comments are broken down by patch:

(1) v2-0001-Copy-code-readjustment-to-support-parallel-copy.patch

(i) Whilst I can't entirely blame these patches for it (as they are following what is already there), I can't help noticing the use of numerous macros in src/backend/commands/copy.c which paste in multiple lines of code in various places. It's getting a little out of hand. Surely the majority of these would be best as inline functions instead? Perhaps this hasn't been done because too many parameters need to be passed - thoughts?

(2) v2-0002-Framework-for-leader-worker-in-parallel-copy.patch

(i) Minor point: there are some tabbing/spacing issues in this patch (and the other patches), affecting alignment, e.g. mixed tabs/spaces and misalignment in the PARALLEL_COPY_KEY_xxx definitions.

(ii)

+/*
+ * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data
+ * block to process to avoid lock contention. This value should be mode of
+ * RINGSIZE, as wrap around cases is currently not handled while selecting the
+ * WORKER_CHUNK_COUNT by the worker.
+ */
+#define WORKER_CHUNK_COUNT 50

"This value should be mode of RINGSIZE ..."

-> typo: mode (mod? should evenly divide into RINGSIZE?)

(iii)

+ * using pg_atomic_compare_exchange_u32, worker will change the sate to

-> typo: sate (should be "state")

(iv)

+ errmsg("parallel option supported only for copy from"),

-> suggest change to: errmsg("parallel option is supported only for COPY FROM"),

(v)

+ errno = 0; /* To distinguish success/failure after call */
+ val = strtol(str, &endptr, 10);
+
+ /* Check for various possible errors */
+ if ((errno == ERANGE && (val == LONG_MAX || val == LONG_MIN))
+ || (errno != 0 && val == 0) ||
+ *endptr)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("improper use of argument to option \"%s\"",
+ defel->defname),
+ parser_errposition(pstate, defel->location)));
+
+ if (endptr == str)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("no digits were found in argument to option \"%s\"",
+ defel->defname),
+ parser_errposition(pstate, defel->location)));
+
+ cstate->nworkers = (int) val;
+
+ if (cstate->nworkers <= 0)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("argument to option \"%s\" must be a positive integer greater than zero",
+ defel->defname),
+ parser_errposition(pstate, defel->location)));

I think this validation code needs to be improved, including the error messages (e.g. when can a "positive integer" NOT be greater than zero?).
There is some overlap in the "no digits were found" case between the two conditions above, depending, for example, on whether the argument is quoted.
Also, "improper use of argument to option" sounds a bit odd and vague to me.
Finally, not range-checking before casting long to int can lead to allowing out-of-range int values, as in the following case:

test=# copy mytable from '/myspace/test_pcopy/tmp.dat' (parallel '-2147483648');
ERROR:  argument to option "parallel" must be a positive integer greater than zero
LINE 1: copy mytable from '/myspace/test_pcopy/tmp.dat' (parallel '-2...

BUT the following is allowed...

test=# copy mytable from '/myspace/test_pcopy/tmp.dat' (parallel '-2147483649');
COPY 1000000

I'd suggest changing the above validation code to do similar validation to that for the CREATE TABLE parallel_workers storage parameter (case RELOPT_TYPE_INT in reloptions.c). Like that code, wouldn't it be best to range-check the integer option value to be within a reasonable range, say 1 to 1024, with a corresponding errdetail message if possible?

(3) v2-0003-Allow-copy-from-command-to-process-data-from-file.patch

(i) Patch comment says:

"This feature allows the copy from to leverage multiple CPUs in order to copy
data from file/STDIN to a table. This adds a PARALLEL option to COPY FROM
command where the user can specify the number of workers that can be used
to perform the COPY FROM command. Specifying zero as number of workers will
disable parallelism."

BUT - the changes to ProcessCopyOptions() specified in "v2-0002-Framework-for-leader-worker-in-parallel-copy.patch" do not allow zero workers to be specified - you get an error in that case. The patch comment should be updated accordingly.

(ii)

#define GETPROCESSED(processed) \
-return processed;
+if (!IsParallelCopy()) \
+ return processed; \
+else \
+ return pg_atomic_read_u64(&cstate->pcdata->pcshared_info->processed);
+

I think GETPROCESSED would be better named "RETURNPROCESSED".

(iii) The below comment seems out-of-date with the current code - is it referring to the loop embedded at the bottom of the current loop that the comment is within?

+ /*
+ * There is a possibility that the above loop has come out because
+ * data_blk_ptr->curr_blk_completed is set, but dataSize read might
+ * be an old value, if data_blk_ptr->curr_blk_completed and the line is
+ * completed, line_size will be set. Read the line_size again to be
+ * sure if it is complete or partial block.
+ */

(iv) I may be wrong here, but in the following block of code, isn't there a window of opportunity (however small) in which the line_state might be updated (LINE_WORKER_PROCESSED) by another worker just AFTER pg_atomic_read_u32() returns the current line_state which is put into curr_line_state, such that a write_pos update might be missed? And then a race condition exists for reading/setting line_size (since line_size gets atomically set after line_state is set)?
If I am wrong in thinking this synchronization might not be correct, maybe the comments could be improved here to explain how this code is safe in that respect.

+ /* Get the current line information. */
+ lineInfo = &pcshared_info->line_boundaries.ring[write_pos];
+ curr_line_state = pg_atomic_read_u32(&lineInfo->line_state);
+ if ((write_pos % WORKER_CHUNK_COUNT == 0) &&
+ (curr_line_state == LINE_WORKER_PROCESSED ||
+ curr_line_state == LINE_WORKER_PROCESSING))
+ {
+ pcdata->worker_processed_pos = write_pos;
+ write_pos = (write_pos + WORKER_CHUNK_COUNT) % RINGSIZE;
+ continue;
+ }
+
+ /* Get the size of this line. */
+ dataSize = pg_atomic_read_u32(&lineInfo->line_size);
+
+ if (dataSize != 0) /* If not an empty line. */
+ {
+ /* Get the block information. */
+ data_blk_ptr = &pcshared_info->data_blocks[lineInfo->first_block];
+
+ if (!data_blk_ptr->curr_blk_completed && (dataSize == -1))
+ {
+ /* Wait till the current line or block is added. */
+ COPY_WAIT_TO_PROCESS()
+ continue;
+ }
+ }
+
+ /* Make sure that no worker has consumed this element. */
+ if (pg_atomic_compare_exchange_u32(&lineInfo->line_state,
+ &line_state, LINE_WORKER_PROCESSING))
+ break;

(4) v2-0004-Documentation-for-parallel-copy.patch

(i) I think that it is necessary to mention the "max_worker_processes" option in the description of the COPY statement PARALLEL option.

For example, something like:

+ Perform <command>COPY FROM</command> in parallel using <replaceable
+ class="parameter"> integer</replaceable> background workers. Please
+ note that it is not guaranteed that the number of parallel workers
+ specified in <replaceable class="parameter">integer</replaceable> will
+ be used during execution. It is possible for a copy to run with fewer
+ workers than specified, or even with no workers at all (for example,
+ due to the setting of max_worker_processes). This option is allowed
+ only in <command>COPY FROM</command>.

(5) v2-0005-Tests-for-parallel-copy.patch

(i) None of the provided tests seem to test beyond "PARALLEL 2".

(6) v2-0006-Parallel-Copy-For-Binary-Format-Files.patch

(i) In the ParallelCopyFrom() function, "cstate->raw_buf" is pfree()d:

+ /* raw_buf is not used in parallel copy, instead data blocks are used.*/
+ pfree(cstate->raw_buf);

This comment doesn't seem to be entirely true.
At least for text/csv file COPY FROM, cstate->raw_buf is subsequently referenced in the SetRawBufForLoad() function, which is called by CopyReadLineText():

cur_data_blk_ptr = (cstate->raw_buf) ? &pcshared_info->data_blocks[cur_block_pos] : NULL;

So I think cstate->raw_buf should be set to NULL after being pfree()d, and the comment fixed/adjusted.

(ii) This patch adds some macros (involving parallel copy checks) AFTER the comment:

/* End parallel copy Macros */

Regards,
Greg Nancarrow
Fujitsu Australia
Thanks Greg for reviewing the patch. Please find my thoughts on your comments below.

On Wed, Aug 12, 2020 at 9:10 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
> I have done some ad-hoc testing of the patch using parallel copies from
> text/csv/binary files and have not yet struck any execution problems,
> other than some option validation and associated error messages on
> boundary cases.
>
> One general question that I have: is there a user benefit (over the normal
> non-parallel COPY) to allowing "COPY ... FROM ... WITH (PARALLEL 1)"?
>

There will be a marginal improvement, as the worker only needs to process the data and need not do the file reading; the file reading will have been done by the main process. The real improvement can be seen from 2 workers onwards.

> (1) v2-0001-Copy-code-readjustment-to-support-parallel-copy.patch
>
> (i) Whilst I can't entirely blame these patches for it (as they are
> following what is already there), I can't help noticing the use of numerous
> macros in src/backend/commands/copy.c which paste in multiple lines of code
> in various places.
> It's getting a little out of hand. Surely the majority of these would be
> best as inline functions instead?
> Perhaps this hasn't been done because too many parameters need to be
> passed - thoughts?
>

I felt they have used macros mainly because this is a tight loop and having macros gives better performance. I have added the macros CLEAR_EOL_LINE, INCREMENTPROCESSED & GETPROCESSED because there will be a slight difference between parallel copy and non-parallel copy for these. In the remaining patches the macros will be extended to include the parallel copy logic. Instead of having checks in the core logic, I thought of keeping them as macros so that the readability is good.

> (2) v2-0002-Framework-for-leader-worker-in-parallel-copy.patch
>
> (i) Minor point: there are some tabbing/spacing issues in this patch (and
> the other patches), affecting alignment, e.g. mixed tabs/spaces and
> misalignment in the PARALLEL_COPY_KEY_xxx definitions.
>

Fixed.

> (ii)
>
> +#define WORKER_CHUNK_COUNT 50
>
> "This value should be mode of RINGSIZE ..."
>
> -> typo: mode (mod? should evenly divide into RINGSIZE?)
>

Fixed, changed it to "divisible by".
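As a side note, this multiple-of relationship could also be enforced mechanically at compile time; a minimal sketch, assuming the constants keep the values quoted in this thread (10000 and 50) and that c.h's StaticAssertDecl is available:

/*
 * Sketch only: fail compilation if the ring size stops being a whole
 * multiple of the per-worker chunk count, since a partially filled
 * chunk wrapping around the ring is not handled.
 */
#define RINGSIZE			(10 * 1000)
#define WORKER_CHUNK_COUNT	50

StaticAssertDecl(RINGSIZE % WORKER_CHUNK_COUNT == 0,
				 "RINGSIZE must be a multiple of WORKER_CHUNK_COUNT");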
> (iii)
>
> + * using pg_atomic_compare_exchange_u32, worker will change the sate to
>
> -> typo: sate (should be "state")
>

Fixed.

> (iv)
>
> + errmsg("parallel option supported only for copy from"),
>
> -> suggest change to: errmsg("parallel option is supported only for COPY FROM"),
>

Fixed.

> (v)
>
> [strtol-based validation code for the PARALLEL option]
>
> I think this validation code needs to be improved, including the error
> messages (e.g. when can a "positive integer" NOT be greater than zero?).
> There is some overlap in the "no digits were found" case between the two
> conditions above, depending, for example, on whether the argument is quoted.
> Also, "improper use of argument to option" sounds a bit odd and vague to me.
> Finally, not range-checking before casting long to int can lead to allowing
> out-of-range int values.
>
> I'd suggest changing the above validation code to do similar validation to
> that for the CREATE TABLE parallel_workers storage parameter (case
> RELOPT_TYPE_INT in reloptions.c). Like that code, wouldn't it be best to
> range-check the integer option value to be within a reasonable range, say
> 1 to 1024, with a corresponding errdetail message if possible?
>

Fixed, changed as suggested.

> (3) v2-0003-Allow-copy-from-command-to-process-data-from-file.patch
>
> (i) Patch comment says:
>
> "... Specifying zero as number of workers will disable parallelism."
>
> BUT - the changes to ProcessCopyOptions() specified in
> "v2-0002-Framework-for-leader-worker-in-parallel-copy.patch" do not allow
> zero workers to be specified - you get an error in that case. The patch
> comment should be updated accordingly.
>

Removed "Specifying zero as number of workers will disable parallelism", as the new valid range is 1 to 1024.
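For illustration, a hedged sketch of the kind of range-checked validation this converges on; it leans on parse_int() from guc.h rather than raw strtol(), and the 1 to 1024 bounds follow the review suggestion (whether the patch does it exactly this way is an assumption):

/*
 * Sketch of PARALLEL option validation inside ProcessCopyOptions().
 * parse_int() rejects non-integer strings and values that overflow
 * int, which closes the long-to-int cast hole shown earlier.
 */
int			nworkers;

if (!parse_int(defGetString(defel), &nworkers, 0, NULL))
	ereport(ERROR,
			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
			 errmsg("value for option \"%s\" must be an integer",
					defel->defname),
			 parser_errposition(pstate, defel->location)));

/* Range-check, as is done for integer reloptions. */
if (nworkers < 1 || nworkers > 1024)
	ereport(ERROR,
			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
			 errmsg("value %d out of bounds for option \"%s\"",
					nworkers, defel->defname),
			 errdetail("Valid values are between %d and %d.", 1, 1024),
			 parser_errposition(pstate, defel->location)));

cstate->nworkers = nworkers;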
> (ii)
>
> #define GETPROCESSED(processed) \
> -return processed;
> +if (!IsParallelCopy()) \
> + return processed; \
> +else \
> + return pg_atomic_read_u64(&cstate->pcdata->pcshared_info->processed);
>
> I think GETPROCESSED would be better named "RETURNPROCESSED".
>

Fixed.

> (iii) The below comment seems out-of-date with the current code - is it
> referring to the loop embedded at the bottom of the current loop that the
> comment is within?
>

Updated; it is referring to the embedded loop at the bottom of the current loop.

> (iv) I may be wrong here, but in the following block of code, isn't there a
> window of opportunity (however small) in which the line_state might be
> updated (LINE_WORKER_PROCESSED) by another worker just AFTER
> pg_atomic_read_u32() returns the current line_state which is put into
> curr_line_state, such that a write_pos update might be missed? And then a
> race condition exists for reading/setting line_size (since line_size gets
> atomically set after line_state is set)?
> If I am wrong in thinking this synchronization might not be correct, maybe
> the comments could be improved here to explain how this code is safe in
> that respect.
>
> [worker loop that reads curr_line_state/line_size and then claims the line
> with pg_atomic_compare_exchange_u32]
>

This is not possible because of pg_atomic_compare_exchange_u32: it will succeed only for one of the workers whose line_state is LINE_LEADER_POPULATED; for the other workers it will fail. This is explained in detail above ParallelCopyLineBoundary.

> (4) v2-0004-Documentation-for-parallel-copy.patch
>
> (i) I think that it is necessary to mention the "max_worker_processes"
> option in the description of the COPY statement PARALLEL option.
>
> For example, something like:
>
> + Perform <command>COPY FROM</command> in parallel using <replaceable
> + class="parameter"> integer</replaceable> background workers. Please
> + note that it is not guaranteed that the number of parallel workers
> + specified in <replaceable class="parameter">integer</replaceable> will
> + be used during execution. It is possible for a copy to run with fewer
> + workers than specified, or even with no workers at all (for example,
> + due to the setting of max_worker_processes). This option is allowed
> + only in <command>COPY FROM</command>.
>

Fixed.

> (5) v2-0005-Tests-for-parallel-copy.patch
>
> (i) None of the provided tests seem to test beyond "PARALLEL 2".
>

I intentionally ran with 1 parallel worker, because when you specify more than 1 parallel worker the order of record insertion can vary, and there may be random failures.

> (6) v2-0006-Parallel-Copy-For-Binary-Format-Files.patch
>
> (i) In the ParallelCopyFrom() function, "cstate->raw_buf" is pfree()d:
>
> + /* raw_buf is not used in parallel copy, instead data blocks are used.*/
> + pfree(cstate->raw_buf);
>

raw_buf is not used in parallel copy; instead, raw_buf will point to shared memory data blocks. This memory was allocated as part of BeginCopyFrom; up until this point we cannot be 100% sure, as the copy can still be performed sequentially (for example, if max_worker_processes is not available). If it switches to sequential mode, raw_buf will be used while performing the copy operation. At this place we can safely free the memory that was allocated.

> This comment doesn't seem to be entirely true.
> At least for text/csv file COPY FROM, cstate->raw_buf is subsequently
> referenced in the SetRawBufForLoad() function, which is called by
> CopyReadLineText():
>
> cur_data_blk_ptr = (cstate->raw_buf) ? &pcshared_info->data_blocks[cur_block_pos] : NULL;
>
> So I think cstate->raw_buf should be set to NULL after being pfree()d, and
> the comment fixed/adjusted.
>
> (ii) This patch adds some macros (involving parallel copy checks) AFTER
> the comment:
>
> /* End parallel copy Macros */
>

Fixed, moved the macros above the comment.

I have attached a new set of patches with the fixes.
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Attachment
- v3-0001-Copy-code-readjustment-to-support-parallel-copy.patch
- v3-0002-Framework-for-leader-worker-in-parallel-copy.patch
- v3-0003-Allow-copy-from-command-to-process-data-from-file.patch
- v3-0004-Documentation-for-parallel-copy.patch
- v3-0005-Tests-for-parallel-copy.patch
- v3-0006-Parallel-Copy-For-Binary-Format-Files.patch
Hi Vignesh,

Some further comments:

(1) v3-0002-Framework-for-leader-worker-in-parallel-copy.patch

+/*
+ * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data
+ * block to process to avoid lock contention. This value should be divisible by
+ * RINGSIZE, as wrap around cases is currently not handled while selecting the
+ * WORKER_CHUNK_COUNT by the worker.
+ */
+#define WORKER_CHUNK_COUNT 50

"This value should be divisible by RINGSIZE" is not a correct statement (since obviously 50 is not divisible by 10000).
It should say something like "This value should evenly divide into RINGSIZE", or "RINGSIZE should be a multiple of WORKER_CHUNK_COUNT".

(2) v3-0003-Allow-copy-from-command-to-process-data-from-file.patch

(i)

+ /*
+ * If the data is present in current block lineInfo. line_size
+ * will be updated. If the data is spread across the blocks either

Somehow a space has been put between "lineInfo." and "line_size".
It should be: "If the data is present in current block lineInfo.line_size will be updated".

(ii)

> This is not possible because of pg_atomic_compare_exchange_u32: it will
> succeed only for one of the workers whose line_state is
> LINE_LEADER_POPULATED; for the other workers it will fail. This is
> explained in detail above ParallelCopyLineBoundary.

Yes, but prior to that call to pg_atomic_compare_exchange_u32(), aren't you separately reading line_state and line_size, so that between those reads it may have transitioned from leader to another worker, such that the read line state ("curr_line_state", being checked in the if block) may not actually match what is now in line_state, and/or the read line_size ("dataSize") doesn't actually correspond to the read line state?

(Sorry, still not 100% convinced that the synchronization and checks are safe in all cases.)

(3) v3-0006-Parallel-Copy-For-Binary-Format-Files.patch

> raw_buf is not used in parallel copy; instead, raw_buf will point to
> shared memory data blocks. This memory was allocated as part of
> BeginCopyFrom ... At this place we can safely free the memory that was
> allocated.

So the following code (which checks raw_buf, which still points to memory that has been pfreed) is still valid?

In the SetRawBufForLoad() function, which is called by CopyReadLineText():

cur_data_blk_ptr = (cstate->raw_buf) ? &pcshared_info->data_blocks[cur_block_pos] : NULL;

The above code looks a bit dicey to me. I stepped over that line in the debugger when I debugged an instance of Parallel Copy, so it definitely gets executed.
It makes me wonder what other code could possibly be checking raw_buf and using it in some way, when in fact what it points to has been pfreed.

Are you able to add the following line of code, or will it (somehow) break logic that you are relying on?

pfree(cstate->raw_buf);
cstate->raw_buf = NULL;   <=== I suggest that this line is added

Regards,
Greg Nancarrow
Fujitsu Australia
Thanks Greg for reviewing the patch. Please find my thoughts on your comments below.

On Mon, Aug 17, 2020 at 9:44 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> (1) v3-0002-Framework-for-leader-worker-in-parallel-copy.patch
>
> "This value should be divisible by RINGSIZE" is not a correct statement
> (since obviously 50 is not divisible by 10000).
> It should say something like "This value should evenly divide into
> RINGSIZE", or "RINGSIZE should be a multiple of WORKER_CHUNK_COUNT".
>

Fixed. Changed it to "RINGSIZE should be a multiple of WORKER_CHUNK_COUNT".

> (2) v3-0003-Allow-copy-from-command-to-process-data-from-file.patch
>
> (i) Somehow a space has been put between "lineInfo." and "line_size".
> It should be: "If the data is present in current block
> lineInfo.line_size will be updated".
>

Fixed, changed it to lineInfo->line_size.

> (ii) Yes, but prior to that call to pg_atomic_compare_exchange_u32(),
> aren't you separately reading line_state and line_size, so that between
> those reads it may have transitioned from leader to another worker, such
> that the read line state ("curr_line_state", being checked in the if
> block) may not actually match what is now in line_state, and/or the read
> line_size ("dataSize") doesn't actually correspond to the read line state?
>
> (Sorry, still not 100% convinced that the synchronization and checks are
> safe in all cases.)
>

I think you are describing the problem that could happen in the following case: when we read curr_line_state, the value was LINE_WORKER_PROCESSED or LINE_WORKER_PROCESSING; then, if the leader is very fast compared to the workers, the leader quickly populates one line and sets the state to LINE_LEADER_POPULATED, so the state changes to LINE_LEADER_POPULATED while we are checking curr_line_state.
I feel this will not be a problem because the leader populates a line and then waits until some ring element is available to populate. In the meantime, the worker has seen that the state is LINE_WORKER_PROCESSED or LINE_WORKER_PROCESSING (the previous state that it read), so it identifies that this chunk was processed by some other worker, moves on, and tries to get the next available chunk and insert those records. It will keep continuing until it gets the next chunk to process. Eventually one of the workers will get this chunk and process it.

> (3) v3-0006-Parallel-Copy-For-Binary-Format-Files.patch
>
> So the following code (which checks raw_buf, which still points to memory
> that has been pfreed) is still valid?
>
> In the SetRawBufForLoad() function, which is called by CopyReadLineText():
>
> cur_data_blk_ptr = (cstate->raw_buf) ? &pcshared_info->data_blocks[cur_block_pos] : NULL;
>
> Are you able to add the following line of code, or will it (somehow)
> break logic that you are relying on?
>
> pfree(cstate->raw_buf);
> cstate->raw_buf = NULL;   <=== I suggest that this line is added
>

You are right; I have debugged and verified that it sets it to an invalid block, which is not expected. There is a chance this would have caused some corruption on some machines. The suggested fix is required, and I have made it. I have moved this change to 0003-Allow-copy-from-command-to-process-data-from-file.patch, as 0006-Parallel-Copy-For-Binary-Format-Files is only for binary-format parallel copy and this change is common to parallel copy.

I have attached a new set of patches with the fixes.
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Attachment
- v4-0001-Copy-code-readjustment-to-support-parallel-copy.patch
- v4-0002-Framework-for-leader-worker-in-parallel-copy.patch
- v4-0003-Allow-copy-from-command-to-process-data-from-file.patch
- v4-0004-Documentation-for-parallel-copy.patch
- v4-0005-Tests-for-parallel-copy.patch
- v4-0006-Parallel-Copy-For-Binary-Format-Files.patch
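To make the line-claiming argument in the last two mails easier to follow, here is a minimal sketch of the protocol as described in this thread (field and constant names follow the quoted patch excerpts; the surrounding retry loop and error handling are elided):

/*
 * Sketch: how exactly one worker wins a populated line.  The CAS only
 * succeeds if line_state still equals LINE_LEADER_POPULATED at the
 * instant of the exchange, so a stale earlier read of the state cannot
 * cause two workers to process the same line; a loser simply sees the
 * CAS fail and moves on to the next ring position.
 */
uint32		expected = LINE_LEADER_POPULATED;

if (pg_atomic_compare_exchange_u32(&lineInfo->line_state,
								   &expected,
								   LINE_WORKER_PROCESSING))
{
	/* This worker owns the line; reading line_size is now safe. */
	uint32		dataSize = pg_atomic_read_u32(&lineInfo->line_size);

	/* ... copy out and insert the line ... */

	pg_atomic_write_u32(&lineInfo->line_state, LINE_WORKER_PROCESSED);
}
else
{
	/* Another worker claimed the line; skip ahead. */
}

In this reading, the earlier relaxed reads of line_state and line_size are only an optimization to skip obviously taken positions; correctness rests on the exchange itself.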
> I have attached a new set of patches with the fixes.
> Thoughts?

Hi Vignesh,

I don't really have any further comments on the code, but would like to share some results of some Parallel Copy performance tests I ran (attached).

The tests loaded a 5GB CSV data file into a 100 column table (of different data types). The following were varied as part of the test:
- Number of workers (1 – 10)
- No indexes / 4 indexes
- Default settings / increased resources (shared_buffers, work_mem, etc.)

(I did not do any partition-related tests, as I believe those types of tests were previously performed.)

I built Postgres (latest OSS code) with the latest Parallel Copy patches (v4).
The test system was a 32-core Intel Xeon E5-4650 server with 378GB of RAM.

I observed the following trends:
- For the data file size used, Parallel Copy achieved best performance using about 9 – 10 workers. Larger data files may benefit from using more workers. However, I couldn't really see any better performance, for example, from using 16 workers on a 10GB CSV data file compared to using 8 workers. Results may also vary depending on machine characteristics.
- Parallel Copy with 1 worker ran slower than normal Copy in a couple of cases (I did question whether allowing 1 worker was useful in my patch review).
- Typical load-time improvement (load factor) for Parallel Copy was between 2x and 3x. Better load factors can be obtained by using larger data files and/or more indexes.
- Increasing Postgres resources made little or no difference to Parallel Copy performance when the target table had no indexes. Increasing Postgres resources improved Parallel Copy performance when the target table had indexes.

Regards,
Greg Nancarrow
Fujitsu Australia
Attachment
On Thu, Aug 27, 2020 at 8:04 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> I observed the following trends:
> - For the data file size used, Parallel Copy achieved best performance
> using about 9 – 10 workers.
> - Parallel Copy with 1 worker ran slower than normal Copy in a couple
> of cases (I did question whether allowing 1 worker was useful in my
> patch review).

I think the reason is that for the 1 worker case there is not much parallelization, as the leader doesn't perform the actual load work. Vignesh, can you please once see if the results are reproducible at your end? If so, we can compare the perf profiles to see why in some cases we get improvement and in other cases not. Based on that we can decide whether to allow the 1 worker case or not.

> - Typical load-time improvement (load factor) for Parallel Copy was
> between 2x and 3x. Better load factors can be obtained by using larger
> data files and/or more indexes.

Nice improvement, and I think you are right that with larger load data we will get even better improvement.

--
With Regards,
Amit Kapila.
On Thu, Aug 27, 2020 at 8:24 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> I think the reason is that for the 1 worker case there is not much
> parallelization, as the leader doesn't perform the actual load work.
> Vignesh, can you please once see if the results are reproducible at
> your end? If so, we can compare the perf profiles to see why in some
> cases we get improvement and in other cases not. Based on that we can
> decide whether to allow the 1 worker case or not.
>

I will spend some time on this and update.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Thu, Aug 27, 2020 at 4:56 PM vignesh C <vignesh21@gmail.com> wrote:
>
> I will spend some time on this and update.
>

Thanks.

--
With Regards,
Amit Kapila.
On Thu, Aug 27, 2020 at 8:04 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
> - Parallel Copy with 1 worker ran slower than normal Copy in a couple
> of cases (I did question if allowing 1 worker was useful in my patch
> review).
Thanks Greg for your review & testing.
I executed various tests with 1GB, 2GB & 5GB csv files with 100 columns, without parallel mode and with 1 parallel worker. The test results are given below:
Test | Without parallel mode | With 1 Parallel worker |
1GB csv file 100 columns (100 bytes data in each column) | 62 seconds | 47 seconds (1.32X) |
1GB csv file 100 columns (1000 bytes data in each column) | 89 seconds | 78 seconds (1.14X) |
2GB csv file 100 columns (1 byte data in each column) | 277 seconds | 256 seconds (1.08X) |
5GB csv file 100 columns (100 byte data in each column) | 515 seconds | 445 seconds (1.16X) |
I ran the tests multiple times and noticed similar execution times in all the runs for the above tests.
In the above results there is a slight improvement with 1 worker. In my tests I did not observe any degradation for copy with 1 worker compared to non-parallel copy. Can you share with me the script you used to generate the data and the DDL of the table, so that it will help me check the scenario where you faced the problem?
Hi Vignesh,

> Can you share with me the script you used to generate the data and the
> DDL of the table, so that it will help me check the scenario where you
> faced the problem?

Unfortunately I can't directly share it (it's considered company IP), though having said that, it's only doing something relatively simple and unremarkable, so I'd expect it to be much like what you are currently doing. I can describe it in general.

The table being used contains 100 columns (as I pointed out earlier), with the first column of "bigserial" type, and the others of different types like "character varying(255)", "numeric", "date" and "time without timezone". There are about 60 of the "character varying(255)" overall, with the other types interspersed.

When testing with indexes, 4 b-tree indexes were used, each including the first column and then distinctly 9 other columns.

A CSV record (row) template file was created with test data (corresponding to the table), and that was simply copied and appended over and over, with a record prefix, in order to create the test data file.
The following shell script basically does it (but very slowly). I was using a small C program to do similar, a lot faster.
In my case, N=2550000 produced about a 5GB CSV file.

file_out=data.csv; for i in {1..N}; do echo -n "$i," >> $file_out; cat sample_record.csv >> $file_out; done

One other thing I should mention is that between each test run, I cleared the OS page cache, as described here: https://linuxhint.com/clear_cache_linux/
That way, each COPY FROM is not taking advantage of any OS-cached data from a previous COPY FROM.

If your data is somehow significantly different and you want to (and can) share your script, then I can try it in my environment.

Regards,
Greg
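For anyone reproducing this setup, here is a sketch of the kind of small C generator described above (the original program was not shared, so the file names, the single-record template assumption, and the row count are all assumptions mirroring the shell loop):

#include <stdio.h>
#include <stdlib.h>

int
main(void)
{
	/* Paths and row count are assumptions for illustration. */
	FILE	   *in = fopen("sample_record.csv", "rb");
	FILE	   *out = fopen("data.csv", "w");
	char	   *template;
	long		len;

	if (in == NULL || out == NULL)
		return 1;

	/* Slurp the one-record template, including its trailing newline. */
	fseek(in, 0, SEEK_END);
	len = ftell(in);
	rewind(in);
	template = malloc(len + 1);
	if (template == NULL || fread(template, 1, len, in) != (size_t) len)
		return 1;
	template[len] = '\0';

	/* Emit "<row number>,<template>" N times, like the shell loop above. */
	for (long i = 1; i <= 2550000; i++)
		fprintf(out, "%ld,%s", i, template);

	fclose(in);
	fclose(out);
	free(template);
	return 0;
}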
On Tue, Sep 1, 2020 at 3:39 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> The following shell script basically does it (but very slowly). I was
> using a small C program to do similar, a lot faster.
> In my case, N=2550000 produced about a 5GB CSV file.
>
> One other thing I should mention is that between each test run, I
> cleared the OS page cache. That way, each COPY FROM is not taking
> advantage of any OS-cached data from a previous COPY FROM.
>

I will try with a similar test and check if I can reproduce.

> If your data is somehow significantly different and you want to (and
> can) share your script, then I can try it in my environment.
>

I have attached the scripts that I used for the test results I mentioned in my previous mail. create.sql has the table that I used; insert_data_gen.txt has the insert data generation scripts. I varied the count in insert_data_gen to generate csv files of 1GB, 2GB & 5GB, and varied the data to generate 1-char, 10-char & 100-char values for each column for the various tests. You can rename insert_data_gen.txt to insert_data_gen.sh and generate the csv file.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Wed, Sep 2, 2020 at 3:40 PM vignesh C <vignesh21@gmail.com> wrote:
>
> I have attached the scripts that I used for the test results I
> mentioned in my previous mail. create.sql has the table that I used;
> insert_data_gen.txt has the insert data generation scripts.
>

Hi Vignesh,

I used your script and table definition, multiplying the number of records to produce a 5GB and a 9.5GB CSV file. I got the following results:

(1) Postgres default settings, 5GB CSV (530000 rows):

Copy Type            Duration (s)  Load factor
===============================================
Normal Copy          132.197       -

Parallel Copy
(#workers)
1                    98.428        1.34
2                    52.753        2.51
3                    37.630        3.51
4                    33.554        3.94
5                    33.636        3.93
6                    33.821        3.91
7                    34.270        3.86
8                    34.465        3.84
9                    34.315        3.85
10                   33.543        3.94

(2) Postgres increased resources, 5GB CSV (530000 rows):

shared_buffers = 20% of RAM (total RAM = 376GB) = 76GB
work_mem = 10% of RAM = 38GB
maintenance_work_mem = 10% of RAM = 38GB
max_worker_processes = 16
max_parallel_workers = 16
checkpoint_timeout = 30min
max_wal_size = 2GB

Copy Type            Duration (s)  Load factor
===============================================
Normal Copy          131.835       -

Parallel Copy
(#workers)
1                    98.301        1.34
2                    53.261        2.48
3                    37.868        3.48
4                    34.224        3.85
5                    33.831        3.90
6                    34.229        3.85
7                    34.512        3.82
8                    34.303        3.84
9                    34.690        3.80
10                   34.479        3.82

(3) Postgres default settings, 9.5GB CSV (1000000 rows):

Copy Type            Duration (s)  Load factor
===============================================
Normal Copy          248.503       -

Parallel Copy
(#workers)
1                    185.724       1.34
2                    99.832        2.49
3                    70.560        3.52
4                    63.328        3.92
5                    63.182        3.93
6                    64.108        3.88
7                    64.131        3.87
8                    64.350        3.86
9                    64.293        3.87
10                   63.818        3.89

(4) Postgres increased resources, 9.5GB CSV (1000000 rows):

shared_buffers = 20% of RAM (total RAM = 376GB) = 76GB
work_mem = 10% of RAM = 38GB
maintenance_work_mem = 10% of RAM = 38GB
max_worker_processes = 16
max_parallel_workers = 16
checkpoint_timeout = 30min
max_wal_size = 2GB

Copy Type            Duration (s)  Load factor
===============================================
Normal Copy          248.647       -

Parallel Copy
(#workers)
1                    182.236       1.36
2                    92.814        2.68
3                    67.347        3.69
4                    63.839        3.89
5                    62.672        3.97
6                    63.873        3.89
7                    64.930        3.83
8                    63.885        3.89
9                    62.397        3.98
10                   64.477        3.86

So, as you found, with this particular table definition and data, 1 parallel worker always performs better than normal copy.
The different result obtained for this particular case seems to be caused by the following factors:
- different table definition (I used a variety of column types)
- amount of data per row (I used less data per row, so more rows per same-size data file)

As I previously observed, if the target table has no indexes, increasing resources beyond the default settings makes little difference to the performance.

Regards,
Greg Nancarrow
Fujitsu Australia
On Tue, Sep 1, 2020 at 3:39 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> Hi Vignesh,
>
> > Can you share with me the script you used to generate the data & the DDL of the table, so that it will help me check the scenario where you faced the problem.
>
> Unfortunately I can't directly share it (considered company IP), though having said that it's only doing something that is relatively simple and unremarkable, so I'd expect it to be much like what you are currently doing. I can describe it in general.
>
> The table being used contains 100 columns (as I pointed out earlier), with the first column of "bigserial" type, and the others of different types like "character varying(255)", "numeric", "date" and "time without timezone". There are about 60 of the "character varying(255)" columns overall, with the other types interspersed.
>

Thanks Greg for executing & sharing the results. I tried a similar test case to the one you suggested, but I was not able to reproduce the degradation scenario. If possible, can you run perf for the scenario with 1 worker & for non-parallel mode & share the perf results? We will then be able to find out which of the functions is consuming more time by comparing the perf reports.

Steps for running perf:
1) Get the postgres pid
2) perf record -a -g -p <above pid>
3) Run the copy command
4) Execute "perf report -g" once the copy finishes.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Fri, Sep 11, 2020 at 3:49 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> I couldn't use the original machine from which I obtained the previous results, but ended up using a 4-core CentOS7 VM, which showed a similar pattern in the performance results for this test case.
> I obtained the following results from loading a 2GB CSV file (1000000 rows, 4 indexes):
>
> Copy Type        Duration (s)   Load factor
> ===============================================
> Normal Copy      190.891        -
>
> Parallel Copy
> (#workers)
> 1                210.947        0.90
>

Hi Greg,

I tried to recreate the test case (attached) and I didn't find much difference with the custom postgresql.conf file.

Test case: 250000 tuples, 4 indexes (composite indexes with 10 columns), 3.7GB, 100 columns (as suggested by you, with all the varchar(255) columns having 255 characters), exec time in sec.

With custom postgresql.conf[1], removed and recreated the data directory after every run (I couldn't perform the OS page cache flush for some reason, so I chose to recreate the data directory instead, for testing purposes):
HEAD: 129.547, 128.624, 128.890
Patch: 0 workers - 130.213, 131.298, 130.555
Patch: 1 worker - 127.757, 125.560, 128.275

With default postgresql.conf, removed and recreated the data directory after every run:
HEAD: 138.276, 150.472, 153.304
Patch: 0 workers - 162.468, 149.423, 159.137
Patch: 1 worker - 136.055, 144.250, 137.916

Few questions:
1. Was the run performed with the default postgresql.conf file? If not, what are the changed configurations?
2. Are the readings for normal copy (190.891 sec, mentioned by you above) taken on HEAD or with the patch, 0 workers? How much is the runtime with your test case on HEAD (without patch) and with 0 workers (with patch)?
3. Was the run performed on a release build?
4. Were the readings taken over multiple runs (say 3 or 4 times)?

[1] - Postgres configuration used for the above testing:
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Attachment
Hi Bharath,

On Tue, Sep 15, 2020 at 11:49 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> Few questions:
> 1. Was the run performed with the default postgresql.conf file? If not, what are the changed configurations?

Yes, just default settings.

> 2. Are the readings for normal copy (190.891 sec, mentioned by you above) taken on HEAD or with the patch, 0 workers?

With patch.

> How much is the runtime with your test case on HEAD (without patch) and with 0 workers (with patch)?

TBH, I didn't test that. Looking at the changes, I wouldn't expect a degradation of performance for normal copy (you have tested that, right?).

> 3. Was the run performed on a release build?

For generating the perf data I sent (normal copy vs parallel copy with 1 worker), I used a debug build (-g -O0), as that is needed for generating all the relevant perf data for Postgres code. Previously I ran with a release build (-O2).

> 4. Were the readings taken over multiple runs (say 3 or 4 times)?

The readings I sent were from just one run (not averaged), but I did run the tests several times to verify the readings were representative of the pattern I was seeing.

Fortunately I have been given permission to share the exact table definition and data I used, so you can check the behaviour and timings on your own test machine. Please see the attachment.

You can create the table using the table.sql and index_4.sql definitions in the "sql" directory. The data.csv file (to be loaded by COPY) can be created with the included "dupdata" tool in the "input" directory, which you need to build, then run, specifying a suitable number of records and the path of the template record (see README). Obviously, the larger the number of records, the larger the file. The table can then be loaded using COPY with "format csv" (and "parallel N" if testing parallel copy).

Regards,
Greg Nancarrow
Fujitsu Australia
Attachment
Hi Vignesh,

I've spent some time today looking at your new set of patches, and I have some thoughts and queries which I would like to put here:

Why are these not part of the shared cstate structure?

SerializeString(pcxt, PARALLEL_COPY_KEY_NULL_PRINT, cstate->null_print);
SerializeString(pcxt, PARALLEL_COPY_KEY_DELIM, cstate->delim);
SerializeString(pcxt, PARALLEL_COPY_KEY_QUOTE, cstate->quote);
SerializeString(pcxt, PARALLEL_COPY_KEY_ESCAPE, cstate->escape);

I think in the refactoring patch we could replace all the cstate variables that would be shared between the leader and workers with a common structure which would be used even for a serial copy. Thoughts?

--

Have you tested your patch when encoding conversion is needed? If so, could you please point out the email that has the test results.

--

Apart from the above, I've noticed some cosmetic errors which I am sharing here:

+#define IsParallelCopy() (cstate->is_parallel)
+#define IsLeader() (cstate->pcdata->is_leader)

This doesn't look to be properly aligned.

--

+ shared_info_ptr = (ParallelCopyShmInfo *) shm_toc_allocate(pcxt->toc, sizeof(ParallelCopyShmInfo));
+ PopulateParallelCopyShmInfo(shared_info_ptr, full_transaction_id);

..

+ /* Store shared build state, for which we reserved space. */
+ shared_cstate = (SerializedParallelCopyState *)shm_toc_allocate(pcxt->toc, est_cstateshared);

In the first case, while typecasting you've added a space between the typename and the function, but that is missing in the second case. I think it would be good if you could make it consistent.

Same comment applies here as well:

+ pg_atomic_uint32 line_state; /* line state */
+ uint64 cur_lineno; /* line number for error messages */
+}ParallelCopyLineBoundary;

...

+ CommandId mycid; /* command id */
+ ParallelCopyLineBoundaries line_boundaries; /* line array */
+} ParallelCopyShmInfo;

There is no space between the closing brace and the structure name in the first case, but there is in the second one. So, again, this doesn't look consistent.

I could also find this type of inconsistency in comments. See below:

+/* It can hold upto 10000 record information for worker to process. RINGSIZE
+ * should be a multiple of WORKER_CHUNK_COUNT, as wrap around cases is currently
+ * not handled while selecting the WORKER_CHUNK_COUNT by the worker. */
+#define RINGSIZE (10 * 1000)

...

+/*
+ * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data
+ * block to process to avoid lock contention. Read RINGSIZE comments before
+ * changing this value.
+ */
+#define WORKER_CHUNK_COUNT 50

You may see these kinds of errors at other places as well if you scan through your patch.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Wed, Aug 19, 2020 at 11:51 AM vignesh C <vignesh21@gmail.com> wrote:
>
> Thanks Greg for reviewing the patch. Please find my thoughts for your comments.
>
> On Mon, Aug 17, 2020 at 9:44 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
> > Some further comments:
> >
> > (1) v3-0002-Framework-for-leader-worker-in-parallel-copy.patch
> >
> > +/*
> > + * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data
> > + * block to process to avoid lock contention. This value should be divisible by
> > + * RINGSIZE, as wrap around cases is currently not handled while selecting the
> > + * WORKER_CHUNK_COUNT by the worker.
> > + */
> > +#define WORKER_CHUNK_COUNT 50
> >
> > "This value should be divisible by RINGSIZE" is not a correct
> > statement (since obviously 50 is not divisible by 10000).
> > It should say something like "This value should evenly divide into
> > RINGSIZE", or "RINGSIZE should be a multiple of WORKER_CHUNK_COUNT".
> >
>
> Fixed. Changed it to "RINGSIZE should be a multiple of WORKER_CHUNK_COUNT".
>
> > (2) v3-0003-Allow-copy-from-command-to-process-data-from-file.patch
> >
> > (i)
> >
> > + /*
> > + * If the data is present in current block lineInfo. line_size
> > + * will be updated. If the data is spread across the blocks either
> >
> > Somehow a space has been put between "lineinfo." and "line_size".
> > It should be: "If the data is present in current block
> > lineInfo.line_size will be updated"
>
> Fixed, changed it to lineinfo->line_size.
>
> > (ii)
> >
> > > This is not possible because of pg_atomic_compare_exchange_u32, this
> > > will succeed only for one of the workers whose line_state is
> > > LINE_LEADER_POPULATED, for other workers it will fail. This is
> > > explained in detail above ParallelCopyLineBoundary.
> >
> > Yes, but prior to that call to pg_atomic_compare_exchange_u32(),
> > aren't you separately reading line_state and line_size, so that
> > between those reads, it may have transitioned from leader to another
> > worker, such that the read line state ("cur_line_state", being checked
> > in the if block) may not actually match what is now in the line_state
> > and/or the read line_size ("dataSize") doesn't actually correspond to
> > the read line state?
> >
> > (sorry, still not 100% convinced that the synchronization and checks
> > are safe in all cases)
> >
>
> I think you are describing the problem that could happen in the
> following case: when we read curr_line_state, the value was
> LINE_WORKER_PROCESSED or LINE_WORKER_PROCESSING. Then, in some cases,
> if the leader is very fast compared to the workers, the leader quickly
> populates one line and sets the state to LINE_LEADER_POPULATED, so the
> state changes to LINE_LEADER_POPULATED while we are checking
> curr_line_state. I feel this will not be a problem because the leader
> will populate & wait till some RING element is available to populate.
> In the meantime the worker has seen that the state is
> LINE_WORKER_PROCESSED or LINE_WORKER_PROCESSING (the previous state
> that it read); the worker has identified that this chunk was processed
> by some other worker, so it will move on and try to get the next
> available chunk & insert those records. It will keep continuing till it
> gets the next chunk to process. Eventually one of the workers will get
> this chunk and process it.
>
> > (3) v3-0006-Parallel-Copy-For-Binary-Format-Files.patch
> >
> > > raw_buf is not used in parallel copy, instead raw_buf will be pointing
> > > to shared memory data blocks. This memory was allocated as part of
> > > BeginCopyFrom, uptil this point we cannot be 100% sure as copy can be
> > > performed sequentially like in case max_worker_processes is not
> > > available, if it switches to sequential mode raw_buf will be used
> > > while performing copy operation. At this place we can safely free this
> > > memory that was allocated
> >
> > So the following code (which checks raw_buf, which still points to
> > memory that has been pfreed) is still valid?
> >
> > In the SetRawBufForLoad() function, which is called by CopyReadLineText():
> >
> > cur_data_blk_ptr = (cstate->raw_buf) ?
> > &pcshared_info->data_blocks[cur_block_pos] : NULL;
> >
> > The above code looks a bit dicey to me. I stepped over that line in
> > the debugger when I debugged an instance of Parallel Copy, so it
> > definitely gets executed.
> > It makes me wonder what other code could possibly be checking raw_buf
> > and using it in some way, when in fact what it points to has been
> > pfreed.
> >
> > Are you able to add the following line of code, or will it (somehow)
> > break logic that you are relying on?
> >
> > pfree(cstate->raw_buf);
> > cstate->raw_buf = NULL; <=== I suggest that this line is added
> >
>
> You are right, I have debugged & verified that it sets it to an invalid
> block, which is not expected. There are chances this would have caused
> some corruption on some machines. The suggested fix is required, and I
> have made it. I have moved this change to
> 0003-Allow-copy-from-command-to-process-data-from-file.patch, as
> 0006-Parallel-Copy-For-Binary-Format-Files is only for binary format
> parallel copy & that change is a common change for parallel copy.
>
> I have attached a new set of patches with the fixes.
> Thoughts?
>
> Regards,
> Vignesh
> EnterpriseDB: http://www.enterprisedb.com
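[Editor's note: to make the line_state handoff discussed in point (2)(ii) above concrete, here is a minimal sketch in C. It reuses the patch's struct and state names, but TryClaimLine and the surrounding logic are simplified illustrations, not actual patch code.]

#include "port/atomics.h"

/*
 * Sketch: a worker tries to claim a ring entry that the leader has
 * marked LINE_LEADER_POPULATED. Only one CAS can succeed, so exactly
 * one worker transitions the entry to LINE_WORKER_PROCESSING; losers
 * see the CAS fail (expected is overwritten with the current state)
 * and simply move on to the next ring position.
 */
static bool
TryClaimLine(ParallelCopyLineBoundary *lineInfo)
{
	uint32		expected = LINE_LEADER_POPULATED;

	return pg_atomic_compare_exchange_u32(&lineInfo->line_state,
										  &expected,
										  LINE_WORKER_PROCESSING);
}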
On Wed, Sep 16, 2020 at 1:20 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> Fortunately I have been given permission to share the exact table
> definition and data I used, so you can check the behaviour and timings
> on your own test machine.
>

Thanks Greg for the script. I ran your test case and I didn't observe any increase in exec time with 1 worker; indeed, we gained a few seconds going from 0 to 1 worker, as expected.

Execution time is in seconds. Each test case is executed 3 times on a release build. Each time the data directory is recreated.

Case 1: 1000000 rows, 2GB
With Patch, default configuration, 0 workers: 88.933, 92.261, 88.423
With Patch, default configuration, 1 worker: 73.825, 74.583, 72.678

With Patch, custom configuration, 0 workers: 76.191, 78.160, 78.822
With Patch, custom configuration, 1 worker: 61.289, 61.288, 60.573

Case 2: 2550000 rows, 5GB
With Patch, default configuration, 0 workers: 246.031, 188.323, 216.683
With Patch, default configuration, 1 worker: 156.299, 153.293, 170.307

With Patch, custom configuration, 0 workers: 197.234, 195.866, 196.049
With Patch, custom configuration, 1 worker: 157.173, 158.287, 157.090

[1] - The custom configuration is set up to ensure that no other processes influence the results. The postgresql.conf used:
shared_buffers = 40GB
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Thanks Ashutosh for your comments.

On Wed, Sep 16, 2020 at 6:36 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hi Vignesh,
>
> I've spent some time today looking at your new set of patches, and I have
> some thoughts and queries which I would like to put here:
>
> Why are these not part of the shared cstate structure?
>
> SerializeString(pcxt, PARALLEL_COPY_KEY_NULL_PRINT, cstate->null_print);
> SerializeString(pcxt, PARALLEL_COPY_KEY_DELIM, cstate->delim);
> SerializeString(pcxt, PARALLEL_COPY_KEY_QUOTE, cstate->quote);
> SerializeString(pcxt, PARALLEL_COPY_KEY_ESCAPE, cstate->escape);
>

I have used shared_cstate mainly to share the integer & bool data types from the leader to the worker process. The above variables are of char* data type, so I will not be able to share them the way I could for the integer types. So I preferred to send these as separate keys to the worker. Thoughts?

> I think in the refactoring patch we could replace all the cstate
> variables that would be shared between the leader and workers with a
> common structure which would be used even for a serial copy. Thoughts?
>

Currently we are using shared_cstate only to share the integer & bool data types from the leader to the workers. Once a worker retrieves the shared data for the integer & bool data types, it copies them to cstate. I preferred this way because only for integer & bool do we retrieve into shared_cstate & copy to cstate; for the rest of the members we are anyway directly copying back to cstate. Thoughts?

> Have you tested your patch when encoding conversion is needed? If so,
> could you please point out the email that has the test results.
>

We have not yet done encoding testing; we will do it and post the results separately in the coming days.

> Apart from the above, I've noticed some cosmetic errors which I am sharing here:
>
> +#define IsParallelCopy() (cstate->is_parallel)
> +#define IsLeader() (cstate->pcdata->is_leader)
>
> This doesn't look to be properly aligned.
>

Fixed.

> + shared_info_ptr = (ParallelCopyShmInfo *) shm_toc_allocate(pcxt->toc, sizeof(ParallelCopyShmInfo));
> + PopulateParallelCopyShmInfo(shared_info_ptr, full_transaction_id);
>
> ..
>
> + /* Store shared build state, for which we reserved space. */
> + shared_cstate = (SerializedParallelCopyState *)shm_toc_allocate(pcxt->toc, est_cstateshared);
>
> In the first case, while typecasting you've added a space between the
> typename and the function, but that is missing in the second case. I
> think it would be good if you could make it consistent.
>

Fixed.

> Same comment applies here as well:
>
> + pg_atomic_uint32 line_state; /* line state */
> + uint64 cur_lineno; /* line number for error messages */
> +}ParallelCopyLineBoundary;
>
> ...
>
> + CommandId mycid; /* command id */
> + ParallelCopyLineBoundaries line_boundaries; /* line array */
> +} ParallelCopyShmInfo;
>
> There is no space between the closing brace and the structure name in
> the first case, but there is in the second one. So, again, this doesn't
> look consistent.
>

Fixed.

> I could also find this type of inconsistency in comments. See below:
>
> +/* It can hold upto 10000 record information for worker to process. RINGSIZE
> + * should be a multiple of WORKER_CHUNK_COUNT, as wrap around cases is currently
> + * not handled while selecting the WORKER_CHUNK_COUNT by the worker. */
> +#define RINGSIZE (10 * 1000)
>
> ...
>
> +/*
> + * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data
> + * block to process to avoid lock contention. Read RINGSIZE comments before
> + * changing this value.
> + */
> +#define WORKER_CHUNK_COUNT 50
>
> You may see these kinds of errors at other places as well if you scan
> through your patch.

Fixed.

Please find attached the v5 patch set, which has the fixes for the same.
Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Attachment
- v5-0001-Copy-code-readjustment-to-support-parallel-copy.patch
- v5-0002-Framework-for-leader-worker-in-parallel-copy.patch
- v5-0003-Allow-copy-from-command-to-process-data-from-file.patch
- v5-0004-Documentation-for-parallel-copy.patch
- v5-0005-Tests-for-parallel-copy.patch
- v5-0006-Parallel-Copy-For-Binary-Format-Files.patch
On Thu, Sep 17, 2020 at 11:06 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Wed, Sep 16, 2020 at 1:20 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
> >
> > Fortunately I have been given permission to share the exact table
> > definition and data I used, so you can check the behaviour and timings
> > on your own test machine.
> >
>
> Thanks Greg for the script. I ran your test case and I didn't observe
> any increase in exec time with 1 worker, indeed, we have benefitted a
> few seconds from 0 to 1 worker as expected.
>
> Execution time is in seconds. Each test case is executed 3 times on
> release build. Each time the data directory is recreated.
>
> Case 1: 1000000 rows, 2GB
> With Patch, default configuration, 0 worker: 88.933, 92.261, 88.423
> With Patch, default configuration, 1 worker: 73.825, 74.583, 72.678
>
> With Patch, custom configuration, 0 worker: 76.191, 78.160, 78.822
> With Patch, custom configuration, 1 worker: 61.289, 61.288, 60.573
>
> Case 2: 2550000 rows, 5GB
> With Patch, default configuration, 0 worker: 246.031, 188.323, 216.683
> With Patch, default configuration, 1 worker: 156.299, 153.293, 170.307
>
> With Patch, custom configuration, 0 worker: 197.234, 195.866, 196.049
> With Patch, custom configuration, 1 worker: 157.173, 158.287, 157.090
>
Hi Greg,
If you still observe the issue in your testing environment, I'm attaching a testing patch (that applies on top of the latest parallel copy patch set, i.e. v5 1 to 6) to capture various timings, such as total copy time in the leader and workers, index and table insertion time, and leader and worker waiting time. These logs are shown in the server log file.
Few things to follow before testing:
1. Is the table being dropped/truncated after the test with 0 workers and before running with 1 worker? If not, then the index insertion time would increase [1] (for me it is increasing by 10 sec). This is expected because the 1st time the index will be built in a bottom-up manner (from leaves to root), but the 2nd time it has to search and insert at the proper leaf and inner B+Tree nodes.
2. If possible, can you also run with the custom postgresql.conf settings [2] along with the defaults? Just to ensure that other bg processes such as checkpointer, autovacuum, bgwriter etc. don't affect our test case. For instance, with the default postgresql.conf file, it looks like checkpointing [3] is happening frequently; could you please let us know if that happens at your end?
3. Could you please run the test case 3 times at least? Just to ensure the consistency of the issue.
4. I ran the tests in a performance test system where no other user processes(except system processes) are running. Is it possible for you to do the same?
Please capture and share the timing logs with us.
Here's a snapshot of how the added timings show up in the logs (I captured this with your test case, case 1: 1000000 rows, 2GB, custom postgresql.conf file settings [2]).
with 0 workers:
2020-09-22 10:49:27.508 BST [163910] LOG: totaltableinsertiontime = 24072.034 ms
2020-09-22 10:49:27.508 BST [163910] LOG: totalindexinsertiontime = 60.682 ms
2020-09-22 10:49:27.508 BST [163910] LOG: totalcopytime = 59664.594 ms
with 1 worker:
2020-09-22 10:53:58.409 BST [163947] LOG: totalcopyworkerwaitingtime = 59.815 ms
2020-09-22 10:53:58.409 BST [163947] LOG: totaltableinsertiontime = 23585.881 ms
2020-09-22 10:53:58.409 BST [163947] LOG: totalindexinsertiontime = 30.946 ms
2020-09-22 10:53:58.409 BST [163947] LOG: totalcopytimeworker = 47047.956 ms
2020-09-22 10:53:58.429 BST [163946] LOG: totalcopyleaderwaitingtime = 26746.744 ms
2020-09-22 10:53:58.429 BST [163946] LOG: totalcopytime = 47150.002 ms
[1]
0 worker:
LOG: totaltableinsertiontime = 25491.881 ms
LOG: totalindexinsertiontime = 14136.104 ms
LOG: totalcopytime = 75606.858 ms
table and indexes are not dropped
1 worker:
LOG: totalcopyworkerwaitingtime = 64.582 ms
LOG: totaltableinsertiontime = 21360.875 ms
LOG: totalindexinsertiontime = 24843.570 ms
LOG: totalcopytimeworker = 69837.162 ms
LOG: totalcopyleaderwaitingtime = 49548.441 ms
LOG: totalcopytime = 69997.778 ms
[2]
custom postgresql.conf configuration:
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off
[3]
LOG: checkpoints are occurring too frequently (14 seconds apart)
HINT: Consider increasing the configuration parameter "max_wal_size".
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Attachment
Hi Bharath,

> Few things to follow before testing:
> 1. Is the table being dropped/truncated after the test with 0 workers and before running with 1 worker? If not, then the index insertion time would increase [1] (for me it is increasing by 10 sec). This is expected because the 1st time the index will be built in a bottom-up manner (from leaves to root), but the 2nd time it has to search and insert at the proper leaf and inner B+Tree nodes.

Yes, it's being truncated before running each and every COPY.

> 2. If possible, can you also run with the custom postgresql.conf settings [2] along with the defaults? Just to ensure that other bg processes such as checkpointer, autovacuum, bgwriter etc. don't affect our test case. For instance, with the default postgresql.conf file, it looks like checkpointing [3] is happening frequently; could you please let us know if that happens at your end?

Yes, I have run with both the default and your custom settings. With the default settings, I can confirm that checkpointing is happening frequently with the tests I've run here.

> 3. Could you please run the test case 3 times at least? Just to ensure the consistency of the issue.

Yes, I have run it 4 times. There seems to be a performance hit (whether normal copy or parallel-1 copy) on the first COPY run on a freshly created database. After that, results are consistent.

> 4. I ran the tests in a performance test system where no other user processes (except system processes) are running. Is it possible for you to do the same?
>
> Please capture and share the timing logs with us.

Yes, I have ensured the system is as idle as possible prior to testing.

I have attached the test results obtained after building with your Parallel Copy patch and testing patch applied (HEAD at 733fa9aa51c526582f100aa0d375e0eb9a6bce8b). The test results show that Parallel COPY with 1 worker performs better than normal COPY in the test scenarios run. There is a performance hit (regardless of COPY type) on the very first COPY run on a freshly-created database.

I ran the test case 4 times, and also in reverse order, with truncate run before each COPY (outputs and logs named xxxx_0_1 run normal COPY then parallel COPY, and those named xxxx_1_0 run parallel COPY then normal COPY). Please refer to the attached results.

Regards,
Greg
Attachment
Thanks Greg for the testing.
On Thu, Sep 24, 2020 at 8:27 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> > 3. Could you please run the test case 3 times at least? Just to ensure the consistency of the issue.
>
> Yes, have run 4 times. Seems to be a performance hit (whether normal
> copy or parallel-1 copy) on the first COPY run on a freshly created
> database. After that, results are consistent.
>
From the logs, I see that it is happening only with the default postgresql.conf, and that there's inconsistency in the table insertion times, especially from the 1st run to the 2nd. Also, the table insertion time varies more. This is expected with the default postgresql.conf because of interference from the background processes. That's the reason we usually run with a custom configuration, to measure the performance gain correctly.
br_default_0_1.log:
2020-09-23 22:32:36.944 JST [112616] LOG: totaltableinsertiontime = 155068.244 ms
2020-09-23 22:33:57.615 JST [11426] LOG: totaltableinsertiontime = 42096.275 ms
2020-09-23 22:37:39.192 JST [43097] LOG: totaltableinsertiontime = 29135.262 ms
2020-09-23 22:38:56.389 JST [54205] LOG: totaltableinsertiontime = 38953.912 ms
2020-09-23 22:40:27.573 JST [66485] LOG: totaltableinsertiontime = 27895.326 ms
2020-09-23 22:41:34.948 JST [77523] LOG: totaltableinsertiontime = 28929.642 ms
2020-09-23 22:43:18.938 JST [89857] LOG: totaltableinsertiontime = 30625.015 ms
2020-09-23 22:44:21.938 JST [101372] LOG: totaltableinsertiontime = 24624.045 ms
br_default_1_0.log:
2020-09-24 11:12:14.989 JST [56146] LOG: totaltableinsertiontime = 192068.350 ms
2020-09-24 11:13:38.228 JST [88455] LOG: totaltableinsertiontime = 30999.942 ms
2020-09-24 11:15:50.381 JST [108935] LOG: totaltableinsertiontime = 31673.204 ms
2020-09-24 11:17:14.260 JST [118541] LOG: totaltableinsertiontime = 31367.027 ms
2020-09-24 11:20:18.975 JST [17270] LOG: totaltableinsertiontime = 26858.924 ms
2020-09-24 11:22:17.822 JST [26852] LOG: totaltableinsertiontime = 66531.442 ms
2020-09-24 11:24:09.221 JST [47971] LOG: totaltableinsertiontime = 38943.384 ms
2020-09-24 11:25:30.955 JST [58849] LOG: totaltableinsertiontime = 28286.634 ms
br_custom_0_1.log:
2020-09-24 10:29:44.956 JST [110477] LOG: totaltableinsertiontime = 20207.928 ms
2020-09-24 10:30:49.570 JST [120568] LOG: totaltableinsertiontime = 23360.006 ms
2020-09-24 10:32:31.659 JST [2753] LOG: totaltableinsertiontime = 19837.588 ms
2020-09-24 10:35:49.245 JST [31118] LOG: totaltableinsertiontime = 21759.253 ms
2020-09-24 10:36:54.834 JST [41763] LOG: totaltableinsertiontime = 23547.323 ms
2020-09-24 10:38:53.507 JST [56779] LOG: totaltableinsertiontime = 21543.984 ms
2020-09-24 10:39:58.713 JST [67489] LOG: totaltableinsertiontime = 25254.563 ms
br_custom_1_0.log:
2020-09-24 10:49:03.242 JST [15308] LOG: totaltableinsertiontime = 16541.201 ms
2020-09-24 10:50:11.848 JST [23324] LOG: totaltableinsertiontime = 15076.577 ms
2020-09-24 10:51:24.497 JST [35394] LOG: totaltableinsertiontime = 16400.777 ms
2020-09-24 10:52:32.354 JST [42953] LOG: totaltableinsertiontime = 15591.051 ms
2020-09-24 10:54:30.327 JST [61136] LOG: totaltableinsertiontime = 16700.954 ms
2020-09-24 10:55:38.377 JST [68719] LOG: totaltableinsertiontime = 15435.150 ms
2020-09-24 10:57:08.927 JST [83335] LOG: totaltableinsertiontime = 17133.251 ms
2020-09-24 10:58:17.420 JST [90905] LOG: totaltableinsertiontime = 15352.753 ms
>
> Test results show that Parallel COPY with 1 worker is performing
> better than normal COPY in the test scenarios run.
>
Good to know :)
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
>
> > Have you tested your patch when encoding conversion is needed? If so,
> > could you please point out the email that has the test results.
> >
>
> We have not yet done encoding testing, we will do and post the results
> separately in the coming days.
>
Hi Ashutosh,
I ran the tests ensuring pg_server_to_any() gets called from copy.c. I specified the encoding option of the COPY command, with the client and server encodings being UTF-8.
Tests were performed with the custom postgresql.conf [1], 10 million rows, 5.2GB data. The results are triplets of the form (exec time in sec, number of workers, gain):
Use case 1: 2 indexes on integer columns, 1 index on text column
(1174.395, 0, 1X), (1127.792, 1, 1.04X), (644.260, 2, 1.82X), (341.284, 4, 3.43X), (204.423, 8, 5.74X), (140.692, 16, 8.34X), (129.843, 20, 9.04X), (134.511, 30, 8.72X)
Use case 2: 1 gist index on text column
(811.412, 0, 1X), (772.203, 1, 1.05X), (437.364, 2, 1.85X), (263.575, 4, 3.08X), (175.135, 8, 4.63X), (155.355, 16, 5.22X), (178.704, 20, 4.54X), (199.402, 30, 4.06X)
Use case 3: 3 indexes on integer columns
(220.680, 0, 1X), (185.096, 1, 1.19X), (134.811, 2, 1.64X), (114.585, 4, 1.92X), (107.707, 8, 2.05X), (101.253, 16, 2.18X), (100.749, 20, 2.19X), (100.656, 30, 2.19X)
The results are similar to our earlier runs[2].
[1]
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off
[2]
https://www.postgresql.org/message-id/CALDaNm13zK%3DJXfZWqZJsm3%2B2yagYDJc%3DeJBgE4i77-4PPNj7vw%40mail.gmail.com
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Sep 24, 2020 at 3:00 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> > > Have you tested your patch when encoding conversion is needed? If so,
> > > could you please point out the email that has the test results.
> > >
> >
> > We have not yet done encoding testing; we will do it and post the results
> > separately in the coming days.
> >
>
> Hi Ashutosh,
>
> I ran the tests ensuring pg_server_to_any() gets called from copy.c. I specified the encoding option of the COPY command, with the client and server encodings being UTF-8.
>

Thanks Bharath for the testing. The results look impressive.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
On Wed, Jul 22, 2020 at 7:48 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Tue, Jul 21, 2020 at 3:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Review comments:
> > ===================
> >
> > 0001-Copy-code-readjustment-to-support-parallel-copy
> > 1.
> > @@ -807,8 +835,11 @@ CopyLoadRawBuf(CopyState cstate)
> > else
> > nbytes = 0; /* no data need be saved */
> >
> > + if (cstate->copy_dest == COPY_NEW_FE)
> > + minread = RAW_BUF_SIZE - nbytes;
> > +
> > inbytes = CopyGetData(cstate, cstate->raw_buf + nbytes,
> > - 1, RAW_BUF_SIZE - nbytes);
> > + minread, RAW_BUF_SIZE - nbytes);
> >
> > No comment to explain why this change is done?
> >
> > 0002-Framework-for-leader-worker-in-parallel-copy
>
> Currently CopyGetData copies a lesser amount of data to the buffer even
> though space is available in the buffer, because minread was passed as 1
> to CopyGetData. Because of this there are frequent calls to CopyGetData
> for fetching the data. In this case it will load only some data, due to
> the below check:
> while (maxread > 0 && bytesread < minread && !cstate->reached_eof)
> After reading some data, bytesread will be greater than minread (which is
> passed as 1) and it returns with a lesser amount of data, even though
> there is some space.
> This change is required for the parallel copy feature, as each time we
> get a new DSM data block, which is of 64K size, and copy the data. If we
> copy less data into the DSM data blocks we might end up consuming all the
> DSM data blocks.

Why can't we reuse the DSM block which has unfilled space?

> I felt this issue can be fixed as part of HEAD. Have posted a separate
> thread [1] for this. I'm planning to remove that change once it gets
> committed. Can that go as a separate patch or should we include it here?
> [1] - https://www.postgresql.org/message-id/CALDaNm0v4CjmvSnftYnx_9pOS_dKRG%3DO3NnBgJsQmi0KipvLog%40mail.gmail.com

I am convinced by the reason given by Kyotaro-San in that other thread [1] and by the performance data shown by Peter that this can't be an independent improvement and rather, in some cases, it can do harm. Now, if you need it for the parallel-copy path then we can change it specifically for the parallel-copy code path, but I don't understand your reason completely.

> > 2. ..
> > + */
> > +typedef struct ParallelCopyLineBoundary
> >
> > Are we doing all this state management to avoid using locks while
> > processing lines? If so, I think we can use either spinlock or LWLock
> > to keep the main patch simple and then provide a later patch to make
> > it lock-less. This will allow us to first focus on the main design of
> > the patch rather than trying to make this datastructure processing
> > lock-less in the best possible way.
> >
>
> The steps will be more or less the same if we use a spinlock too. Steps
> 1, 3 & 4 will be common; we would have to use lock & unlock instead of
> steps 2 & 5. I feel we can retain the current implementation.

I'll study this in detail and let you know my opinion on the same, but in the meantime, I don't follow one part of this comment: "If they don't follow this order the worker might process wrong line_size and leader might populate the information which worker has not yet processed or in the process of processing."

Do you want to say that the leader might overwrite some information which the worker hasn't read yet? If so, it is not clear from the comment.

Another minor point about this comment:

+ * ParallelCopyLineBoundary is common data structure between leader & worker,
+ * Leader process will be populating data block, data block offset & the size of

I think there should be a full stop after "worker" instead of a comma.

> > 6.
> > In function BeginParallelCopy(), you need to keep a provision to
> > collect wal_usage and buf_usage stats. See _bt_begin_parallel for
> > reference. Those will be required for pg_stat_statements.
> >
>
> Fixed.

How did you ensure that this is fixed? Have you tested it, and if so, please share the test. I see a basic problem with your fix.

+ /* Report WAL/buffer usage during parallel execution */
+ bufferusage = shm_toc_lookup(toc, PARALLEL_COPY_BUFFER_USAGE, false);
+ walusage = shm_toc_lookup(toc, PARALLEL_COPY_WAL_USAGE, false);
+ InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
+ &walusage[ParallelWorkerNumber]);

You need to call InstrStartParallelQuery() before the actual operation starts; without that, the stats won't be accurate. Also, after calling WaitForParallelWorkersToFinish(), you need to accumulate the stats collected from the workers, which you have neither done nor is possible with the current code in your patch, because you haven't made any provision to capture them in BeginParallelCopy. I suggest you look into lazy_parallel_vacuum_indexes() and begin_parallel_vacuum() to understand how the buffer/wal usage stats are accumulated. Also, please test this functionality using pg_stat_statements.

> > 0003-Allow-copy-from-command-to-process-data-from-file-ST
> > 10.
> > In the commit message, you have written "The leader does not
> > participate in the insertion of data, leaders only responsibility will
> > be to identify the lines as fast as possible for the workers to do the
> > actual copy operation. The leader waits till all the lines populated
> > are processed by the workers and exits."
> >
> > I think you should also mention that we have chosen this design based
> > on the reason "that everything stalls if the leader doesn't accept
> > further input data, as well as when there are no available splitted
> > chunks so it doesn't seem like a good idea to have the leader do other
> > work. This is backed by the performance data where we have seen that
> > with 1 worker there is just a 5-10% (or whatever percentage difference
> > you have seen) performance difference)".
>
> Fixed.

Make it one paragraph, starting from "The leader does not participate in the insertion of data .... just a 5-10% performance difference". Right now the two parts look a bit disconnected.

Few additional comments:
======================

v5-0001-Copy-code-readjustment-to-support-parallel-copy
---------------------------------------------------------------------------------
1.
+/*
+ * CLEAR_EOL_LINE - Wrapper for clearing EOL.
+ */
+#define CLEAR_EOL_LINE() \
+if (!result && !IsHeaderLine()) \
+ ClearEOLFromCopiedData(cstate, cstate->line_buf.data, \
+ cstate->line_buf.len, \
+ &cstate->line_buf.len) \

I don't like this macro. I think it is sufficient to move the common code to be called from the parallel and non-parallel paths into ClearEOLFromCopiedData, and the other checks can be done in place. I think having macros for such a thing makes the code less readable.

2.
-
+static void PopulateCommonCstateInfo(CopyState cstate, TupleDesc tup_desc,
+ List *attnamelist);

Spurious line removal.

v5-0002-Framework-for-leader-worker-in-parallel-copy
---------------------------------------------------------------------------
3.
+ FullTransactionId full_transaction_id; /* xid for copy from statement */
+ CommandId mycid; /* command id */
+ ParallelCopyLineBoundaries line_boundaries; /* line array */
+} ParallelCopyShmInfo;

We already serialize FullTransactionId and CommandId via InitializeParallelDSM->SerializeTransactionState. Can't we reuse that? I think the recent Parallel Insert patch has also done something for this [2], so you can refer to that if you want.

v5-0004-Documentation-for-parallel-copy
-----------------------------------------------------------
1.
Perform <command>COPY FROM</command> in parallel using <replaceable
+ class="parameter"> integer</replaceable> background workers.

No need for a space before "integer".

[1] - https://www.postgresql.org/message-id/20200911.155804.359271394064499501.horikyota.ntt%40gmail.com
[2] - https://www.postgresql.org/message-id/CAJcOf-fn1nhEtaU91NvRuA3EbvbJGACMd4_c%2BUu3XU5VMv37Aw%40mail.gmail.com

--
With Regards,
Amit Kapila.
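[Editor's note: for reference, a sketch of the accumulation pattern Amit describes, following lazy_parallel_vacuum_indexes()/begin_parallel_vacuum(). The PARALLEL_COPY_* key names appear in the patch fragment quoted above; the surrounding fragments below are illustrative, not actual patch code.]

BufferUsage *bufferusage;
WalUsage   *walusage;
int			i;

/* Leader, in BeginParallelCopy(), before InitializeParallelDSM(): */
shm_toc_estimate_chunk(&pcxt->estimator,
					   mul_size(sizeof(BufferUsage), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
shm_toc_estimate_chunk(&pcxt->estimator,
					   mul_size(sizeof(WalUsage), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);

/* Leader, after InitializeParallelDSM(): reserve per-worker slots. */
bufferusage = shm_toc_allocate(pcxt->toc,
							   mul_size(sizeof(BufferUsage), pcxt->nworkers));
shm_toc_insert(pcxt->toc, PARALLEL_COPY_BUFFER_USAGE, bufferusage);
walusage = shm_toc_allocate(pcxt->toc,
							mul_size(sizeof(WalUsage), pcxt->nworkers));
shm_toc_insert(pcxt->toc, PARALLEL_COPY_WAL_USAGE, walusage);

/* Worker: bracket the actual copy work so the deltas are captured. */
InstrStartParallelQuery();
/* ... perform the copy ... */
InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
					  &walusage[ParallelWorkerNumber]);

/* Leader, after WaitForParallelWorkersToFinish(): fold stats back in. */
for (i = 0; i < pcxt->nworkers_launched; i++)
	InstrAccumParallelQuery(&bufferusage[i], &walusage[i]);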
On Tue, Sep 22, 2020 at 2:44 PM vignesh C <vignesh21@gmail.com> wrote:
>
> Thanks Ashutosh for your comments.
>
> On Wed, Sep 16, 2020 at 6:36 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> >
> > Hi Vignesh,
> >
> > I've spent some time today looking at your new set of patches, and I have
> > some thoughts and queries which I would like to put here:
> >
> > Why are these not part of the shared cstate structure?
> >
> > SerializeString(pcxt, PARALLEL_COPY_KEY_NULL_PRINT, cstate->null_print);
> > SerializeString(pcxt, PARALLEL_COPY_KEY_DELIM, cstate->delim);
> > SerializeString(pcxt, PARALLEL_COPY_KEY_QUOTE, cstate->quote);
> > SerializeString(pcxt, PARALLEL_COPY_KEY_ESCAPE, cstate->escape);
> >
>
> I have used shared_cstate mainly to share the integer & bool data types
> from the leader to the worker process. The above variables are of char*
> data type, so I will not be able to share them the way I could for the
> integer types. So I preferred to send these as separate keys to the
> worker. Thoughts?
>

I think the way you have written will work, but if we go with Ashutosh's proposal it will look elegant, and in the future, if we need to share more strings as part of the cstate structure, that would be easier. You can probably refer to EstimateParamListSpace, SerializeParamList, and RestoreParamList to see how we can share different types of data in one key.

--
With Regards,
Amit Kapila.
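[Editor's note: to illustrate the ParamList-style approach Amit points to, a sketch of packing several length-prefixed strings into a single DSM key. The helper names are hypothetical, not patch code, and this assumes all four strings are non-NULL.]

/*
 * Sketch: pack each string as <int length including NUL><bytes>, so
 * one shm_toc key can carry all of them and the worker can walk the
 * buffer back out. Mirrors the SerializeParamList idea in miniature.
 */
static Size
EstimateCstateStringSpace(CopyState cstate)
{
	Size		size = 0;

	size = add_size(size, sizeof(int) + strlen(cstate->null_print) + 1);
	size = add_size(size, sizeof(int) + strlen(cstate->delim) + 1);
	size = add_size(size, sizeof(int) + strlen(cstate->quote) + 1);
	size = add_size(size, sizeof(int) + strlen(cstate->escape) + 1);
	return size;
}

static char *
SerializeOneString(char *ptr, const char *str)
{
	int			len = strlen(str) + 1;

	memcpy(ptr, &len, sizeof(int));
	ptr += sizeof(int);
	memcpy(ptr, str, len);
	return ptr + len;
}

static char *
RestoreOneString(char *ptr, char **str)
{
	int			len;

	memcpy(&len, ptr, sizeof(int));
	ptr += sizeof(int);
	*str = pstrdup(ptr);		/* copy out of the shared segment */
	return ptr + len;
}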
On Mon, Sep 28, 2020 at 3:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Sep 22, 2020 at 2:44 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > I have used shared_cstate mainly to share the integer & bool data types
> > from the leader to the worker process. The above variables are of char*
> > data type, so I will not be able to share them the way I could for the
> > integer types. So I preferred to send these as separate keys to the
> > worker. Thoughts?
> >
>
> I think the way you have written will work, but if we go with
> Ashutosh's proposal it will look elegant, and in the future, if we need
> to share more strings as part of the cstate structure, that would be
> easier. You can probably refer to EstimateParamListSpace,
> SerializeParamList, and RestoreParamList to see how we can share
> different types of data in one key.
>

Yeah. And in addition to that, it will also reduce the number of DSM keys that we need to maintain.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Hi Vignesh and Bharath,

It seems like the Parallel Copy patch is regarding RI_TRIGGER_PK as parallel-unsafe. Can you explain why this is?

Regards,
Greg Nancarrow
Fujitsu Australia
On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Few additional comments:
> ======================

Some more comments:

v5-0002-Framework-for-leader-worker-in-parallel-copy
===========================================
1.
These values
+ * help in handover of multiple records with significant size of data to be
+ * processed by each of the workers to make sure there is no context switch & the
+ * work is fairly distributed among the workers.

How about writing it as: "These values help in the handover of multiple records with the significant size of data to be processed by each of the workers. This also ensures there is no context switch and the work is fairly distributed among the workers."

2. Can we keep WORKER_CHUNK_COUNT, MAX_BLOCKS_COUNT, and RINGSIZE as power-of-two? Say WORKER_CHUNK_COUNT as 64, MAX_BLOCK_COUNT as 1024, and accordingly choose RINGSIZE. We do it that way at many places. I think it can sometimes help in faster processing due to cache size requirements, and in this case, I don't see a reason why we can't choose these values to be power-of-two. If you agree with this change then also do some performance testing after this change?

3.
+ bool curr_blk_completed;
+ char data[DATA_BLOCK_SIZE]; /* data read from file */
+ uint8 skip_bytes;
+} ParallelCopyDataBlock;

Is there a reason to keep skip_bytes after data? Normally the variable-size data is at the end of the structure. Also, there is no comment explaining the purpose of skip_bytes.

4.
+ * Copy data block information.
+ * ParallelCopyDataBlock's will be created in DSM. Data read from file will be
+ * copied in these DSM data blocks. The leader process identifies the records
+ * and the record information will be shared to the workers. The workers will
+ * insert the records into the table. There can be one or more number of records
+ * in each of the data block based on the record size.
+ */
+typedef struct ParallelCopyDataBlock

Keep one empty line after the description line like below. I also suggest a minor tweak in the above sentence, as follows:

 * Copy data block information.
 *
 * These data blocks are created in DSM. Data read ...

Try to follow a similar format in the other comments as well.

5. I think it is better to move the parallelism-related code to a new file (we can name it copyParallel.c or something like that).

6. copy.c(1648,25): warning C4133: 'function': incompatible types - from 'ParallelCopyLineState *' to 'uint32 *'

Getting the above compilation warning on Windows.

v5-0003-Allow-copy-from-command-to-process-data-from-file
==================================================
1.
@@ -4294,7 +5047,7 @@ BeginCopyFrom(ParseState *pstate,
 * only in text mode.
 */
 initStringInfo(&cstate->attribute_buf);
- cstate->raw_buf = (char *) palloc(RAW_BUF_SIZE + 1);
+ cstate->raw_buf = (IsParallelCopy()) ? NULL : (char *) palloc(RAW_BUF_SIZE + 1);

Is there any way IsParallelCopy can be true by this time? AFAICS, we do everything about parallelism after this. If you want to save this allocation then we need to move this after we determine whether parallelism can be used or not, and accordingly the below code in the patch needs to be changed.

* ParallelCopyFrom - parallel copy leader's functionality.
 *
 * Leader executes the before statement for before statement trigger, if before
@@ -1110,8 +1547,302 @@ ParallelCopyFrom(CopyState cstate)
 ParallelCopyShmInfo *pcshared_info = cstate->pcdata->pcshared_info;
 ereport(DEBUG1, (errmsg("Running parallel copy leader")));
+ /* raw_buf is not used in parallel copy, instead data blocks are used.*/
+ pfree(cstate->raw_buf);
+ cstate->raw_buf = NULL;

Is there anything else the allocation of which depends on parallelism?

2.
+static pg_attribute_always_inline bool
+IsParallelCopyAllowed(CopyState cstate)
+{
+ /* Parallel copy not allowed for frontend (2.0 protocol) & binary option. */
+ if ((cstate->copy_dest == COPY_OLD_FE) || cstate->binary)
+ return false;
+
+ /* Check if copy is into foreign table or temporary table. */
+ if (cstate->rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE ||
+ RelationUsesLocalBuffers(cstate->rel))
+ return false;
+
+ /* Check if trigger function is parallel safe. */
+ if (cstate->rel->trigdesc != NULL &&
+ !IsTriggerFunctionParallelSafe(cstate->rel->trigdesc))
+ return false;
+
+ /*
+ * Check if there is after statement or instead of trigger or transition
+ * table triggers.
+ */
+ if (cstate->rel->trigdesc != NULL &&
+ (cstate->rel->trigdesc->trig_insert_after_statement ||
+ cstate->rel->trigdesc->trig_insert_instead_row ||
+ cstate->rel->trigdesc->trig_insert_new_table))
+ return false;
+
+ /* Check if the volatile expressions are parallel safe, if present any. */
+ if (!CheckExprParallelSafety(cstate))
+ return false;
+
+ /* Check if the insertion mode is single. */
+ if (FindInsertMethod(cstate) == CIM_SINGLE)
+ return false;
+
+ return true;
+}

In the comments, we should write why parallelism is not allowed for a particular case. The cases where a parallel-unsafe clause is involved are okay, but it is not clear from the comments why it is not allowed in the other cases.

3.
+ ParallelCopyShmInfo *pcshared_info = cstate->pcdata->pcshared_info;
+ ParallelCopyLineBoundary *lineInfo;
+ uint32 line_first_block = pcshared_info->cur_block_pos;
+ line_pos = UpdateBlockInLineInfo(cstate,
+ line_first_block,
+ cstate->raw_buf_index, -1,
+ LINE_LEADER_POPULATING);
+ lineInfo = &pcshared_info->line_boundaries.ring[line_pos];
+ elog(DEBUG1, "[Leader] Adding - block:%d, offset:%d, line position:%d",
+ line_first_block, lineInfo->start_offset, line_pos);

Can we take all the code here inside the function UpdateBlockInLineInfo? I see that it is called from one other place, but I guess most of the surrounding code there can also be moved inside the function. Can we change the name of the function to UpdateSharedLineInfo or something like that and remove the inline marking from it? I am not sure we want to inline such big functions. If it makes a difference in performance then we can probably consider it.

4.
EndLineParallelCopy()
{
..
+ /* Update line size. */
+ pg_atomic_write_u32(&lineInfo->line_size, line_size);
+ pg_atomic_write_u32(&lineInfo->line_state, LINE_LEADER_POPULATED);
+ elog(DEBUG1, "[Leader] After adding - line position:%d, line_size:%d",
+ line_pos, line_size);
..
}

Can we instead call UpdateSharedLineInfo (the new name for UpdateBlockInLineInfo) to do this, and maybe have it only update the required info? The idea is to centralize the code for updating SharedLineInfo.

5.
+static uint32
+GetLinePosition(CopyState cstate)
+{
+ ParallelCopyData *pcdata = cstate->pcdata;
+ ParallelCopyShmInfo *pcshared_info = pcdata->pcshared_info;
+ uint32 previous_pos = pcdata->worker_processed_pos;
+ uint32 write_pos = (previous_pos == -1) ?
+ 0 : (previous_pos + 1) % RINGSIZE;

It seems to me that each worker has to hop through all the processed chunks before getting the chunk which it can process. This will work, but I think it is better if we have some shared counter which can tell us the next chunk to be processed and avoid all the unnecessary work of hopping to find the exact position.

v5-0004-Documentation-for-parallel-copy
-----------------------------------------
1. Can you add one or two examples towards the end of the page where we have examples for other Copy options?

Please run pgindent on all patches as that will make the code look better.

From the testing perspective,
1. Test by having something force_parallel_mode = regress which means that all existing Copy tests in the regression will be executed via new worker code. You can have this as a test-only patch for now and make sure all existing tests passed with this.
2. Do we have tests for toast tables? I think if you implement the previous point some existing tests might cover it, but I feel we should have at least one or two tests for the same.
3. Have we checked the code coverage of the newly added code with the existing tests?

--
With Regards,
Amit Kapila.
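[Editor's note: to make comment 5 above concrete, here is a minimal sketch (not from the patch) of the shared-counter idea: each worker claims the next ring position with an atomic fetch-and-add instead of hopping over already-processed entries. The next_line_pos field is a hypothetical addition to the patch's ParallelCopyShmInfo, and RINGSIZE is assumed to be the ring capacity as defined in the patch. A worker would still need to wait until the leader marks the claimed slot LINE_LEADER_POPULATED before consuming it.]

```c
#include "port/atomics.h"

/*
 * Hypothetical sketch: claim the next unprocessed line position via a
 * shared atomic counter.  Assumes a pg_atomic_uint32 next_line_pos were
 * added to ParallelCopyShmInfo (initialized with pg_atomic_init_u32).
 */
static uint32
ClaimNextLinePosition(ParallelCopyShmInfo *pcshared_info)
{
	/*
	 * pg_atomic_fetch_add_u32 returns the value before the increment, so
	 * every worker gets a distinct position without taking a lock.
	 */
	uint32		pos = pg_atomic_fetch_add_u32(&pcshared_info->next_line_pos, 1);

	return pos % RINGSIZE;
}
```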
On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Few additional comments:
> > ======================
>
> Some more comments:
>
Thanks for the comments, Amit. I will work on them and provide a patch in the next few days.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Tue, Sep 29, 2020 at 3:16 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> Hi Vignesh and Bharath,
>
> Seems like the Parallel Copy patch is regarding RI_TRIGGER_PK as
> parallel-unsafe.
> Can you explain why this is?
>
I don't think we need to restrict this case, and even if there is some reason to do so, then probably it should be mentioned in the comments.

--
With Regards,
Amit Kapila.
Hello Vignesh,

I've done some basic benchmarking on the v4 version of the patches (but AFAIK the v5 should perform about the same), and some initial review.

For the benchmarking, I used the lineitem table from TPC-H - for the 75GB data set, this largest table is about 64GB once loaded, with another 54GB in 5 indexes. This is on a server with 32 cores, 64GB of RAM and NVMe storage.

The COPY duration with varying number of workers (specified using the parallel COPY option) looks like this:

    workers   duration
    ---------------------
          0       1366
          1       1255
          2        704
          3        526
          4        434
          5        385
          6        347
          7        322
          8        327

So this seems to work pretty well - initially we get almost linear speedup, then it slows down (likely due to contention for locks, I/O etc.). Not bad.

I've only done a quick review, but overall the patch looks in fairly good shape.

1) I don't quite understand why we need INCREMENTPROCESSED and RETURNPROCESSED, considering each just does ++ or return. It just obfuscates the code, I think.

2) I find it somewhat strange that BeginParallelCopy can just decide not to do parallel copy after all. Why not make this decision in the caller? Or maybe it's fine this way, not sure.

3) AFAIK we don't modify typedefs.list in patches, so these changes should be removed.

4) IsTriggerFunctionParallelSafe actually checks all triggers, not just one, so the comment needs minor rewording.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
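[Editor's note: on Tomas's point 2, a hedged sketch of what the caller-side decision could look like. IsParallelCopyAllowed, ParallelCopyFrom and CopyFrom are names that appear in the patch or in copy.c, while the nworkers field, the return types, and the exact call shape here are hypothetical, not taken from the patch.]

```c
/*
 * Hypothetical caller-side decision, per review comment 2: check the
 * parallel-safety conditions up front and only then enter the parallel
 * path, so BeginParallelCopy never has to silently decline.
 */
static uint64
DoCopyFromSketch(CopyState cstate)
{
	if (cstate->nworkers > 0 && IsParallelCopyAllowed(cstate))
		return ParallelCopyFrom(cstate);	/* leader + workers */

	return CopyFrom(cstate);				/* serial path */
}
```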
On Sat, Oct 3, 2020 at 6:20 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> Hello Vignesh,
>
> I've done some basic benchmarking on the v4 version of the patches (but
> AFAIK the v5 should perform about the same), and some initial review.
>
> For the benchmarking, I used the lineitem table from TPC-H - for the 75GB
> data set, this largest table is about 64GB once loaded, with another
> 54GB in 5 indexes. This is on a server with 32 cores, 64GB of RAM and
> NVMe storage.
>
> The COPY duration with varying number of workers (specified using the
> parallel COPY option) looks like this:
>
>     workers   duration
>     ---------------------
>           0       1366
>           1       1255
>           2        704
>           3        526
>           4        434
>           5        385
>           6        347
>           7        322
>           8        327
>
> So this seems to work pretty well - initially we get almost linear
> speedup, then it slows down (likely due to contention for locks, I/O
> etc.). Not bad.
>
+1. These numbers (> 4x speed up) look good to me.

--
With Regards,
Amit Kapila.
On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Jul 22, 2020 at 7:48 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Tue, Jul 21, 2020 at 3:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > Review comments:
> > > ===================
> > >
> > > 0001-Copy-code-readjustment-to-support-parallel-copy
> > > 1.
> > > @@ -807,8 +835,11 @@ CopyLoadRawBuf(CopyState cstate)
> > > else
> > > nbytes = 0; /* no data need be saved */
> > >
> > > + if (cstate->copy_dest == COPY_NEW_FE)
> > > + minread = RAW_BUF_SIZE - nbytes;
> > > +
> > > inbytes = CopyGetData(cstate, cstate->raw_buf + nbytes,
> > > - 1, RAW_BUF_SIZE - nbytes);
> > > + minread, RAW_BUF_SIZE - nbytes);
> > >
> > > No comment to explain why this change is done?
> > >
> > > 0002-Framework-for-leader-worker-in-parallel-copy
> >
> > Currently CopyGetData copies a lesser amount of data to the buffer even though space is available in the buffer, because minread was passed as 1 to CopyGetData. Because of this there are frequent calls to CopyGetData for fetching the data. In this case it will load only some data due to the below check:
> > while (maxread > 0 && bytesread < minread && !cstate->reached_eof)
> > After reading some data, bytesread will be greater than minread (which is passed as 1) and it returns with a lesser amount of data, even though there is some space.
> > This change is required for the parallel copy feature, as each time we get a new DSM data block which is of 64K size and copy the data. If we copy less data into DSM data blocks we might end up consuming all the DSM data blocks.
> >
> > Why can't we reuse the DSM block which has unfilled space?
> >
> > I felt this issue can be fixed as part of HEAD. Have posted a separate thread [1] for this. I'm planning to remove that change once it gets committed. Can that go as a separate
> > patch or should we include it here?
> > [1] - https://www.postgresql.org/message-id/CALDaNm0v4CjmvSnftYnx_9pOS_dKRG%3DO3NnBgJsQmi0KipvLog%40mail.gmail.com
> >
>
> I am convinced by the reason given by Kyotaro-San in that other
> thread [1] and the performance data shown by Peter that this can't be an
> independent improvement and rather in some cases it can do harm. Now,
> if you need it for the parallel-copy path then we can change it
> specifically to the parallel-copy code path, but I don't understand
> your reason completely.
>
Whenever we need data to be populated, we will get a new data block and pass it to CopyGetData to populate the data. In the case of file copy, the server will completely fill the data block; we expect the data to be filled completely. There is no scenario where a partial data block is returned even though data is present, except for EOF or no data being available. But in the case of STDIN data copy, even though there is 8K of space available in the data block and 8K of data available in STDIN, CopyGetData will return as soon as the libpq buffer data is more than the minread. We pass a new data block every time to load data, so every time we pass an 8K data block, CopyGetData loads only a few bytes into the new data block and returns. I wanted to keep the same data population logic for both file copy and STDIN copy, i.e. copy full 8K data blocks and then process the populated data. There is an alternative solution: I can have some special handling in case of STDIN wherein the existing data block can be passed with the index from where the data should be copied. Thoughts?

> > > 2.
> ..
> > > + */
> > > +typedef struct ParallelCopyLineBoundary
> > >
> > > Are we doing all this state management to avoid using locks while
> > > processing lines? If so, I think we can use either spinlock or LWLock
> > > to keep the main patch simple and then provide a later patch to make
> > > it lock-less. This will allow us to first focus on the main design of
> > > the patch rather than trying to make this datastructure processing
> > > lock-less in the best possible way.
> > >
> >
> > The steps will be more or less the same if we use a spinlock too. Step 1, step 3 & step 4 will be common; we have to use lock & unlock instead of step 2 & step 5. I feel we can retain the current implementation.
> >
>
> I'll study this in detail and let you know my opinion on the same but
> in the meantime, I don't follow one part of this comment: "If they
> don't follow this order the worker might process wrong line_size and
> leader might populate the information which worker has not yet
> processed or in the process of processing."
>
> Do you want to say that leader might overwrite some information which
> worker hasn't read yet? If so, it is not clear from the comment.
> Another minor point about this comment:
>
Here the leader and worker must follow these steps to avoid any corruption or hang issue. I changed it to:
 * The leader & worker process access the shared line information by following
 * the below steps to avoid any data corruption or hang:

> + * ParallelCopyLineBoundary is common data structure between leader & worker,
> + * Leader process will be populating data block, data block offset &
> the size of
>
> I think there should be a full stop after worker instead of a comma.
>
Changed it.

> > > 6.
> > > In function BeginParallelCopy(), you need to keep a provision to
> > > collect wal_usage and buf_usage stats. See _bt_begin_parallel for
> > > reference. Those will be required for pg_stat_statements.
> > >
> >
> > Fixed
> >
> How did you ensure that this is fixed? Have you tested it, if so
> please share the test? I see a basic problem with your fix.
>
> + /* Report WAL/buffer usage during parallel execution */
> + bufferusage = shm_toc_lookup(toc, PARALLEL_COPY_BUFFER_USAGE, false);
> + walusage = shm_toc_lookup(toc, PARALLEL_COPY_WAL_USAGE, false);
> + InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
> + &walusage[ParallelWorkerNumber]);
>
> You need to call InstrStartParallelQuery() before the actual operation
> starts; without that the stats won't be accurate. Also, after calling
> WaitForParallelWorkersToFinish(), you need to accumulate the stats
> collected from workers, which neither you have done nor is possible
> with the current code in your patch because you haven't made any
> provision to capture them in BeginParallelCopy.
>
> I suggest you look into lazy_parallel_vacuum_indexes() and
> begin_parallel_vacuum() to understand how the buffer/wal usage stats
> are accumulated. Also, please test this functionality using
> pg_stat_statements.
>
Made changes accordingly.

I have verified it using pg_stat_statements (the wide output is trimmed here to the relevant columns):

postgres=# select * from pg_stat_statements where query like '%copy%';

 query                                                                                                               | calls | total_exec_time | rows   | wal_records | wal_fpi | wal_bytes
---------------------------------------------------------------------------------------------------------------------+-------+-----------------+--------+-------------+---------+-----------
 copy hw from '/home/vignesh/postgres/postgres/inst/bin/hw_175000.csv' with(format csv, delimiter ',')               |     1 |      265.195105 | 175000 |        1116 |       0 |   3587203
 copy hw from '/home/vignesh/postgres/postgres/inst/bin/hw_175000.csv' with(format csv, delimiter ',', parallel '2') |     1 |    35668.402482 | 175000 |        1119 |       6 |   3624405
(2 rows)

> > > 0003-Allow-copy-from-command-to-process-data-from-file-ST
> > > 10.
> > > In the commit message, you have written "The leader does not
> > > participate in the insertion of data, leaders only responsibility will
> > > be to identify the lines as fast as possible for the workers to do the
> > > actual copy operation. The leader waits till all the lines populated
> > > are processed by the workers and exits."
> > >
> > > I think you should also mention that we have chosen this design based
> > > on the reason "that everything stalls if the leader doesn't accept
> > > further input data, as well as when there are no available splitted
> > > chunks so it doesn't seem like a good idea to have the leader do other
> > > work. This is backed by the performance data where we have seen that
> > > with 1 worker there is just a 5-10% (or whatever percentage difference
> > > you have seen) performance difference)".
> >
> > Fixed.
> >
>
> Make it a one-paragraph starting from "The leader does not participate
> in the insertion of data .... just a 5-10% performance difference".
> Right now both the parts look a bit disconnected.
>
I made the content starting from "The leader does not" into one paragraph.

> Few additional comments:
> ======================
> v5-0001-Copy-code-readjustment-to-support-parallel-copy
> ---------------------------------------------------------------------------------
> 1.
> +/*
> + * CLEAR_EOL_LINE - Wrapper for clearing EOL.
> + */
> +#define CLEAR_EOL_LINE() \
> +if (!result && !IsHeaderLine()) \
> + ClearEOLFromCopiedData(cstate, cstate->line_buf.data, \
> + cstate->line_buf.len, \
> + &cstate->line_buf.len) \
>
> I don't like this macro.
> I think it is sufficient to move the common
> code to be called from the parallel and non-parallel path in
> ClearEOLFromCopiedData but I think the other checks can be done
> in-place. I think having macros for such a thing makes code less
> readable.
>
I have removed the macro & called ClearEOLFromCopiedData directly wherever required.

> 2.
> -
> +static void PopulateCommonCstateInfo(CopyState cstate, TupleDesc tup_desc,
> + List *attnamelist);
>
> Spurious line removal.
>
I have modified it to keep it as it is.

> v5-0002-Framework-for-leader-worker-in-parallel-copy
> ---------------------------------------------------------------------------
> 3.
> + FullTransactionId full_transaction_id; /* xid for copy from statement */
> + CommandId mycid; /* command id */
> + ParallelCopyLineBoundaries line_boundaries; /* line array */
> +} ParallelCopyShmInfo;
>
> We already serialize FullTransactionId and CommandId via
> InitializeParallelDSM->SerializeTransactionState. Can't we reuse it? I
> think recently the Parallel Insert patch has also done something for this
> [2] so you can refer to that if you want.
>
Changed it to remove the setting of command id & full transaction id. Added a function SetCurrentCommandIdUsedForWorker to set currentCommandIdUsed to true & called GetCurrentCommandId by passing !IsParallelCopy().

> v5-0004-Documentation-for-parallel-copy
> -----------------------------------------------------------
> 1. Perform <command>COPY FROM</command> in parallel using <replaceable
> + class="parameter"> integer</replaceable> background workers.
>
> No need for space before integer.
>
I have removed it.

Attached v6 patch with the fixes.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
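[Editor's note: for readers following the minread discussion above, a simplified, hypothetical sketch of the read-loop semantics being described; the real logic lives in CopyGetData in src/backend/commands/copy.c, and CopyReadFromSource below is an invented helper standing in for the per-source (file/frontend) read call.]

```c
/*
 * Simplified sketch (not the actual copy.c code): with minread = 1 the
 * loop can return after a single short read, even though maxread bytes
 * of buffer space remain -- which is why each fresh DSM data block can
 * end up holding only a few bytes in the STDIN case.
 */
static int
CopyGetDataSketch(CopyState cstate, char *databuf, int minread, int maxread)
{
	int			bytesread = 0;

	while (maxread > 0 && bytesread < minread && !cstate->reached_eof)
	{
		/* Hypothetical stand-in for the real per-source read. */
		int			copied = CopyReadFromSource(cstate, databuf, maxread);

		databuf += copied;
		maxread -= copied;
		bytesread += copied;
	}
	return bytesread;
}
```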
Attachment
- v6-0001-Copy-code-readjustment-to-support-parallel-copy.patch
- v6-0002-Framework-for-leader-worker-in-parallel-copy.patch
- v6-0003-Allow-copy-from-command-to-process-data-from-file.patch
- v6-0004-Documentation-for-parallel-copy.patch
- v6-0005-Tests-for-parallel-copy.patch
- v6-0006-Parallel-Copy-For-Binary-Format-Files.patch
On Tue, Sep 29, 2020 at 3:16 PM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> Hi Vignesh and Bharath,
>
> Seems like the Parallel Copy patch is regarding RI_TRIGGER_PK as
> parallel-unsafe.
> Can you explain why this is?
>
Yes, we don't need to restrict parallelism for the RI_TRIGGER_PK case, as we don't do any command counter increments while performing PK checks, as opposed to RI_TRIGGER_FK/foreign key checks. We have modified this in the v6 patch set.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
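[Editor's note: a minimal sketch of how such a check could distinguish the two RI trigger types, using the existing RI_FKey_trigger_type() helper from commands/trigger.h; the wrapper function and its use inside a loop over relation->trigdesc->triggers are hypothetical, not taken from the patch.]

```c
#include "commands/trigger.h"

/*
 * Hypothetical fragment: treat only FK-side RI triggers as blocking
 * parallel copy, since FK checks perform command counter increments
 * while PK-side checks do not.
 */
static bool
RITriggerBlocksParallelCopy(Trigger *trigger)
{
	int			ri_type = RI_FKey_trigger_type(trigger->tgfoid);

	if (ri_type == RI_TRIGGER_FK)
		return true;			/* FK checks increment the command counter */

	/* RI_TRIGGER_PK and RI_TRIGGER_NONE are fine for parallel copy. */
	return false;
}
```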
On Mon, Sep 28, 2020 at 3:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Sep 22, 2020 at 2:44 PM vignesh C <vignesh21@gmail.com> wrote: > > > > Thanks Ashutosh for your comments. > > > > On Wed, Sep 16, 2020 at 6:36 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > > > > > Hi Vignesh, > > > > > > I've spent some time today looking at your new set of patches and I've > > > some thoughts and queries which I would like to put here: > > > > > > Why are these not part of the shared cstate structure? > > > > > > SerializeString(pcxt, PARALLEL_COPY_KEY_NULL_PRINT, cstate->null_print); > > > SerializeString(pcxt, PARALLEL_COPY_KEY_DELIM, cstate->delim); > > > SerializeString(pcxt, PARALLEL_COPY_KEY_QUOTE, cstate->quote); > > > SerializeString(pcxt, PARALLEL_COPY_KEY_ESCAPE, cstate->escape); > > > > > > > I have used shared_cstate mainly to share the integer & bool data > > types from the leader to worker process. The above data types are of > > char* data type, I will not be able to use it like how I could do it > > for integer type. So I preferred to send these as separate keys to the > > worker. Thoughts? > > > > I think the way you have written will work but if we go with > Ashutosh's proposal it will look elegant and in the future, if we need > to share more strings as part of cstate structure then that would be > easier. You can probably refer to EstimateParamListSpace, > SerializeParamList, and RestoreParamList to see how we can share > different types of data in one key. > Thanks for the solution Amit, I have fixed this and handled it in the v6 patch shared in my previous mail. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
On Thu, Oct 8, 2020 at 5:44 AM vignesh C <vignesh21@gmail.com> wrote:
>
> Attached v6 patch with the fixes.
>

Hi Vignesh,

I noticed a couple of issues when scanning the code in the following patch:

v6-0003-Allow-copy-from-command-to-process-data-from-file.patch

In the following code, it will put a junk uint16 value into *destptr (and thus may well cause a crash) on a Big Endian architecture (Solaris Sparc, s390x, etc.): you're storing a (uint16) string length in a uint32 and then pulling out the lower two bytes of the uint32 and copying them into the location pointed to by destptr.

static void
+CopyStringToSharedMemory(CopyState cstate, char *srcPtr, char *destptr,
+ uint32 *copiedsize)
+{
+ uint32 len = srcPtr ? strlen(srcPtr) + 1 : 0;
+
+ memcpy(destptr, (uint16 *) &len, sizeof(uint16));
+ *copiedsize += sizeof(uint16);
+ if (len)
+ {
+ memcpy(destptr + sizeof(uint16), srcPtr, len);
+ *copiedsize += len;
+ }
+}

I suggest you change the code to:

uint16 len = srcPtr ? (uint16)strlen(srcPtr) + 1 : 0;
memcpy(destptr, &len, sizeof(uint16));

[I assume string length here can't ever exceed (65535 - 1), right?]

Looking a bit deeper into this, I'm wondering if in fact your EstimateStringSize() and EstimateNodeSize() functions should be using BUFFERALIGN() for EACH stored string/node (rather than just calling shm_toc_estimate_chunk() once at the end, after the length of packed strings and nodes has been estimated), to ensure alignment of the start of each string/node. Other Postgres code appears to be aligning each stored chunk using shm_toc_estimate_chunk(). See the definition of that macro and its current usages.

Then you could safely use:

uint16 len = srcPtr ? (uint16)strlen(srcPtr) + 1 : 0;
*(uint16 *)destptr = len;
*copiedsize += sizeof(uint16);
if (len)
{
    memcpy(destptr + sizeof(uint16), srcPtr, len);
    *copiedsize += len;
}

and in the CopyStringFromSharedMemory() function, you could then safely use:

len = *(uint16 *)srcPtr;

The compiler may be smart enough to optimize away the memcpy() in this case anyway, but there are issues in doing this for architectures that take a performance hit for unaligned access, or don't support unaligned access.

Also, in the CopyXXXXFromSharedMemory() functions, you should use palloc() instead of palloc0(), as you're filling the entire palloc'd buffer anyway, so there is no need to ask for an additional MemSet() of all buffer bytes to 0 prior to memcpy().

Regards,
Greg Nancarrow
Fujitsu Australia
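[Editor's note: to make the endianness hazard concrete, a small self-contained sketch (plain C, independent of the patch) showing why memcpy'ing the first two bytes of a uint32 length works on little-endian but yields junk on big-endian, and why storing the length in a uint16 as Greg suggests is byte-order independent.]

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
	uint32_t	len32 = 42;		/* length stored in a uint32, as in the patch */
	uint16_t	len16 = 42;		/* length stored in a uint16, as suggested */
	uint16_t	out;

	/*
	 * Buggy pattern: copies the first two bytes of len32.  On little-endian
	 * those are the low-order bytes (out == 42); on big-endian they are the
	 * high-order bytes (out == 0 here, junk in general).
	 */
	memcpy(&out, &len32, sizeof(uint16_t));
	printf("from uint32: %u\n", (unsigned) out);

	/*
	 * Safe pattern: source and destination widths match, so the value
	 * survives on any byte order.
	 */
	memcpy(&out, &len16, sizeof(uint16_t));
	printf("from uint16: %u\n", (unsigned) out);

	return 0;
}
```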
On Mon, Sep 28, 2020 at 6:37 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > On Mon, Sep 28, 2020 at 3:01 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Tue, Sep 22, 2020 at 2:44 PM vignesh C <vignesh21@gmail.com> wrote: > > > > > > Thanks Ashutosh for your comments. > > > > > > On Wed, Sep 16, 2020 at 6:36 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > > > > > > > Hi Vignesh, > > > > > > > > I've spent some time today looking at your new set of patches and I've > > > > some thoughts and queries which I would like to put here: > > > > > > > > Why are these not part of the shared cstate structure? > > > > > > > > SerializeString(pcxt, PARALLEL_COPY_KEY_NULL_PRINT, cstate->null_print); > > > > SerializeString(pcxt, PARALLEL_COPY_KEY_DELIM, cstate->delim); > > > > SerializeString(pcxt, PARALLEL_COPY_KEY_QUOTE, cstate->quote); > > > > SerializeString(pcxt, PARALLEL_COPY_KEY_ESCAPE, cstate->escape); > > > > > > > > > > I have used shared_cstate mainly to share the integer & bool data > > > types from the leader to worker process. The above data types are of > > > char* data type, I will not be able to use it like how I could do it > > > for integer type. So I preferred to send these as separate keys to the > > > worker. Thoughts? > > > > > > > I think the way you have written will work but if we go with > > Ashutosh's proposal it will look elegant and in the future, if we need > > to share more strings as part of cstate structure then that would be > > easier. You can probably refer to EstimateParamListSpace, > > SerializeParamList, and RestoreParamList to see how we can share > > different types of data in one key. > > > > Yeah. And in addition to that it will also reduce the number of DSM > keys that we need to maintain. > Thanks Ashutosh, This is handled as part of the v6 patch set. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > Few additional comments:
> > ======================
>
> Some more comments:
>
> v5-0002-Framework-for-leader-worker-in-parallel-copy
> ===========================================
> 1.
> These values
> + * help in handover of multiple records with significant size of data to be
> + * processed by each of the workers to make sure there is no context
> switch & the
> + * work is fairly distributed among the workers.
>
> How about writing it as: "These values help in the handover of
> multiple records with the significant size of data to be processed by
> each of the workers. This also ensures there is no context switch and
> the work is fairly distributed among the workers."

Changed as suggested.

> 2. Can we keep WORKER_CHUNK_COUNT, MAX_BLOCKS_COUNT, and RINGSIZE as
> power-of-two? Say WORKER_CHUNK_COUNT as 64, MAX_BLOCK_COUNT as 1024,
> and accordingly choose RINGSIZE. We do it that way at many places. I
> think it can sometimes help in faster processing due to cache size
> requirements, and in this case, I don't see a reason why we can't
> choose these values to be power-of-two. If you agree with this change
> then also do some performance testing after this change?
>
Modified as suggested. I have checked a few performance tests and verified there is no degradation. We will post a performance run of this separately in the coming days.

> 3.
> + bool curr_blk_completed;
> + char data[DATA_BLOCK_SIZE]; /* data read from file */
> + uint8 skip_bytes;
> +} ParallelCopyDataBlock;
>
> Is there a reason to keep skip_bytes after data? Normally the variable-size
> data is at the end of the structure. Also, there is no comment
> explaining the purpose of skip_bytes.
>
Modified as suggested and added comments.

> 4.
> + * Copy data block information.
> + * ParallelCopyDataBlock's will be created in DSM. Data read from file will be
> + * copied in these DSM data blocks. The leader process identifies the records
> + * and the record information will be shared to the workers. The workers will
> + * insert the records into the table. There can be one or more number
> of records
> + * in each of the data block based on the record size.
> + */
> +typedef struct ParallelCopyDataBlock
>
> Keep one empty line after the description line like below. I also
> suggest a minor tweak in the above sentence, as follows:
>
> * Copy data block information.
> *
> * These data blocks are created in DSM. Data read ...
>
> Try to follow a similar format in the other comments as well.
>
Modified as suggested.

> 5. I think it is better to move the parallelism-related code to a new file
> (we can name it copyParallel.c or something like that).
>
Modified: added a copyparallel.c file to hold the copy parallelism functionality; some of the function prototypes and data structures were moved to the copy.h header file so that they can be shared between copy.c and copyparallel.c.

> 6. copy.c(1648,25): warning C4133: 'function': incompatible types -
> from 'ParallelCopyLineState *' to 'uint32 *'
> Getting the above compilation warning on Windows.
>
Modified the data type.

> v5-0003-Allow-copy-from-command-to-process-data-from-file
> ==================================================
> 1.
> @@ -4294,7 +5047,7 @@ BeginCopyFrom(ParseState *pstate,
> * only in text mode.
> */
> initStringInfo(&cstate->attribute_buf);
> - cstate->raw_buf = (char *) palloc(RAW_BUF_SIZE + 1);
> + cstate->raw_buf = (IsParallelCopy()) ? NULL : (char *)
> palloc(RAW_BUF_SIZE + 1);
>
> Is there any way IsParallelCopy can be true by this time? AFAICS, we do
> everything about parallelism after this. If you want to save this
> allocation then we need to move this after we determine whether
> parallelism can be used or not, and accordingly the below code in the
> patch needs to be changed.
>
> * ParallelCopyFrom - parallel copy leader's functionality.
> *
> * Leader executes the before statement for before statement trigger, if before
> @@ -1110,8 +1547,302 @@ ParallelCopyFrom(CopyState cstate)
> ParallelCopyShmInfo *pcshared_info = cstate->pcdata->pcshared_info;
> ereport(DEBUG1, (errmsg("Running parallel copy leader")));
>
> + /* raw_buf is not used in parallel copy, instead data blocks are used.*/
> + pfree(cstate->raw_buf);
> + cstate->raw_buf = NULL;
>
Removed the palloc change; raw_buf will be allocated both for parallel and non-parallel copy. One other solution I thought of was to move the memory allocation to CopyFrom, but that might affect FDWs which use BeginCopyFrom, NextCopyFrom & EndCopyFrom. So I have kept the allocation in BeginCopyFrom and the freeing for parallel copy in ParallelCopyFrom.

> Is there anything else the allocation of which depends on parallelism?
>
I felt this is the only allocated memory that sequential copy requires and which is not required in parallel copy.

> 2.
> +static pg_attribute_always_inline bool
> +IsParallelCopyAllowed(CopyState cstate)
> +{
> + /* Parallel copy not allowed for frontend (2.0 protocol) & binary option. */
> + if ((cstate->copy_dest == COPY_OLD_FE) || cstate->binary)
> + return false;
> +
> + /* Check if copy is into foreign table or temporary table. */
> + if (cstate->rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE ||
> + RelationUsesLocalBuffers(cstate->rel))
> + return false;
> +
> + /* Check if trigger function is parallel safe. */
> + if (cstate->rel->trigdesc != NULL &&
> + !IsTriggerFunctionParallelSafe(cstate->rel->trigdesc))
> + return false;
> +
> + /*
> + * Check if there is after statement or instead of trigger or transition
> + * table triggers.
> + */
> + if (cstate->rel->trigdesc != NULL &&
> + (cstate->rel->trigdesc->trig_insert_after_statement ||
> + cstate->rel->trigdesc->trig_insert_instead_row ||
> + cstate->rel->trigdesc->trig_insert_new_table))
> + return false;
> +
> + /* Check if the volatile expressions are parallel safe, if present any. */
> + if (!CheckExprParallelSafety(cstate))
> + return false;
> +
> + /* Check if the insertion mode is single. */
> + if (FindInsertMethod(cstate) == CIM_SINGLE)
> + return false;
> +
> + return true;
> +}
>
> In the comments, we should write why parallelism is not allowed for a
> particular case. The cases where a parallel-unsafe clause is involved
> are okay, but it is not clear from the comments why it is not allowed in
> the other cases.
>
Added comments.

> 3.
> + ParallelCopyShmInfo *pcshared_info = cstate->pcdata->pcshared_info;
> + ParallelCopyLineBoundary *lineInfo;
> + uint32 line_first_block = pcshared_info->cur_block_pos;
> + line_pos = UpdateBlockInLineInfo(cstate,
> + line_first_block,
> + cstate->raw_buf_index, -1,
> + LINE_LEADER_POPULATING);
> + lineInfo = &pcshared_info->line_boundaries.ring[line_pos];
> + elog(DEBUG1, "[Leader] Adding - block:%d, offset:%d, line position:%d",
> + line_first_block, lineInfo->start_offset, line_pos);
>
> Can we take all the code here inside the function UpdateBlockInLineInfo? I
> see that it is called from one other place, but I guess most of the
> surrounding code there can also be moved inside the function. Can we
> change the name of the function to UpdateSharedLineInfo or something
> like that and remove the inline marking from it? I am not sure we want
> to inline such big functions. If it makes a difference in performance
> then we can probably consider it.
>
Changed as suggested.

> 4.
> EndLineParallelCopy()
> {
> ..
> + /* Update line size. */
> + pg_atomic_write_u32(&lineInfo->line_size, line_size);
> + pg_atomic_write_u32(&lineInfo->line_state, LINE_LEADER_POPULATED);
> + elog(DEBUG1, "[Leader] After adding - line position:%d, line_size:%d",
> + line_pos, line_size);
> ..
> }
>
> Can we instead call UpdateSharedLineInfo (the new name for
> UpdateBlockInLineInfo) to do this, and maybe have it only update the
> required info? The idea is to centralize the code for updating
> SharedLineInfo.
>
Updated as suggested.

> 5.
> +static uint32
> +GetLinePosition(CopyState cstate)
> +{
> + ParallelCopyData *pcdata = cstate->pcdata;
> + ParallelCopyShmInfo *pcshared_info = pcdata->pcshared_info;
> + uint32 previous_pos = pcdata->worker_processed_pos;
> + uint32 write_pos = (previous_pos == -1) ? 0 : (previous_pos + 1) % RINGSIZE;
>
> It seems to me that each worker has to hop through all the processed
> chunks before getting the chunk which it can process. This will work,
> but I think it is better if we have some shared counter which can tell
> us the next chunk to be processed and avoid all the unnecessary work
> of hopping to find the exact position.

I tried using a spinlock to track this position instead of hopping through the processed chunks, but I did not get the earlier performance results; there was a slight degradation:

Use case 2: 3 indexes on integer columns
Run on earlier patches without spinlock:
(220.680, 0, 1X), (185.096, 1, 1.19X), (134.811, 2, 1.64X), (114.585, 4, 1.92X), (107.707, 8, 2.05X), (101.253, 16, 2.18X), (100.749, 20, 2.19X), (100.656, 30, 2.19X)
Run on latest v6 patches with spinlock:
(216.059, 0, 1X), (177.639, 1, 1.22X), (145.213, 2, 1.49X), (126.370, 4, 1.71X), (121.013, 8, 1.78X), (102.933, 16, 2.1X), (103.000, 20, 2.1X), (100.308, 30, 2.15X)

I have not included these changes as there was some performance degradation. I will try to come up with a different solution for this and discuss it in the coming days. This point is not yet handled.

> v5-0004-Documentation-for-parallel-copy
> -----------------------------------------
> 1. Can you add one or two examples towards the end of the page where
> we have examples for other Copy options?
>
>
> Please run pgindent on all patches as that will make the code look better.

I have run pgindent on the latest patches.

> From the testing perspective,
> 1. Test by having something force_parallel_mode = regress which means
> that all existing Copy tests in the regression will be executed via
> new worker code. You can have this as a test-only patch for now and
> make sure all existing tests passed with this.
> 2. Do we have tests for toast tables? I think if you implement the
> previous point some existing tests might cover it, but I feel we should
> have at least one or two tests for the same.
> 3. Have we checked the code coverage of the newly added code with
> the existing tests?

These will be handled in the next few days.

These changes are present as part of the v6 patch set. I'm summarizing the pending open points so that I don't miss anything:
1) Performance test on the latest patch set.
2) Testing points suggested.
3) Support of parallel copy for COPY_OLD_FE.
4) Worker has to hop through all the processed chunks before getting the chunk which it can process.
5) Handling of Tomas's comments.
6) Handling of Greg's comments.
We plan to work on these and complete them in the next few days.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Thu, Oct 8, 2020 at 12:14 AM vignesh C <vignesh21@gmail.com> wrote: > > On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > I am convinced by the reason given by Kyotaro-San in that another > > thread [1] and performance data shown by Peter that this can't be an > > independent improvement and rather in some cases it can do harm. Now, > > if you need it for a parallel-copy path then we can change it > > specifically to the parallel-copy code path but I don't understand > > your reason completely. > > > > Whenever we need data to be populated, we will get a new data block & > pass it to CopyGetData to populate the data. In case of file copy, the > server will completely fill the data block. We expect the data to be > filled completely. If data is available it will completely load the > complete data block in case of file copy. There is no scenario where > even if data is present a partial data block will be returned except > for EOF or no data available. But in case of STDIN data copy, even > though there is 8K data available in data block & 8K data available in > STDIN, CopyGetData will return as soon as libpq buffer data is more > than the minread. We will pass new data block every time to load data. > Every time we pass an 8K data block but CopyGetData loads a few bytes > in the new data block & returns. I wanted to keep the same data > population logic for both file copy & STDIN copy i.e copy full 8K data > blocks & then the populated data can be required. There is an > alternative solution I can have some special handling in case of STDIN > wherein the existing data block can be passed with the index from > where the data should be copied. Thoughts? > What you are proposing as an alternative solution, isn't that what we are doing without the patch? IIUC, you require this because of your corresponding changes to handle COPY_NEW_FE in CopyReadLine(), is that right? If so, what is the difficulty in making it behave similar to the non-parallel case? -- With Regards, Amit Kapila.
On Thu, Oct 8, 2020 at 12:14 AM vignesh C <vignesh21@gmail.com> wrote:
>
> On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > + */
> > > > +typedef struct ParallelCopyLineBoundary
> > > >
> > > > Are we doing all this state management to avoid using locks while
> > > > processing lines? If so, I think we can use either spinlock or LWLock
> > > > to keep the main patch simple and then provide a later patch to make
> > > > it lock-less. This will allow us to first focus on the main design of
> > > > the patch rather than trying to make this datastructure processing
> > > > lock-less in the best possible way.
> > > >
> > >
> > > The steps will be more or less the same if we use a spinlock too. Step 1, step 3 & step 4 will be common; we have to use lock & unlock instead of step 2 & step 5. I feel we can retain the current implementation.
> > >
> >
> > I'll study this in detail and let you know my opinion on the same but
> > in the meantime, I don't follow one part of this comment: "If they
> > don't follow this order the worker might process wrong line_size and
> > leader might populate the information which worker has not yet
> > processed or in the process of processing."
> >
> > Do you want to say that leader might overwrite some information which
> > worker hasn't read yet? If so, it is not clear from the comment.
> > Another minor point about this comment:
> >
> Here the leader and worker must follow these steps to avoid any corruption or hang issue. Changed it to:
> * The leader & worker process access the shared line information by following
> * the below steps to avoid any data corruption or hang:
>
Actually, I wanted more along the lines of why such corruption or a hang can happen. It might help reviewers to understand why you have followed such a sequence.

> > How did you ensure that this is fixed? Have you tested it, if so
> > please share the test? I see a basic problem with your fix.
> >
> > + /* Report WAL/buffer usage during parallel execution */
> > + bufferusage = shm_toc_lookup(toc, PARALLEL_COPY_BUFFER_USAGE, false);
> > + walusage = shm_toc_lookup(toc, PARALLEL_COPY_WAL_USAGE, false);
> > + InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
> > + &walusage[ParallelWorkerNumber]);
> >
> > You need to call InstrStartParallelQuery() before the actual operation
> > starts; without that the stats won't be accurate. Also, after calling
> > WaitForParallelWorkersToFinish(), you need to accumulate the stats
> > collected from workers, which neither you have done nor is possible
> > with the current code in your patch because you haven't made any
> > provision to capture them in BeginParallelCopy.
> >
> > I suggest you look into lazy_parallel_vacuum_indexes() and
> > begin_parallel_vacuum() to understand how the buffer/wal usage stats
> > are accumulated. Also, please test this functionality using
> > pg_stat_statements.
> >
> Made changes accordingly.
>
> I have verified it using pg_stat_statements (the wide output is trimmed here to the relevant columns):
>
> postgres=# select * from pg_stat_statements where query like '%copy%';
>
>  query                                                                                                               | calls | total_exec_time | rows   | wal_records | wal_fpi | wal_bytes
> ---------------------------------------------------------------------------------------------------------------------+-------+-----------------+--------+-------------+---------+-----------
>  copy hw from '/home/vignesh/postgres/postgres/inst/bin/hw_175000.csv' with(format csv, delimiter ',')               |     1 |      265.195105 | 175000 |        1116 |       0 |   3587203
>  copy hw from '/home/vignesh/postgres/postgres/inst/bin/hw_175000.csv' with(format csv, delimiter ',', parallel '2') |     1 |    35668.402482 | 175000 |        1119 |       6 |   3624405
> (2 rows)
>
I am not able to properly parse the data, but if I understand correctly, the wal data for the non-parallel (1116 | 0 | 3587203) and parallel (1119 | 6 | 3624405) cases doesn't seem to be the same. Is that right? If so, why? Please ensure that no checkpoint happens for both cases.

--
With Regards,
Amit Kapila.
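[Editor's note: for context on the instrumentation pattern Amit refers to, a hedged sketch of the usual worker-side sequence, modeled on the parallel vacuum code. InstrStartParallelQuery/InstrEndParallelQuery, shm_toc_lookup and ParallelWorkerNumber are existing PostgreSQL APIs; the PARALLEL_COPY_* keys and the copy step come from the quoted patch, and the wrapper function itself is hypothetical.]

```c
static void
ParallelCopyWorkerBody(shm_toc *toc, CopyState cstate)
{
	BufferUsage *bufferusage;
	WalUsage   *walusage;

	/* Start tracking buffer/WAL usage before doing any real work. */
	InstrStartParallelQuery();

	/* ... the worker performs its share of the copy here ... */

	/*
	 * Report this worker's usage so the leader can accumulate it after
	 * WaitForParallelWorkersToFinish().
	 */
	bufferusage = shm_toc_lookup(toc, PARALLEL_COPY_BUFFER_USAGE, false);
	walusage = shm_toc_lookup(toc, PARALLEL_COPY_WAL_USAGE, false);
	InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
						  &walusage[ParallelWorkerNumber]);
}
```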
On Thu, Oct 8, 2020 at 8:43 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> On Thu, Oct 8, 2020 at 5:44 AM vignesh C <vignesh21@gmail.com> wrote:
>
> > Attached v6 patch with the fixes.
> >
>
> Hi Vignesh,
>
> I noticed a couple of issues when scanning the code in the following patch:
>
> v6-0003-Allow-copy-from-command-to-process-data-from-file.patch
>
> In the following code, it will put a junk uint16 value into *destptr
> (and thus may well cause a crash) on a Big Endian architecture
> (Solaris Sparc, s390x, etc.):
> You're storing a (uint16) string length in a uint32 and then pulling
> out the lower two bytes of the uint32 and copying them into the
> location pointed to by destptr.
>
> static void
> +CopyStringToSharedMemory(CopyState cstate, char *srcPtr, char *destptr,
> + uint32 *copiedsize)
> +{
> + uint32 len = srcPtr ? strlen(srcPtr) + 1 : 0;
> +
> + memcpy(destptr, (uint16 *) &len, sizeof(uint16));
> + *copiedsize += sizeof(uint16);
> + if (len)
> + {
> + memcpy(destptr + sizeof(uint16), srcPtr, len);
> + *copiedsize += len;
> + }
> +}
>
> I suggest you change the code to:
>
> uint16 len = srcPtr ? (uint16)strlen(srcPtr) + 1 : 0;
> memcpy(destptr, &len, sizeof(uint16));
>
> [I assume string length here can't ever exceed (65535 - 1), right?]
>
Your suggestion makes sense to me if the assumption related to string length is correct. If we can't ensure that, then we probably need to use a four-byte uint32 to store the length.

> Looking a bit deeper into this, I'm wondering if in fact your
> EstimateStringSize() and EstimateNodeSize() functions should be using
> BUFFERALIGN() for EACH stored string/node (rather than just calling
> shm_toc_estimate_chunk() once at the end, after the length of packed
> strings and nodes has been estimated), to ensure alignment of the start of
> each string/node. Other Postgres code appears to be aligning each
> stored chunk using shm_toc_estimate_chunk(). See the definition of
> that macro and its current usages.
>
I am not sure if this is required for the purpose of correctness. AFAIU, we store/estimate multiple parameters the same way at other places; see EstimateParamListSpace and SerializeParamList. Do you have something else in mind?

While looking at the latest code, I observed the below issue in patch v6-0003-Allow-copy-from-command-to-process-data-from-file:

+ /* Estimate the size for shared information for PARALLEL_COPY_KEY_CSTATE */
+ est_cstateshared = MAXALIGN(sizeof(SerializedParallelCopyState));
+ shm_toc_estimate_chunk(&pcxt->estimator, est_cstateshared);
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
+ strsize = EstimateCstateSize(pcxt, cstate, attnamelist, &whereClauseStr,
+ &rangeTableStr, &attnameListStr,
+ &notnullListStr, &nullListStr,
+ &convertListStr);

Here, do we need to separately estimate the size of SerializedParallelCopyState when it is also done in EstimateCstateSize?

--
With Regards,
Amit Kapila.
On Fri, Oct 9, 2020 at 5:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > Looking a bit deeper into this, I'm wondering if in fact your > > EstimateStringSize() and EstimateNodeSize() functions should be using > > BUFFERALIGN() for EACH stored string/node (rather than just calling > > shm_toc_estimate_chunk() once at the end, after the length of packed > > strings and nodes has been estimated), to ensure alignment of start of > > each string/node. Other Postgres code appears to be aligning each > > stored chunk using shm_toc_estimate_chunk(). See the definition of > > that macro and its current usages. > > > > I am not sure if this required for the purpose of correctness. AFAIU, > we do store/estimate multiple parameters in same way at other places, > see EstimateParamListSpace and SerializeParamList. Do you have > something else in mind? > The point I was trying to make is that potentially more efficient code can be used if the individual strings/nodes are aligned, rather than packed (as they are now), but as you point out, there are already cases (e.g. SerializeParamList) where within the separately-aligned chunks the data is not aligned, so maybe not a big deal. Oh well, without alignment, that means use of memcpy() cannot really be avoided here for serializing/de-serializing ints etc., let's hope the compiler optimizes it as best it can. Regards, Greg Nancarrow Fujitsu Australia
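[Editor's note: as a small aside on the point above, a hedged illustration of the portable idiom for reading a length from an unaligned, packed buffer; this is generic C, not code from the patch.]

```c
#include <stdint.h>
#include <string.h>

/*
 * Read a uint16 length from a possibly unaligned position in a packed
 * buffer.  memcpy() is safe on every architecture; compilers typically
 * lower it to a single load where unaligned access is allowed, so the
 * cost of packing (vs. BUFFERALIGN'ing each chunk) is usually just this
 * idiom rather than a real function call.
 */
static uint16_t
read_packed_len(const char *srcptr)
{
	uint16_t	len;

	memcpy(&len, srcptr, sizeof(uint16_t));
	return len;
}
```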
On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> From the testing perspective,
> 1. Test by having something force_parallel_mode = regress which means
> that all existing Copy tests in the regression will be executed via
> new worker code. You can have this as a test-only patch for now and
> make sure all existing tests passed with this.
>
I don't think all the existing copy test cases (except the new test cases added in the parallel copy patch set) would run inside the parallel worker if force_parallel_mode is on. This is because the parallelism will be picked up for parallel copy only if the parallel option is specified, unlike parallelism for select queries.
Anyway, I ran with force_parallel_mode set to both on and regress. All copy-related tests and make check/make check-world ran fine.
>
> 2. Do we have tests for toast tables? I think if you implement the
> previous point some existing tests might cover it but I feel we should
> have at least one or two tests for the same.
>
Toast table use case 1: 10000 tuples, 9.6GB data, 3 indexes: 2 on integer columns, 1 on a text column (not the toast column), csv file, each row is > 1320KB:
(222.767, 0, 1X), (134.171, 1, 1.66X), (93.749, 2, 2.38X), (93.672, 4, 2.38X), (94.827, 8, 2.35X), (93.766, 16, 2.37X), (98.153, 20, 2.27X), (122.721, 30, 1.81X)
Toast table use case 2: 100000 tuples, 96GB data, 3 indexes: 2 on integer columns, 1 on a text column (not the toast column), csv file, each row is > 1320KB:
(2255.032, 0, 1X), (1358.628, 1, 1.66X), (901.170, 2, 2.5X), (912.743, 4, 2.47X), (988.718, 8, 2.28X), (938.000, 16, 2.4X), (997.556, 20, 2.26X), (1000.586, 30, 2.25X)
Toast table use case 3: 10000 tuples, 9.6GB, no indexes, binary file, each row is > 1320KB:
(136.983, 0, 1X), (136.418, 1, 1X), (81.896, 2, 1.66X), (62.929, 4, 2.16X), (52.311, 8, 2.6X), (40.032, 16, 3.49X), (44.097, 20, 3.09X), (62.310, 30, 2.18X)
In the case of a Toast table, we could achieve up to 2.5X for csv files, and 3.5X for binary files. We are analyzing this point and will post an update on our findings soon.
While testing the Toast table case with a binary file, I discovered an issue with the earlier v6-0006-Parallel-Copy-For-Binary-Format-Files.patch from [1]; I fixed it and have added the updated v6-0006 patch here. Please note that I'm also attaching patches 1 to 5 from version 6 just for completeness; they have no change from what Vignesh sent earlier in [1].
>
> 3. Have we checked the code coverage of the newly added code with
> existing tests?
>
So far, we have manually ensured that most of the code paths are covered (see the list of test cases below). But we are also planning to measure the code coverage using some tool in the coming days.
Apart from the above tests, I also captured performance measurements on the latest v6 patch set.
Use case 1: 10 million rows, 5.2GB data, 2 indexes on integer columns, 1 index on a text column, csv file
(1168.484, 0, 1X), (1116.442, 1, 1.05X), (641.272, 2, 1.82X), (338.963, 4, 3.45X), (202.914, 8, 5.76X), (139.884, 16, 8.35X), (128.955, 20, 9.06X), (131.898, 30, 8.86X)
Use case 2: 10 million rows, 5.2GB data, 2 indexes on integer columns, 1 index on a text column, binary file
(1097.83, 0, 1X), (1095.735, 1, 1.002X), (625.610, 2, 1.75X), (319.833, 4, 3.43X), (186.908, 8, 5.87X), (132.115, 16, 8.31X), (128.854, 20, 8.52X), (134.965, 30, 8.13X)
Use case 3: 10 million rows, 5.2GB data, 3 indexes on integer columns, csv file
(218.227, 0, 1X), (182.815, 1, 1.19X), (135.500, 2, 1.61X), (113.954, 4, 1.91X), (106.243, 8, 2.05X), (101.222, 16, 2.15X), (100.378, 20, 2.17X), (100.351, 30, 2.17X)
All the above tests are performed on the latest v6 patch set (attached here in this thread) with a custom postgresql.conf[2]. The results are of the triplet form (exec time in sec, number of workers, gain).
Overall, we have the below test cases to cover the code and for performance measurements. We plan to run these tests whenever a new set of patches is posted.
1. csv
2. binary
3. force parallel mode = regress
4. toast data csv and binary
5. foreign key check, before row, after row, before statement, after statement, instead of triggers
6. partition case
7. foreign partitions and partitions having trigger cases
8. where clause having parallel unsafe and safe expression, default parallel unsafe and safe expression
9. temp, global, local, unlogged, inherited tables cases, foreign tables
[1] https://www.postgresql.org/message-id/CALDaNm29DJKy0-vozs8eeBRf2u3rbvPdZHCocrd0VjoWHS7h5A%40mail.gmail.com
[2]
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
Attachment
- v6-0001-Copy-code-readjustment-to-support-parallel-copy.patch
- v6-0002-Framework-for-leader-worker-in-parallel-copy.patch
- v6-0003-Allow-copy-from-command-to-process-data-from-file.patch
- v6-0004-Documentation-for-parallel-copy.patch
- v6-0005-Tests-for-parallel-copy.patch
- v6-0006-Parallel-Copy-For-Binary-Format-Files.patch
On Fri, Oct 9, 2020 at 2:52 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > From the testing perspective,
> > 1. Test by having something force_parallel_mode = regress which means
> > that all existing Copy tests in the regression will be executed via
> > new worker code. You can have this as a test-only patch for now and
> > make sure all existing tests passed with this.
> >
> I don't think all the existing copy test cases (except the new test cases added in the parallel copy patch set) would run inside the parallel worker if force_parallel_mode is on. This is because the parallelism will be picked up for parallel copy only if the parallel option is specified, unlike parallelism for select queries.
>
Sure, you need to change the code such that when force_parallel_mode = 'regress' is specified then it always uses one worker. This is primarily for testing purposes and will help during the development of this patch, as it will make all existing Copy tests use quite a good portion of the parallel infrastructure.

> All the above tests are performed on the latest v6 patch set (attached here in this thread) with a custom postgresql.conf[2]. The results are of the triplet form (exec time in sec, number of workers, gain)
>
Okay, so I am assuming the performance is the same as we have seen with the earlier versions of the patches.

> Overall, we have the below test cases to cover the code and for performance measurements. We plan to run these tests whenever a new set of patches is posted.
>
> 1. csv
> 2. binary

Don't we need tests for plain text files as well?

> 3. force parallel mode = regress
> 4. toast data csv and binary
> 5. foreign key check, before row, after row, before statement, after statement, instead of triggers
> 6. partition case
> 7. foreign partitions and partitions having trigger cases
> 8. where clause having parallel unsafe and safe expression, default parallel unsafe and safe expression
> 9. temp, global, local, unlogged, inherited tables cases, foreign tables
>
Sounds like good coverage. So, are you doing all this testing manually? How are you maintaining these tests?

--
With Regards,
Amit Kapila.
On Fri, Oct 9, 2020 at 3:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Fri, Oct 9, 2020 at 2:52 PM Bharath Rupireddy
> <bharath.rupireddyforpostgres@gmail.com> wrote:
> >
> > On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > From the testing perspective,
> > > 1. Test by having something force_parallel_mode = regress which means
> > > that all existing Copy tests in the regression will be executed via
> > > new worker code. You can have this as a test-only patch for now and
> > > make sure all existing tests passed with this.
> > >
> >
> > I don't think all the existing copy test cases (except the new test
> > cases added in the parallel copy patch set) would run inside the
> > parallel worker if force_parallel_mode is on. This is because the
> > parallelism will be picked up for parallel copy only if the parallel
> > option is specified, unlike parallelism for select queries.
> >
>
> Sure, you need to change the code such that when force_parallel_mode =
> 'regress' is specified then it always uses one worker. This is
> primarily for testing purposes and will help during the development of
> this patch, as it will make all existing Copy tests use quite a good
> portion of the parallel infrastructure.
>

IIUC, firstly, I will set force_parallel_mode = FORCE_PARALLEL_REGRESS as
the default value in guc.c, and then adjust the parallelism related code
in copy.c such that it always picks 1 worker and spawns it. This way, all
the existing copy test cases would be run in a parallel worker. Please let
me know if this is okay. If yes, I will do this and update here.

> > All the above tests are performed on the latest v6 patch set (attached
> > here in this thread) with custom postgresql.conf [1]. The results are
> > of the triplet form (exec time in sec, number of workers, gain)
> >
>
> Okay, so I am assuming the performance is the same as we have seen
> with the earlier versions of patches.
>

Yes. Most recent run on v5 patch set [1]

> > Overall, we have the below test cases to cover the code and to measure
> > performance. We plan to run these tests whenever a new set of patches
> > is posted.
> >
> > 1. csv
> > 2. binary
>
> Don't we need the tests for plain text files as well?
>

Will add one.

> > 3. force parallel mode = regress
> > 4. toast data csv and binary
> > 5. foreign key check, before row, after row, before statement, after
> > statement, instead of triggers
> > 6. partition case
> > 7. foreign partitions and partitions having trigger cases
> > 8. where clause having parallel unsafe and safe expression, default
> > parallel unsafe and safe expression
> > 9. temp, global, local, unlogged, inherited tables cases, foreign tables
> >
>
> Sounds like good coverage. So, are you doing all this testing
> manually? How are you maintaining these tests?
>

Yes, running them manually. A few of the tests (1, 2, 4) require huge
datasets for performance measurements, and the other test cases are to
ensure we don't choose parallelism. We will try to add the test cases that
are not meant for performance to the patch tests.

[1] - https://www.postgresql.org/message-id/CALj2ACW%3Djm5ri%2B7rXiQaFT_c5h2rVS%3DcJOQVFR5R%2Bbowt3QDkw%40mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Oct 9, 2020 at 3:50 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Fri, Oct 9, 2020 at 3:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Fri, Oct 9, 2020 at 2:52 PM Bharath Rupireddy
> > <bharath.rupireddyforpostgres@gmail.com> wrote:
> > >
> > > I don't think all the existing copy test cases (except the new test
> > > cases added in the parallel copy patch set) would run inside the
> > > parallel worker if force_parallel_mode is on. This is because the
> > > parallelism will be picked up for parallel copy only if the parallel
> > > option is specified, unlike parallelism for select queries.
> > >
> >
> > Sure, you need to change the code such that when force_parallel_mode =
> > 'regress' is specified then it always uses one worker. This is
> > primarily for testing purposes and will help during the development of
> > this patch, as it will make all existing Copy tests use quite a good
> > portion of the parallel infrastructure.
> >
>
> IIUC, firstly, I will set force_parallel_mode = FORCE_PARALLEL_REGRESS
> as the default value in guc.c,
>

No need to set this as the default value. You can change it in
postgresql.conf before running the tests.

> and then adjust the parallelism related
> code in copy.c such that it always picks 1 worker and spawns it. This
> way, all the existing copy test cases would be run in a parallel worker.
> Please let me know if this is okay.
>

Yeah, this sounds fine.

> If yes, I will do this and update
> here.
>

Okay, thanks, but ensure the difference in test execution before and after
your change. After your change, all the 'copy' tests should invoke the
worker to perform the copy.

> > > All the above tests are performed on the latest v6 patch set
> > > (attached here in this thread) with custom postgresql.conf [1]. The
> > > results are of the triplet form (exec time in sec, number of
> > > workers, gain)
> > >
> >
> > Okay, so I am assuming the performance is the same as we have seen
> > with the earlier versions of patches.
> >
>
> Yes. Most recent run on v5 patch set [1]
>

Okay, good to know that.

--
With Regards,
Amit Kapila.
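A minimal sketch of the test-only adjustment being discussed, assuming it lives near the start of BeginParallelCopy() and that IsParallelCopyAllowed() is the safety check from the patch set (the exact placement is illustrative, not from the patch):

    /*
     * Test-only adjustment: when force_parallel_mode = regress, run even a
     * plain COPY FROM through a single parallel worker so that the existing
     * regression tests exercise the parallel infrastructure.
     */
    int         nworkers = cstate->nworkers;

    if (nworkers <= 0 &&
        force_parallel_mode == FORCE_PARALLEL_REGRESS &&
        IsParallelCopyAllowed(cstate))
        nworkers = 1;           /* force exactly one worker for testing */

force_parallel_mode and FORCE_PARALLEL_REGRESS are the existing GUC and enum value; only the placement and the use of nworkers here are assumptions.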
On Fri, Oct 9, 2020 at 12:10 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> While looking at the latest code, I observed the below issue in patch
> v6-0003-Allow-copy-from-command-to-process-data-from-file:
>
> + /* Estimate the size for shared information for PARALLEL_COPY_KEY_CSTATE */
> + est_cstateshared = MAXALIGN(sizeof(SerializedParallelCopyState));
> + shm_toc_estimate_chunk(&pcxt->estimator, est_cstateshared);
> + shm_toc_estimate_keys(&pcxt->estimator, 1);
> +
> + strsize = EstimateCstateSize(pcxt, cstate, attnamelist, &whereClauseStr,
> +                              &rangeTableStr, &attnameListStr,
> +                              &notnullListStr, &nullListStr,
> +                              &convertListStr);
>
> Here, do we need to separately estimate the size of
> SerializedParallelCopyState when it is also done in
> EstimateCstateSize?

This is not required; it has been removed in the attached patches.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Attachment
- v7-0001-Copy-code-readjustment-to-support-parallel-copy.patch
- v7-0002-Framework-for-leader-worker-in-parallel-copy.patch
- v7-0003-Allow-copy-from-command-to-process-data-from-file.patch
- v7-0004-Documentation-for-parallel-copy.patch
- v7-0005-Tests-for-parallel-copy.patch
- v7-0006-Parallel-Copy-For-Binary-Format-Files.patch
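For reference, the corrected estimation collapses to a single chunk/key pair, roughly like the sketch below (this assumes EstimateCstateSize() accounts for the MAXALIGN'd SerializedParallelCopyState header in its return value; whether the shm_toc calls live in the caller or inside the function is a detail of the patch):

    /*
     * One estimate covers the serialized cstate header plus all the
     * serialized strings; no separate estimate for the header is needed.
     */
    strsize = EstimateCstateSize(pcxt, cstate, attnamelist, &whereClauseStr,
                                 &rangeTableStr, &attnameListStr,
                                 &notnullListStr, &nullListStr,
                                 &convertListStr);
    shm_toc_estimate_chunk(&pcxt->estimator, strsize);
    shm_toc_estimate_keys(&pcxt->estimator, 1);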
I did performance testing on the v7 patch set [1] with custom postgresql.conf [2]. The results are of the triplet form (exec time in sec, number of workers, gain)

Use case 1: 10 million rows, 5.2GB data, 2 indexes on integer columns, 1 index on text column, binary file:
(1104.898, 0, 1X), (1112.221, 1, 1X), (640.236, 2, 1.72X), (335.090, 4, 3.3X), (200.492, 8, 5.51X), (131.448, 16, 8.4X), (121.832, 20, 9.1X), (124.287, 30, 8.9X)

Use case 2: 10 million rows, 5.2GB data, 2 indexes on integer columns, 1 index on text column, copy from stdin, csv format:
(1203.282, 0, 1X), (1135.517, 1, 1.06X), (655.140, 2, 1.84X), (343.688, 4, 3.5X), (203.742, 8, 5.9X), (144.793, 16, 8.31X), (133.339, 20, 9.02X), (136.672, 30, 8.8X)

Use case 3: 10 million rows, 5.2GB data, 2 indexes on integer columns, 1 index on text column, text file:
(1165.991, 0, 1X), (1128.599, 1, 1.03X), (644.793, 2, 1.81X), (342.813, 4, 3.4X), (204.279, 8, 5.71X), (139.986, 16, 8.33X), (128.259, 20, 9.1X), (132.764, 30, 8.78X)

The above results are similar to the results with earlier versions of the patch set.

On Fri, Oct 9, 2020 at 3:26 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> Sure, you need to change the code such that when force_parallel_mode =
> 'regress' is specified then it always uses one worker. This is
> primarily for testing purposes and will help during the development of
> this patch, as it will make all existing Copy tests use quite a good
> portion of the parallel infrastructure.
>

I performed force_parallel_mode = regress testing and found 2 issues; the fixes for the same are available in the v7 patch set [1].

> > Overall, we have the below test cases to cover the code and to measure
> > performance. We plan to run these tests whenever a new set of patches
> > is posted.
> >
> > 1. csv
> > 2. binary
>
> Don't we need the tests for plain text files as well?
>

I added a text use case; the above mentioned are the perf results on the v7 patch set [1].

> > 3. force parallel mode = regress
> > 4. toast data csv and binary
> > 5. foreign key check, before row, after row, before statement, after
> > statement, instead of triggers
> > 6. partition case
> > 7. foreign partitions and partitions having trigger cases
> > 8. where clause having parallel unsafe and safe expression, default
> > parallel unsafe and safe expression
> > 9. temp, global, local, unlogged, inherited tables cases, foreign tables
> >
>
> Sounds like good coverage. So, are you doing all this testing
> manually? How are you maintaining these tests?
>

All the test cases listed above, except for the cases that are meant to measure perf gain with huge data, are present in the v7-0005 patch in the v7 patch set [1].

[1] https://www.postgresql.org/message-id/CALDaNm1n1xW43neXSGs%3Dc7zt-mj%2BJHHbubWBVDYT9NfCoF8TuQ%40mail.gmail.com
[2]
shared_buffers = 40GB
max_worker_processes = 32
max_parallel_maintenance_workers = 24
max_parallel_workers = 32
synchronous_commit = off
checkpoint_timeout = 1d
max_wal_size = 24GB
min_wal_size = 15GB
autovacuum = off

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Oct 9, 2020 at 10:42 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Oct 8, 2020 at 12:14 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > >
> > > I am convinced by the reason given by Kyotaro-San in that another
> > > thread [1] and performance data shown by Peter that this can't be an
> > > independent improvement and rather in some cases it can do harm. Now,
> > > if you need it for a parallel-copy path then we can change it
> > > specifically to the parallel-copy code path but I don't understand
> > > your reason completely.
> > >
> >
> > Whenever we need data to be populated, we will get a new data block &
> > pass it to CopyGetData to populate the data. In case of file copy, the
> > server will completely fill the data block. We expect the data to be
> > filled completely. If data is available it will completely load the
> > complete data block in case of file copy. There is no scenario where
> > even if data is present a partial data block will be returned except
> > for EOF or no data available. But in case of STDIN data copy, even
> > though there is 8K data available in data block & 8K data available in
> > STDIN, CopyGetData will return as soon as libpq buffer data is more
> > than the minread. We will pass new data block every time to load data.
> > Every time we pass an 8K data block but CopyGetData loads a few bytes
> > in the new data block & returns. I wanted to keep the same data
> > population logic for both file copy & STDIN copy i.e copy full 8K data
> > blocks & then the populated data can be required. There is an
> > alternative solution I can have some special handling in case of STDIN
> > wherein the existing data block can be passed with the index from
> > where the data should be copied. Thoughts?
> >
>
> What you are proposing as an alternative solution, isn't that what we
> are doing without the patch? IIUC, you require this because of your
> corresponding changes to handle COPY_NEW_FE in CopyReadLine(), is that
> right? If so, what is the difficulty in making it behave similar to
> the non-parallel case?
>
The alternate solution is similar to how the existing copy handles STDIN copies. I have made changes in the v7 patch attached in [1] to have parallel copy handle STDIN data similar to non-parallel copy, so the original comment on why this change was required has been removed from the 0001 patch:
> > + if (cstate->copy_dest == COPY_NEW_FE)
> > + minread = RAW_BUF_SIZE - nbytes;
> > +
> > inbytes = CopyGetData(cstate, cstate->raw_buf + nbytes,
> > - 1, RAW_BUF_SIZE - nbytes);
> > + minread, RAW_BUF_SIZE - nbytes);
> >
> > No comment to explain why this change is done?
[1] https://www.postgresql.org/message-id/CALDaNm1n1xW43neXSGs%3Dc7zt-mj%2BJHHbubWBVDYT9NfCoF8TuQ%40mail.gmail.com
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Fri, Oct 9, 2020 at 11:01 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Oct 8, 2020 at 12:14 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Mon, Sep 28, 2020 at 12:19 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > > + */
> > > > > +typedef struct ParallelCopyLineBoundary
> > > > >
> > > > > Are we doing all this state management to avoid using locks while
> > > > > processing lines? If so, I think we can use either spinlock or LWLock
> > > > > to keep the main patch simple and then provide a later patch to make
> > > > > it lock-less. This will allow us to first focus on the main design of
> > > > > the patch rather than trying to make this datastructure processing
> > > > > lock-less in the best possible way.
> > > > >
> > > >
> > > > The steps will be more or less the same if we use spinlock too. Step 1,
> > > > step 3 & step 4 will be common; we have to use lock & unlock instead of
> > > > step 2 & step 5. I feel we can retain the current implementation.
> > > >
> > >
> > > I'll study this in detail and let you know my opinion on the same but
> > > in the meantime, I don't follow one part of this comment: "If they
> > > don't follow this order the worker might process wrong line_size and
> > > leader might populate the information which worker has not yet
> > > processed or in the process of processing."
> > >
> > > Do you want to say that leader might overwrite some information which
> > > worker hasn't read yet? If so, it is not clear from the comment.
> > > Another minor point about this comment:
> > >
> >
> > Here leader and worker must follow these steps to avoid any corruption
> > or hang issue. Changed it to:
> > * The leader & worker process access the shared line information by following
> > * the below steps to avoid any data corruption or hang:
> >
>
> Actually, I wanted more on the lines of why such corruption or hang can
> happen? It might help reviewers to understand why you have followed
> such a sequence.

There are 3 variables which the leader & worker are working on: line_size, line_state & data. The leader will update line_state, populate the data, and then update line_size & line_state. Workers will wait for line_state to be updated; once it is updated, the worker will read the data based on the line_size. If they are not synchronized, a wrong line_size can be set, a wrong amount of data read, and anything can happen. This is the usual concurrency case with readers/writers, so I felt that much detail need not be mentioned.

> > > How did you ensure that this is fixed? Have you tested it, if so
> > > please share the test? I see a basic problem with your fix.
> > >
> > > + /* Report WAL/buffer usage during parallel execution */
> > > + bufferusage = shm_toc_lookup(toc, PARALLEL_COPY_BUFFER_USAGE, false);
> > > + walusage = shm_toc_lookup(toc, PARALLEL_COPY_WAL_USAGE, false);
> > > + InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
> > > +                       &walusage[ParallelWorkerNumber]);
> > >
> > > You need to call InstrStartParallelQuery() before the actual operation
> > > starts; without that the stats won't be accurate. Also, after calling
> > > WaitForParallelWorkersToFinish(), you need to accumulate the stats
> > > collected from workers, which neither you have done nor is possible
> > > with the current code in your patch because you haven't made any
> > > provision to capture them in BeginParallelCopy.
> > >
> > > I suggest you look into lazy_parallel_vacuum_indexes() and
> > > begin_parallel_vacuum() to understand how the buffer/wal usage stats
> > > are accumulated. Also, please test this functionality using
> > > pg_stat_statements.
> > >
> >
> > Made changes accordingly. I have verified it using
> > "select * from pg_stat_statements where query like '%copy%';"
> > (showing only the relevant columns of the output):
> >
> > query: copy hw from '/home/vignesh/postgres/postgres/inst/bin/hw_175000.csv'
> >        with(format csv, delimiter ',')
> >   calls 1 | total_exec_time 265.195105 | rows 175000 |
> >   wal_records 1116 | wal_fpi 0 | wal_bytes 3587203
> >
> > query: copy hw from '/home/vignesh/postgres/postgres/inst/bin/hw_175000.csv'
> >        with(format csv, delimiter ',', parallel '2')
> >   calls 1 | total_exec_time 35668.402482 | rows 175000 |
> >   wal_records 1119 | wal_fpi 6 | wal_bytes 3624405
> >
>
> I am not able to properly parse the data but if I understand correctly,
> the wal data for the non-parallel (1116 | 0 | 3587203) and parallel
> (1119 | 6 | 3624405) case doesn't seem to be the same. Is that right?
> If so, why? Please ensure that no checkpoint happens for both cases.
>

I have disabled checkpoints; the results with checkpoints disabled are given below:

                         | wal_records | wal_fpi | wal_bytes
Sequential Copy          |        1116 |       0 |   3587669
Parallel Copy(1 worker)  |        1116 |       0 |   3587669
Parallel Copy(4 worker)  |        1121 |       0 |   3587668

I noticed that for 1 worker, wal_records & wal_bytes are the same as for sequential copy, but with a higher worker count there is a difference in wal_records & wal_bytes. I think the difference should be ok, because with more than 1 worker the order in which the records are processed differs based on which worker picks which records from the input file. In the case of sequential copy/1 worker the records are always processed in the same order, hence wal_bytes are the same.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
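The accumulation pattern being asked for above, modeled on what parallel vacuum does in lazy_parallel_vacuum_indexes()/begin_parallel_vacuum(), is roughly the following sketch (the bufferusage/walusage array names match the snippet quoted above; the surrounding code is illustrative, not the patch itself):

    /* Worker (ParallelCopyMain), before doing any copy work: */
    InstrStartParallelQuery();

    /* ... worker performs its share of the copy ... */

    /* Worker, once done: publish its counters into the shared arrays. */
    InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
                          &walusage[ParallelWorkerNumber]);

    /* Leader, after WaitForParallelWorkersToFinish(pcxt): */
    for (int i = 0; i < pcxt->nworkers_launched; i++)
        InstrAccumParallelQuery(&bufferusage[i], &walusage[i]);

InstrStartParallelQuery(), InstrEndParallelQuery() and InstrAccumParallelQuery() are the existing instrument.c entry points; only their placement in the copy code here is an assumption.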
On Sat, Oct 3, 2020 at 6:20 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> Hello Vignesh,
>
> I've done some basic benchmarking on the v4 version of the patches (but
> AFAIKC the v5 should perform about the same), and some initial review.
>
> For the benchmarking, I used the lineitem table from TPC-H - for 75GB
> data set, this largest table is about 64GB once loaded, with another
> 54GB in 5 indexes. This is on a server with 32 cores, 64GB of RAM and
> NVME storage.
>
> The COPY duration with varying number of workers (specified using the
> parallel COPY option) looks like this:
>
> workers duration
> ---------------------
> 0 1366
> 1 1255
> 2 704
> 3 526
> 4 434
> 5 385
> 6 347
> 7 322
> 8 327
>
> So this seems to work pretty well - initially we get almost linear
> speedup, then it slows down (likely due to contention for locks, I/O
> etc.). Not bad.
Thanks for testing with different workers & posting the results.
> I've only done a quick review, but overall the patch looks in fairly
> good shape.
>
> 1) I don't quite understand why we need INCREMENTPROCESSED and
> RETURNPROCESSED, considering it just does ++ or return. It just
> obfuscated the code, I think.
>
I have removed the macros.
> 2) I find it somewhat strange that BeginParallelCopy can just decide not
> to do parallel copy after all. Why not to do this decisions in the
> caller? Or maybe it's fine this way, not sure.
>
I have moved the check IsParallelCopyAllowed to the caller.
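With that move, the caller-side decision could look roughly like this sketch (the function names are from the patch set; the surrounding control flow and return conventions are illustrative assumptions):

    /* Try parallel copy only when the user asked for it and it is safe. */
    if (cstate->nworkers > 0 && IsParallelCopyAllowed(cstate))
        pcxt = BeginParallelCopy(cstate->nworkers, cstate, stmt->attlist,
                                 relid);

    if (pcxt)
        *processed = ParallelCopyFrom(cstate);  /* leader + workers */
    else
        *processed = CopyFrom(cstate);          /* fall back to serial copy */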
> 3) AFAIK we don't modify typedefs.list in patches, so these changes
> should be removed.
>
I have seen typedefs.list being changed in many commits, and it also helps in running pgindent, so I'm retaining this change.
> 4) IsTriggerFunctionParallelSafe actually checks all triggers, not just
> one, so the comment needs minor rewording.
>
Modified the comments.
Thanks for the comments & for sharing the test results, Tomas. These changes are fixed in one of my earlier mails [1].
[1] https://www.postgresql.org/message-id/CALDaNm1n1xW43neXSGs%3Dc7zt-mj%2BJHHbubWBVDYT9NfCoF8TuQ%40mail.gmail.com
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Thu, Oct 8, 2020 at 8:43 AM Greg Nancarrow <gregn4422@gmail.com> wrote:
>
> On Thu, Oct 8, 2020 at 5:44 AM vignesh C <vignesh21@gmail.com> wrote:
>
> > Attached v6 patch with the fixes.
> >
>
> Hi Vignesh,
>
> I noticed a couple of issues when scanning the code in the following patch:
>
> v6-0003-Allow-copy-from-command-to-process-data-from-file.patch
>
> In the following code, it will put a junk uint16 value into *destptr
> (and thus may well cause a crash) on a Big Endian architecture
> (Solaris Sparc, s390x, etc.):
> You're storing a (uint16) string length in a uint32 and then pulling
> out the lower two bytes of the uint32 and copying them into the
> location pointed to by destptr.
>
>
> static void
> +CopyStringToSharedMemory(CopyState cstate, char *srcPtr, char *destptr,
> + uint32 *copiedsize)
> +{
> + uint32 len = srcPtr ? strlen(srcPtr) + 1 : 0;
> +
> + memcpy(destptr, (uint16 *) &len, sizeof(uint16));
> + *copiedsize += sizeof(uint16);
> + if (len)
> + {
> + memcpy(destptr + sizeof(uint16), srcPtr, len);
> + *copiedsize += len;
> + }
> +}
>
> I suggest you change the code to:
>
> uint16 len = srcPtr ? (uint16)strlen(srcPtr) + 1 : 0;
> memcpy(destptr, &len, sizeof(uint16));
>
> [I assume string length here can't ever exceed (65535 - 1), right?]
>
> Looking a bit deeper into this, I'm wondering if in fact your
> EstimateStringSize() and EstimateNodeSize() functions should be using
> BUFFERALIGN() for EACH stored string/node (rather than just calling
> shm_toc_estimate_chunk() once at the end, after the length of packed
> strings and nodes has been estimated), to ensure alignment of start of
> each string/node. Other Postgres code appears to be aligning each
> stored chunk using shm_toc_estimate_chunk(). See the definition of
> that macro and its current usages.
>
I'm not handling this; it is similar to how it is handled in other places.
> Then you could safely use:
>
> uint16 len = srcPtr ? (uint16)strlen(srcPtr) + 1 : 0;
> *(uint16 *)destptr = len;
> *copiedsize += sizeof(uint16);
> if (len)
> {
> memcpy(destptr + sizeof(uint16), srcPtr, len);
> *copiedsize += len;
> }
>
> and in the CopyStringFromSharedMemory() function, then could safely use:
>
> len = *(uint16 *)srcPtr;
>
> The compiler may be smart enough to optimize-away the memcpy() in this
> case anyway, but there are issues in doing this for architectures that
> take a performance hit for unaligned access, or don't support
> unaligned access.
Changed it to uint32, so that there are no issues in case the length exceeds 65535, and also to avoid problems on Big Endian architectures.
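With that change, the helper would look roughly like the sketch below (matching uint32 types on both sides of the memcpy, so the store is safe regardless of endianness; the cstate parameter is kept only to mirror the signature quoted above):

    static void
    CopyStringToSharedMemory(CopyState cstate, char *srcPtr, char *destptr,
                             uint32 *copiedsize)
    {
        uint32      len = srcPtr ? strlen(srcPtr) + 1 : 0;

        /* Length prefix and length variable now have the same width. */
        memcpy(destptr, &len, sizeof(uint32));
        *copiedsize += sizeof(uint32);
        if (len)
        {
            memcpy(destptr + sizeof(uint32), srcPtr, len);
            *copiedsize += len;
        }
    }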
> Also, in CopyXXXXFromSharedMemory() functions, you should use palloc()
> instead of palloc0(), as you're filling the entire palloc'd buffer
> anyway, so no need to ask for additional MemSet() of all buffer bytes
> to 0 prior to memcpy().
>
I have changed palloc0 to palloc.
Thanks Greg for reviewing & providing your comments. These changes are fixed in one of my earlier mail [1] that I sent.
[1] https://www.postgresql.org/message-id/CALDaNm1n1xW43neXSGs%3Dc7zt-mj%2BJHHbubWBVDYT9NfCoF8TuQ%40mail.gmail.com
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Wed, Oct 14, 2020 at 6:51 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Fri, Oct 9, 2020 at 11:01 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > I am not able to properly parse the data but if I understand correctly,
> > the wal data for the non-parallel (1116 | 0 | 3587203) and parallel
> > (1119 | 6 | 3624405) case doesn't seem to be the same. Is that right?
> > If so, why? Please ensure that no checkpoint happens for both cases.
> >
>
> I have disabled checkpoints; the results with checkpoints disabled
> are given below:
>                          | wal_records | wal_fpi | wal_bytes
> Sequential Copy          |        1116 |       0 |   3587669
> Parallel Copy(1 worker)  |        1116 |       0 |   3587669
> Parallel Copy(4 worker)  |        1121 |       0 |   3587668
> I noticed that for 1 worker, wal_records & wal_bytes are the same as
> for sequential copy, but with a higher worker count there is a
> difference in wal_records & wal_bytes. I think the difference should
> be ok, because with more than 1 worker the order in which the records
> are processed differs based on which worker picks which records from
> the input file. In the case of sequential copy/1 worker the records
> are always processed in the same order, hence wal_bytes are the same.
>

Are all records of the same size in your test? If so, then why should the
order matter? Also, even though the number of wal_records has increased,
wal_bytes has not increased; rather it is one byte less. Can we identify
what is going on here? I don't intend to say that it is a problem but we
should know the reason clearly.

--
With Regards,
Amit Kapila.
Hi Vignesh,

After having a look over the patch, I have some suggestions for
0003-Allow-copy-from-command-to-process-data-from-file.patch.

1.

+static uint32
+EstimateCstateSize(ParallelContext *pcxt, CopyState cstate, List *attnamelist,
+                   char **whereClauseStr, char **rangeTableStr,
+                   char **attnameListStr, char **notnullListStr,
+                   char **nullListStr, char **convertListStr)
+{
+    uint32 strsize = MAXALIGN(sizeof(SerializedParallelCopyState));
+
+    strsize += EstimateStringSize(cstate->null_print);
+    strsize += EstimateStringSize(cstate->delim);
+    strsize += EstimateStringSize(cstate->quote);
+    strsize += EstimateStringSize(cstate->escape);

It uses the function EstimateStringSize to get the strlen of null_print, delim, quote and escape. But the length of null_print seems to have been stored in null_print_len already, and delim/quote/escape must be 1 byte, so calling strlen again seems unnecessary.

How about "strsize += sizeof(uint32) + cstate->null_print_len + 1"?

2.

+    strsize += EstimateNodeSize(cstate->whereClause, whereClauseStr);

+    copiedsize += CopyStringToSharedMemory(cstate, whereClauseStr,
+                                           shmptr + copiedsize);

Some string lengths are counted two times. For whereClauseStr, strlen is called once in EstimateNodeSize and again in CopyStringToSharedMemory. I don't know whether it's worth refactoring the code to avoid the duplicate strlen. What do you think?

Best regards,
houzj
On Sun, Oct 18, 2020 at 7:47 AM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote:
>
> Hi Vignesh,
>
> After having a look over the patch, I have some suggestions for
> 0003-Allow-copy-from-command-to-process-data-from-file.patch.
>
> 1.
>
> +static uint32
> +EstimateCstateSize(ParallelContext *pcxt, CopyState cstate, List *attnamelist,
> +                   char **whereClauseStr, char **rangeTableStr,
> +                   char **attnameListStr, char **notnullListStr,
> +                   char **nullListStr, char **convertListStr)
> +{
> +    uint32 strsize = MAXALIGN(sizeof(SerializedParallelCopyState));
> +
> +    strsize += EstimateStringSize(cstate->null_print);
> +    strsize += EstimateStringSize(cstate->delim);
> +    strsize += EstimateStringSize(cstate->quote);
> +    strsize += EstimateStringSize(cstate->escape);
>
> It uses the function EstimateStringSize to get the strlen of null_print,
> delim, quote and escape. But the length of null_print seems to have been
> stored in null_print_len already, and delim/quote/escape must be 1 byte,
> so calling strlen again seems unnecessary.
>
> How about "strsize += sizeof(uint32) + cstate->null_print_len + 1"?
>

+1. This seems like a good suggestion, but add comments for
delim/quote/escape to indicate that we are considering one byte for each.
I think this will obviate the need for the function EstimateStringSize.
Another thing in this regard is that we normally use the add_size function
to compute the size, but I don't see that being used in this and the nearby
computation. That helps us to detect overflow of addition, if any.

EstimateCstateSize()
{
..
+
+ strsize++;
..
}

Why do we need this additional one-byte increment? Does it make sense to
add a small comment for the same?

> 2.
>
> +    strsize += EstimateNodeSize(cstate->whereClause, whereClauseStr);
>
> +    copiedsize += CopyStringToSharedMemory(cstate, whereClauseStr,
> +                                           shmptr + copiedsize);
>
> Some string lengths are counted two times. For whereClauseStr, strlen is
> called once in EstimateNodeSize and again in CopyStringToSharedMemory.
> I don't know whether it's worth refactoring the code to avoid the
> duplicate strlen. What do you think?
>

It doesn't seem worth it to me. We would probably need to use additional
variables to save those lengths. I think it will add more code/complexity
than we will save. See EstimateParamListSpace and SerializeParamList where
we get the typeLen each time; that way the code looks neat to me, and we
are not going to save much by not following a similar thing here.

--
With Regards,
Amit Kapila.
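Combining Hou's suggestion with add_size(), the estimation could be written roughly like this (null_print_len is an existing CopyStateData field; add_size() is the existing overflow-checked helper operating on Size, so the counter type changes accordingly):

    Size        strsize = MAXALIGN(sizeof(SerializedParallelCopyState));

    /* length word + string bytes + terminator, with overflow checking */
    strsize = add_size(strsize, sizeof(uint32) + cstate->null_print_len + 1);
    strsize = add_size(strsize, sizeof(uint32) + 2);    /* delim: 1 byte + '\0' */
    strsize = add_size(strsize, sizeof(uint32) + 2);    /* quote: 1 byte + '\0' */
    strsize = add_size(strsize, sizeof(uint32) + 2);    /* escape: 1 byte + '\0' */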
On Thu, Oct 15, 2020 at 2:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Oct 14, 2020 at 6:51 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > I have disabled checkpoints; the results with checkpoints disabled
> > are given below:
> >                          | wal_records | wal_fpi | wal_bytes
> > Sequential Copy          |        1116 |       0 |   3587669
> > Parallel Copy(1 worker)  |        1116 |       0 |   3587669
> > Parallel Copy(4 worker)  |        1121 |       0 |   3587668
> > I noticed that for 1 worker, wal_records & wal_bytes are the same as
> > for sequential copy, but with a higher worker count there is a
> > difference in wal_records & wal_bytes. I think the difference should
> > be ok, because with more than 1 worker the order in which the records
> > are processed differs based on which worker picks which records from
> > the input file. In the case of sequential copy/1 worker the records
> > are always processed in the same order, hence wal_bytes are the same.
> >
>
> Are all records of the same size in your test? If so, then why should
> the order matter? Also, even though the number of wal_records has
> increased, wal_bytes has not increased; rather it is one byte less.
> Can we identify what is going on here? I don't intend to say that it
> is a problem but we should know the reason clearly.

The earlier run that I executed was with varying record sizes. The below results are after modifying the records to all be the same size:

                          | wal_records | wal_fpi | wal_bytes
Sequential Copy           |        1307 |       0 |   4198526
Parallel Copy(1 worker)   |        1307 |       0 |   4198526
Parallel Copy(2 worker)   |        1308 |       0 |   4198836
Parallel Copy(4 worker)   |        1307 |       0 |   4199147
Parallel Copy(8 worker)   |        1312 |       0 |   4199735
Parallel Copy(16 worker)  |        1313 |       0 |   4200311

Still I noticed some difference in wal_records & wal_bytes. I feel the difference is because of the following: each worker prepares 1000 tuples and then does a heap_multi_insert for those 1000 tuples. In our case approximately 185 tuples fit in 1 page, so 925 tuples are stored in 5 WAL records and the remaining 75 tuples in the next WAL record. The wal dump is like below:

rmgr: Heap2 len (rec/tot): 3750/ 3750, tx: 510, lsn: 0/0160EC80, prev 0/0160DDB0, desc: MULTI_INSERT+INIT 185 tuples flags 0x00, blkref #0: rel 1663/13751/16384 blk 0
rmgr: Heap2 len (rec/tot): 3750/ 3750, tx: 510, lsn: 0/0160FB28, prev 0/0160EC80, desc: MULTI_INSERT+INIT 185 tuples flags 0x00, blkref #0: rel 1663/13751/16384 blk 1
rmgr: Heap2 len (rec/tot): 3750/ 3750, tx: 510, lsn: 0/016109E8, prev 0/0160FB28, desc: MULTI_INSERT+INIT 185 tuples flags 0x00, blkref #0: rel 1663/13751/16384 blk 2
rmgr: Heap2 len (rec/tot): 3750/ 3750, tx: 510, lsn: 0/01611890, prev 0/016109E8, desc: MULTI_INSERT+INIT 185 tuples flags 0x00, blkref #0: rel 1663/13751/16384 blk 3
rmgr: Heap2 len (rec/tot): 3750/ 3750, tx: 510, lsn: 0/01612750, prev 0/01611890, desc: MULTI_INSERT+INIT 185 tuples flags 0x00, blkref #0: rel 1663/13751/16384 blk 4
rmgr: Heap2 len (rec/tot): 1550/ 1550, tx: 510, lsn: 0/016135F8, prev 0/01612750, desc: MULTI_INSERT+INIT 75 tuples flags 0x02, blkref #0: rel 1663/13751/16384 blk 5

After the first 1000 tuples are inserted, when the worker tries to insert another 1000 tuples it will reuse the last page, which had free space where 110 more tuples fit:

rmgr: Heap2 len (rec/tot): 2470/ 2470, tx: 510, lsn: 0/01613C08, prev 0/016135F8, desc: MULTI_INSERT 110 tuples flags 0x00, blkref #0: rel 1663/13751/16384 blk 5
rmgr: Heap2 len (rec/tot): 3750/ 3750, tx: 510, lsn: 0/016145C8, prev 0/01613C08, desc: MULTI_INSERT+INIT 185 tuples flags 0x00, blkref #0: rel 1663/13751/16384 blk 6
rmgr: Heap2 len (rec/tot): 3750/ 3750, tx: 510, lsn: 0/01615470, prev 0/016145C8, desc: MULTI_INSERT+INIT 185 tuples flags 0x00, blkref #0: rel 1663/13751/16384 blk 7
rmgr: Heap2 len (rec/tot): 3750/ 3750, tx: 510, lsn: 0/01616330, prev 0/01615470, desc: MULTI_INSERT+INIT 185 tuples flags 0x00, blkref #0: rel 1663/13751/16384 blk 8
rmgr: Heap2 len (rec/tot): 3750/ 3750, tx: 510, lsn: 0/016171D8, prev 0/01616330, desc: MULTI_INSERT+INIT 185 tuples flags 0x00, blkref #0: rel 1663/13751/16384 blk 9
rmgr: Heap2 len (rec/tot): 3050/ 3050, tx: 510, lsn: 0/01618098, prev 0/016171D8, desc: MULTI_INSERT+INIT 150 tuples flags 0x02, blkref #0: rel 1663/13751/16384 blk 10

This behavior is the same for sequential copy and copy with 1 worker, as the sequence of inserts and the pages used to insert are in the same order. Two reasons together result in the varying wal_bytes & wal_records with multiple workers:
1) When more than 1 worker is involved, the sequence in which the pages will be selected is not guaranteed; the MULTI_INSERT tuple counts vary, and the MULTI_INSERT/MULTI_INSERT+INIT descriptions vary.
2) wal_records will increase with more workers, because when the tuples are split across the workers, one of the workers will have a few more WAL records; the last heap_multi_insert gets split across the workers and generates new wal records like:

rmgr: Heap2 len (rec/tot): 600/ 600, tx: 510, lsn: 0/019F8B08, prev 0/019F7C48, desc: MULTI_INSERT 25 tuples flags 0x00, blkref #0: rel 1663/13751/16384 blk 1065

Attached is the tar of the wal file dump which was used for this analysis.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Attachment
On Fri, Oct 9, 2020 at 2:52 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Tue, Sep 29, 2020 at 6:30 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > 2. Do we have tests for toast tables? I think if you implement the
> > previous point some existing tests might cover it but I feel we should
> > have at least one or two tests for the same.
> >
> Toast table use case 1: 10000 tuples, 9.6GB data, 3 indexes 2 on integer columns, 1 on text column(not the toast column), csv file, each row is > 1320KB:
> (222.767, 0, 1X), (134.171, 1, 1.66X), (93.749, 2, 2.38X), (93.672, 4, 2.38X), (94.827, 8, 2.35X), (93.766, 16, 2.37X), (98.153, 20, 2.27X), (122.721, 30, 1.81X)
>
> Toast table use case 2: 100000 tuples, 96GB data, 3 indexes 2 on integer columns, 1 on text column(not the toast column), csv file, each row is > 1320KB:
> (2255.032, 0, 1X), (1358.628, 1, 1.66X), (901.170, 2, 2.5X), (912.743, 4, 2.47X), (988.718, 8, 2.28X), (938.000, 16, 2.4X), (997.556, 20, 2.26X), (1000.586, 30, 2.25X)
>
> Toast table use case3: 10000 tuples, 9.6GB, no indexes, binary file, each row is > 1320KB:
> (136.983, 0, 1X), (136.418, 1, 1X), (81.896, 2, 1.66X), (62.929, 4, 2.16X), (52.311, 8, 2.6X), (40.032, 16, 3.49X), (44.097, 20, 3.09X), (62.310, 30, 2.18X)
>
> In the case of a Toast table, we could achieve upto 2.5X for csv files, and 3.5X for binary files. We are analyzing this point and will post an update on our findings soon.
>
I analyzed the above point of getting only up to 2.5X performance improvement for csv files with a toast table having 3 indexes - 2 on integer columns and 1 on a text column (not the toast column). The reason is that the workers are fast enough to do the work and they are waiting for the leader to fill in the data blocks, and in this case the leader is able to serve the workers at its maximum possible speed. Hence, most of the time the workers are waiting, not doing any beneficial work.
Having observed the above point, I tried to make the workers perform more work to avoid the waiting time. For this, I added a gist index on the toasted text column. The use case and results are as follows.
Toast table use case4: 10000 tuples, 9.6GB, 4 indexes - 2 on integer columns, 1 on non-toasted text column and 1 gist index on toasted text column, csv file, each row is ~ 12.2KB:
(1322.839, 0, 1X), (1261.176, 1, 1.05X), (632.296, 2, 2.09X), (321.941, 4, 4.11X), (181.796, 8, 7.27X), (105.750, 16, 12.51X), (107.099, 20, 12.35X), (123.262, 30, 10.73X)
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
On Mon, Oct 19, 2020 at 2:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sun, Oct 18, 2020 at 7:47 AM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote:
> >
> > Hi Vignesh,
> >
> > After having a look over the patch, I have some suggestions for
> > 0003-Allow-copy-from-command-to-process-data-from-file.patch.
> >
> > 1.
> >
> > +static uint32
> > +EstimateCstateSize(ParallelContext *pcxt, CopyState cstate, List *attnamelist,
> > +                   char **whereClauseStr, char **rangeTableStr,
> > +                   char **attnameListStr, char **notnullListStr,
> > +                   char **nullListStr, char **convertListStr)
> > +{
> > +    uint32 strsize = MAXALIGN(sizeof(SerializedParallelCopyState));
> > +
> > +    strsize += EstimateStringSize(cstate->null_print);
> > +    strsize += EstimateStringSize(cstate->delim);
> > +    strsize += EstimateStringSize(cstate->quote);
> > +    strsize += EstimateStringSize(cstate->escape);
> >
> > It uses the function EstimateStringSize to get the strlen of null_print,
> > delim, quote and escape. But the length of null_print seems to have been
> > stored in null_print_len already, and delim/quote/escape must be 1 byte,
> > so calling strlen again seems unnecessary.
> >
> > How about "strsize += sizeof(uint32) + cstate->null_print_len + 1"?
> >
>
> +1. This seems like a good suggestion, but add comments for
> delim/quote/escape to indicate that we are considering one byte for
> each. I think this will obviate the need for the function
> EstimateStringSize. Another thing in this regard is that we normally
> use the add_size function to compute the size, but I don't see that
> being used in this and the nearby computation. That helps us to detect
> overflow of addition, if any.
>
> EstimateCstateSize()
> {
> ..
> +
> + strsize++;
> ..
> }
>
> Why do we need this additional one-byte increment? Does it make sense
> to add a small comment for the same?
>

Changed it to handle null_print, delim, quote & escape accordingly in the attached patch. The one-byte increment is not required, so I have removed it.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Attachment
- v8-0001-Copy-code-readjustment-to-support-parallel-copy.patch
- v8-0002-Framework-for-leader-worker-in-parallel-copy.patch
- v8-0003-Allow-copy-from-command-to-process-data-from-file.patch
- v8-0004-Documentation-for-parallel-copy.patch
- v8-0005-Tests-for-parallel-copy.patch
- v8-0006-Parallel-Copy-For-Binary-Format-Files.patch
On Thu, Oct 8, 2020 at 11:15 AM vignesh C <vignesh21@gmail.com> wrote:
>
> I'm summarizing the pending open points so that I don't miss anything:
> 1) Performance test on latest patch set.
It is tested and the results are shared by Bharath at [1]
> 2) Testing points suggested.
Tests are added as suggested and the details are shared by Bharath at [1]
> 3) Support of parallel copy for COPY_OLD_FE.
It is handled as part of v8 patch shared at [2]
> 4) Worker has to hop through all the processed chunks before getting
> the chunk which it can process.
Open
> 5) Handling of Tomas's comments.
I have fixed and updated the fix details as part of [3]
> 6) Handling of Greg's comments.
I have fixed and updated the fix details as part of [4]
Except for "4) Worker has to hop through all the processed chunks before getting the chunk which it can process", all open tasks are handled. I will work on this and provide an update shortly.
[1] https://www.postgresql.org/message-id/CALj2ACWeQVd-xoQZHGT01_33St4xPoZQibWz46o7jW1PE3XOqQ%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CALDaNm2UcmCMozcbKL8B7az9oYd9hZ+fNDcZHSSiiQJ4v-xN0Q@mail.gmail.com
[3] https://www.postgresql.org/message-id/CALDaNm0_zUa9%2BS%3DpwCz3Yp43SY3r9bnO4v-9ucXUujEE%3D0Sd7g%40mail.gmail.com
[4] https://www.postgresql.org/message-id/CALDaNm31pGG%2BL9N4HbM0mO4iuceih4mJ5s87jEwOPaFLpmDKyQ%40mail.gmail.com
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Hi Vignesh,

I took a look at the v8 patch set. Here are some comments:

1. PopulateCommonCstateInfo() -- can we use PopulateCommonCStateInfo() or PopulateCopyStateInfo()? And also EstimateCstateSize() -- EstimateCStateSize(), PopulateCstateCatalogInfo() -- PopulateCStateCatalogInfo()?

2. Instead of mentioning numbers like 1024, 64K, 10240 in the comments, can we represent them in terms of macros?

/* It can hold 1024 blocks of 64K data in DSM to be processed by the worker. */
#define MAX_BLOCKS_COUNT 1024

/* It can hold upto 10240 record information for worker to process. */ RINGSIZE

3. How about

" Each worker at once will pick the WORKER_CHUNK_COUNT records from the DSM data blocks and store them in it's local memory. This is to make workers not contend much while getting record information from the DSM. Read RINGSIZE comments before changing this value. "

instead of

/*
 * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data
 * block to process to avoid lock contention. Read RINGSIZE comments before
 * changing this value.
 */

4. How about one line of gap before and after the comments "Leader should operate in the following order:" and "Worker should operate in the following order:"?

5. Can we move the RAW_BUF_BYTES macro definition to the beginning of copy.h, where all the macros are defined?

6. I don't think we need the change in toast_internals.c with the temporary hack Assert(!(IsParallelWorker() && !currentCommandIdUsed)); in GetCurrentCommandId().

7. I think

/* Can't perform copy in parallel */
if (parallel_workers <= 0)
    return NULL;

can be

/* Can't perform copy in parallel */
if (parallel_workers == 0)
    return NULL;

as parallel_workers can never be < 0, since we enter BeginParallelCopy only if cstate->nworkers > 0, and we are also not allowed to have negative values for max_worker_processes.

8. Do we want to pfree(cstate->pcdata) in case we failed to start any parallel workers? We would have allocated a good amount of memory for it.

else
{
    /*
     * Reset nworkers to -1 here. This is useful in cases where user
     * specifies parallel workers, but, no worker is picked up, so go
     * back to non parallel mode value of nworkers.
     */
    cstate->nworkers = -1;
    *processed = CopyFrom(cstate);  /* copy from file to database */
}

9. Instead of calling CopyStringToSharedMemory() for each string variable, can't we just create a linked list of all the strings that need to be copied into shm and call CopyStringToSharedMemory() only once? We could avoid 5 function calls.

10. Similar to the above comment: can we fill all the required cstate->variables inside the function CopyNodeFromSharedMemory() and call it only once? In each worker we could save the overhead of 5 function calls.

11. Looks like CopyStringFromSharedMemory() and CopyNodeFromSharedMemory() do almost the same things except stringToNode() and pfree(destptr). Can we have a generic function CopyFromSharedMemory() or something else and handle the two use cases with a flag "bool isnode"?

12. Can we move the below check to the end in IsParallelCopyAllowed()?

/* Check parallel safety of the trigger functions. */
if (cstate->rel->trigdesc != NULL &&
    !CheckRelTrigFunParallelSafety(cstate->rel->trigdesc))
    return false;

13. CacheLineInfo(): Instead of goto empty_data_line_update; how about having this directly inside the if block, as it's being used only once?

14. GetWorkerLine(): How about avoiding the goto statements and replacing the common code with an always static inline function or a macro?

15. UpdateSharedLineInfo(): The below line is misaligned.

lineInfo->first_block = blk_pos;
lineInfo->start_offset = offset;

16. ParallelCopyFrom(): Do we need CHECK_FOR_INTERRUPTS(); at the start of for (;;)?

17. Remove the extra lines after #define IsHeaderLine() (cstate->header_line && cstate->cur_lineno == 1) in copy.h.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
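For comment 11, a merged helper could look roughly like the sketch below. The name comes from the review itself; the length-prefixed unpacking interface is an assumption based on the snippets in this thread, not the patch:

    /* Read one length-prefixed item; optionally deserialize it as a node. */
    static void *
    CopyFromSharedMemory(char *srcPtr, uint32 *copiedsize, bool isnode)
    {
        uint32      len;
        char       *ptr = NULL;

        memcpy(&len, srcPtr + *copiedsize, sizeof(uint32));
        *copiedsize += sizeof(uint32);

        if (len)
        {
            ptr = (char *) palloc(len);
            memcpy(ptr, srcPtr + *copiedsize, len);
            *copiedsize += len;

            if (isnode)
            {
                void       *node = stringToNode(ptr);

                pfree(ptr);     /* the string form is no longer needed */
                return node;
            }
        }
        return ptr;
    }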
On Wed, Oct 21, 2020 at 3:19 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> 9. Instead of calling CopyStringToSharedMemory() for each string
> variable, can't we just create a linked list of all the strings that
> need to be copied into shm and call CopyStringToSharedMemory() only
> once? We could avoid 5 function calls.
>

If we want to avoid the different function calls then can't we just store all these strings in a local structure and use it? That might improve the other parts of the code as well where we are using these as individual parameters.

> 10. Similar to the above comment: can we fill all the required
> cstate->variables inside the function CopyNodeFromSharedMemory() and
> call it only once? In each worker we could save the overhead of 5
> function calls.
>

Yeah, that makes sense.

--
With Regards,
Amit Kapila.
On Wed, Oct 21, 2020 at 3:18 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
>
> 17. Remove the extra lines after #define IsHeaderLine()
> (cstate->header_line && cstate->cur_lineno == 1) in copy.h.
>

I missed one comment:

18. I think we need to treat the number of parallel workers as an integer, similar to the parallel option in vacuum.

postgres=# copy t1 from stdin with(parallel '1');    <<<<< we should not allow this
Enter data to be copied followed by a newline.

postgres=# vacuum (parallel '1') t1;
ERROR: parallel requires an integer value

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
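For reference, VACUUM gets that behaviour from defGetInt32(), which rejects non-integer option values with exactly the error quoted above. A sketch of the same approach in COPY's option processing (the placement in the DefElem loop and the parallel_given duplicate-check flag are assumptions):

    else if (strcmp(defel->defname, "parallel") == 0)
    {
        if (parallel_given)
            ereport(ERROR,
                    (errcode(ERRCODE_SYNTAX_ERROR),
                     errmsg("conflicting or redundant options")));
        parallel_given = true;
        /* errors with "parallel requires an integer value" for parallel '1' */
        cstate->nworkers = defGetInt32(defel);
    }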
I had a brief look at this patch. Important work! A couple of first impressions:

1. The split between patches 0002-Framework-for-leader-worker-in-parallel-copy.patch and 0003-Allow-copy-from-command-to-process-data-from-file.patch is quite artificial. All the stuff introduced in the first is unused until the second patch is applied. The first patch introduces a forward declaration for ParallelCopyData(), but the function only comes in the second patch. The comments in the first patch talk about LINE_LEADER_POPULATING and LINE_LEADER_POPULATED, but the enum only comes in the second patch. I think these have to be merged into one. If you want to split it somehow, I'd suggest having a separate patch just to move CopyStateData from copy.c to copy.h. The subsequent patch would then be easier to read, as you could see more easily what's being added to CopyStateData. Actually, I think it would be better to have a new header file, copy_internal.h, to hold CopyStateData and the other structs, and keep copy.h as it is.

2. This desperately needs some kind of a high-level overview of how it works. What is a leader, what is a worker? Which process does each step of COPY processing, like reading from the file/socket, splitting the input into lines, handling escapes, calling input functions, and updating the heap and indexes? What data structures are used for the communication? How is the work synchronized between the processes? There are comments on those individual aspects scattered in the patch, but if you're not already familiar with it, you don't know where to start. There's some of that in the commit message, but it needs to be somewhere in the source code, maybe in a long comment at the top of copyparallel.c.

3. I'm surprised there's a separate ParallelCopyLineBoundary struct for every input line. Doesn't that incur a lot of synchronization overhead? I haven't done any testing, this is just my gut feeling, but I assumed you'd work in batches of, say, 100 or 1000 lines each.

- Heikki
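On point 3, a batched variant of the shared line information might look something like the sketch below; all names and sizes here are hypothetical, meant only to show that one state word could cover many lines instead of one:

    #define LINES_PER_BATCH 1000    /* hypothetical batch size */

    typedef struct ParallelCopyLineBatch
    {
        pg_atomic_uint32 batch_state;   /* populating / populated / processed */
        uint32      nlines;             /* valid entries in the arrays below */
        uint32      first_block;        /* DSM data block of the first line */
        /* per-line positions; synchronized once per batch, not per line */
        uint32      start_offset[LINES_PER_BATCH];
        uint32      line_size[LINES_PER_BATCH];
    } ParallelCopyLineBatch;

With this layout, the leader publishes a whole batch with one atomic state change and a worker claims a whole batch the same way, so the per-line handshakes Heikki is worried about collapse to one per LINES_PER_BATCH lines.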
Hi Vignesh, Thanks for the updated patches. Here are some more comments that I can find after reviewing your latest patches: +/* + * This structure helps in storing the common data from CopyStateData that are + * required by the workers. This information will then be allocated and stored + * into the DSM for the worker to retrieve and copy it to CopyStateData. + */ +typedef struct SerializedParallelCopyState +{ + /* low-level state data */ + CopyDest copy_dest; /* type of copy source/destination */ + int file_encoding; /* file or remote side's character encoding */ + bool need_transcoding; /* file encoding diff from server? */ + bool encoding_embeds_ascii; /* ASCII can be non-first byte? */ + ... ... + + /* Working state for COPY FROM */ + AttrNumber num_defaults; + Oid relid; +} SerializedParallelCopyState; Can the above structure not be part of the CopyStateData structure? I am just asking this question because all the fields present in the above structure are also present in the CopyStateData structure. So, including it in the CopyStateData structure will reduce the code duplication and will also make CopyStateData a bit shorter. -- + pcxt = BeginParallelCopy(cstate->nworkers, cstate, stmt->attlist, + relid); Do we need to pass cstate->nworkers and relid to BeginParallelCopy() function when we are already passing cstate structure, using which both of these information can be retrieved ? -- +/* DSM keys for parallel copy. */ +#define PARALLEL_COPY_KEY_SHARED_INFO 1 +#define PARALLEL_COPY_KEY_CSTATE 2 +#define PARALLEL_COPY_WAL_USAGE 3 +#define PARALLEL_COPY_BUFFER_USAGE 4 DSM key names do not appear to be consistent. For shared info and cstate structures, the key name is prefixed with "PARALLEL_COPY_KEY", but for WalUsage and BufferUsage structures, it is prefixed with "PARALLEL_COPY". I think it would be better to make them consistent. -- if (resultRelInfo->ri_TrigDesc != NULL && (resultRelInfo->ri_TrigDesc->trig_insert_before_row || resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) { /* * Can't support multi-inserts when there are any BEFORE/INSTEAD OF * triggers on the table. Such triggers might query the table we're * inserting into and act differently if the tuples that have already * been processed and prepared for insertion are not there. */ insertMethod = CIM_SINGLE; } else if (proute != NULL && resultRelInfo->ri_TrigDesc != NULL && resultRelInfo->ri_TrigDesc->trig_insert_new_table) { /* * For partitioned tables we can't support multi-inserts when there * are any statement level insert triggers. It might be possible to * allow partitioned tables with such triggers in the future, but for * now, CopyMultiInsertInfoFlush expects that any before row insert * and statement level insert triggers are on the same relation. */ insertMethod = CIM_SINGLE; } else if (resultRelInfo->ri_FdwRoutine != NULL || cstate->volatile_defexprs) { ... ... I think, if possible, all these if-else checks in CopyFrom() can be moved to a single function which can probably be named as IdentifyCopyInsertMethod() and this function can be called in IsParallelCopyAllowed(). This will ensure that in case of Parallel Copy when the leader has performed all these checks, the worker won't do it again. I also feel that it will make the code look a bit cleaner. -- +void +ParallelCopyMain(dsm_segment *seg, shm_toc *toc) +{ ... ... 
+ InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], + &walusage[ParallelWorkerNumber]); + + MemoryContextSwitchTo(oldcontext); + pfree(cstate); + return; +} It seems like you also need to delete the memory context (cstate->copycontext) here. -- +void +ExecBeforeStmtTrigger(CopyState cstate) +{ + EState *estate = CreateExecutorState(); + ResultRelInfo *resultRelInfo; This function has a lot of comments which have been copied as it is from the CopyFrom function, I think it would be good to remove those comments from here and mention that this code changes done in this function has been taken from the CopyFrom function. If any queries people may refer to the CopyFrom function. This will again avoid the unnecessary code in the patch. -- As Heikki rightly pointed out in his previous email, we need some high level description of how Parallel Copy works somewhere in copyparallel.c file. For reference, please see how a brief description about parallel vacuum has been added in the vacuumlazy.c file. * Lazy vacuum supports parallel execution with parallel worker processes. In * a parallel vacuum, we perform both index vacuum and index cleanup with * parallel worker processes. Individual indexes are processed by one vacuum ... ... -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com On Wed, Oct 21, 2020 at 12:08 PM vignesh C <vignesh21@gmail.com> wrote: > > On Mon, Oct 19, 2020 at 2:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Sun, Oct 18, 2020 at 7:47 AM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote: > > > > > > Hi Vignesh, > > > > > > After having a look over the patch, > > > I have some suggestions for > > > 0003-Allow-copy-from-command-to-process-data-from-file.patch. > > > > > > 1. > > > > > > +static uint32 > > > +EstimateCstateSize(ParallelContext *pcxt, CopyState cstate, List *attnamelist, > > > + char **whereClauseStr, char **rangeTableStr, > > > + char **attnameListStr, char **notnullListStr, > > > + char **nullListStr, char **convertListStr) > > > +{ > > > + uint32 strsize = MAXALIGN(sizeof(SerializedParallelCopyState)); > > > + > > > + strsize += EstimateStringSize(cstate->null_print); > > > + strsize += EstimateStringSize(cstate->delim); > > > + strsize += EstimateStringSize(cstate->quote); > > > + strsize += EstimateStringSize(cstate->escape); > > > > > > > > > It use function EstimateStringSize to get the strlen of null_print, delim, quote and escape. > > > But the length of null_print seems has been stored in null_print_len. > > > And delim/quote/escape must be 1 byte, so I think call strlen again seems unnecessary. > > > > > > How about " strsize += sizeof(uint32) + cstate->null_print_len + 1" > > > > > > > +1. This seems like a good suggestion but add comments for > > delim/quote/escape to indicate that we are considering one-byte for > > each. I think this will obviate the need of function > > EstimateStringSize. Another thing in this regard is that we normally > > use add_size function to compute the size but I don't see that being > > used in this and nearby computation. That helps us to detect overflow > > of addition if any. > > > > EstimateCstateSize() > > { > > .. > > + > > + strsize++; > > .. > > } > > > > Why do we need this additional one-byte increment? Does it make sense > > to add a small comment for the same? > > > > Changed it to handle null_print, delim, quote & escape accordingly in > the attached patch, the one byte increment is not required, I have > removed it. 
> > Regards, > Vignesh > EnterpriseDB: http://www.enterprisedb.com
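To illustrate the IdentifyCopyInsertMethod() idea from the review above, here is a hedged sketch of how the CopyFrom() checks could be factored out. It is deliberately simplified: the partitioned-table branch also needs the 'proute' routing state, and CIM_MULTI_CONDITIONAL is left out; CopyInsertMethod and the CIM_* values are the existing enum in copy.c.

/*
 * Sketch only: collect CopyFrom()'s insert-method checks in one place so
 * that the leader can run them once and workers can skip them.
 */
static CopyInsertMethod
IdentifyCopyInsertMethod(CopyState cstate, ResultRelInfo *resultRelInfo)
{
    /* BEFORE/INSTEAD OF row triggers may query the target table. */
    if (resultRelInfo->ri_TrigDesc != NULL &&
        (resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
         resultRelInfo->ri_TrigDesc->trig_insert_instead_row))
        return CIM_SINGLE;

    /* Foreign tables and volatile default expressions rule out batching. */
    if (resultRelInfo->ri_FdwRoutine != NULL ||
        cstate->volatile_defexprs)
        return CIM_SINGLE;

    return CIM_MULTI;
}

IsParallelCopyAllowed() could then refuse parallelism whenever this returns CIM_SINGLE, rather than repeating the individual checks in every worker.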
On Fri, Oct 23, 2020 at 5:42 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > Hi Vignesh, > > Thanks for the updated patches. Here are some more comments that I can > find after reviewing your latest patches: > > +/* > + * This structure helps in storing the common data from CopyStateData that are > + * required by the workers. This information will then be allocated and stored > + * into the DSM for the worker to retrieve and copy it to CopyStateData. > + */ > +typedef struct SerializedParallelCopyState > +{ > + /* low-level state data */ > + CopyDest copy_dest; /* type of copy source/destination */ > + int file_encoding; /* file or remote side's character encoding */ > + bool need_transcoding; /* file encoding diff from server? */ > + bool encoding_embeds_ascii; /* ASCII can be non-first byte? */ > + > ... > ... > + > + /* Working state for COPY FROM */ > + AttrNumber num_defaults; > + Oid relid; > +} SerializedParallelCopyState; > > Can the above structure not be part of the CopyStateData structure? I > am just asking this question because all the fields present in the > above structure are also present in the CopyStateData structure. So, > including it in the CopyStateData structure will reduce the code > duplication and will also make CopyStateData a bit shorter. > > -- > > + pcxt = BeginParallelCopy(cstate->nworkers, cstate, stmt->attlist, > + relid); > > Do we need to pass cstate->nworkers and relid to BeginParallelCopy() > function when we are already passing cstate structure, using which > both of these information can be retrieved ? > > -- > > +/* DSM keys for parallel copy. */ > +#define PARALLEL_COPY_KEY_SHARED_INFO 1 > +#define PARALLEL_COPY_KEY_CSTATE 2 > +#define PARALLEL_COPY_WAL_USAGE 3 > +#define PARALLEL_COPY_BUFFER_USAGE 4 > > DSM key names do not appear to be consistent. For shared info and > cstate structures, the key name is prefixed with "PARALLEL_COPY_KEY", > but for WalUsage and BufferUsage structures, it is prefixed with > "PARALLEL_COPY". I think it would be better to make them consistent. > > -- > > if (resultRelInfo->ri_TrigDesc != NULL && > (resultRelInfo->ri_TrigDesc->trig_insert_before_row || > resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) > { > /* > * Can't support multi-inserts when there are any BEFORE/INSTEAD OF > * triggers on the table. Such triggers might query the table we're > * inserting into and act differently if the tuples that have already > * been processed and prepared for insertion are not there. > */ > insertMethod = CIM_SINGLE; > } > else if (proute != NULL && resultRelInfo->ri_TrigDesc != NULL && > resultRelInfo->ri_TrigDesc->trig_insert_new_table) > { > /* > * For partitioned tables we can't support multi-inserts when there > * are any statement level insert triggers. It might be possible to > * allow partitioned tables with such triggers in the future, but for > * now, CopyMultiInsertInfoFlush expects that any before row insert > * and statement level insert triggers are on the same relation. > */ > insertMethod = CIM_SINGLE; > } > else if (resultRelInfo->ri_FdwRoutine != NULL || > cstate->volatile_defexprs) > { > ... > ... > > I think, if possible, all these if-else checks in CopyFrom() can be > moved to a single function which can probably be named as > IdentifyCopyInsertMethod() and this function can be called in > IsParallelCopyAllowed(). This will ensure that in case of Parallel > Copy when the leader has performed all these checks, the worker won't > do it again. 
I also feel that it will make the code look a bit > cleaner. > Just rewriting above comment to make it a bit more clear: I think, if possible, all these if-else checks in CopyFrom() should be moved to a separate function which can probably be named as IdentifyCopyInsertMethod() and this function called from IsParallelCopyAllowed() and CopyFrom() functions. It will only be called from CopyFrom() when IsParallelCopy() returns false. This will ensure that in case of Parallel Copy if the leader has performed all these checks, the worker won't do it again. I also feel that having a separate function containing all these checks will make the code look a bit cleaner. > -- > > +void > +ParallelCopyMain(dsm_segment *seg, shm_toc *toc) > +{ > ... > ... > + InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], > + &walusage[ParallelWorkerNumber]); > + > + MemoryContextSwitchTo(oldcontext); > + pfree(cstate); > + return; > +} > > It seems like you also need to delete the memory context > (cstate->copycontext) here. > > -- > > +void > +ExecBeforeStmtTrigger(CopyState cstate) > +{ > + EState *estate = CreateExecutorState(); > + ResultRelInfo *resultRelInfo; > > This function has a lot of comments which have been copied as it is > from the CopyFrom function, I think it would be good to remove those > comments from here and mention that this code changes done in this > function has been taken from the CopyFrom function. If any queries > people may refer to the CopyFrom function. This will again avoid the > unnecessary code in the patch. > > -- > > As Heikki rightly pointed out in his previous email, we need some high > level description of how Parallel Copy works somewhere in > copyparallel.c file. For reference, please see how a brief description > about parallel vacuum has been added in the vacuumlazy.c file. > > * Lazy vacuum supports parallel execution with parallel worker processes. In > * a parallel vacuum, we perform both index vacuum and index cleanup with > * parallel worker processes. Individual indexes are processed by one vacuum > ... > ... > > -- > With Regards, > Ashutosh Sharma > EnterpriseDB:http://www.enterprisedb.com > > > On Wed, Oct 21, 2020 at 12:08 PM vignesh C <vignesh21@gmail.com> wrote: > > > > On Mon, Oct 19, 2020 at 2:40 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Sun, Oct 18, 2020 at 7:47 AM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote: > > > > > > > > Hi Vignesh, > > > > > > > > After having a look over the patch, > > > > I have some suggestions for > > > > 0003-Allow-copy-from-command-to-process-data-from-file.patch. > > > > > > > > 1. > > > > > > > > +static uint32 > > > > +EstimateCstateSize(ParallelContext *pcxt, CopyState cstate, List *attnamelist, > > > > + char **whereClauseStr, char **rangeTableStr, > > > > + char **attnameListStr, char **notnullListStr, > > > > + char **nullListStr, char **convertListStr) > > > > +{ > > > > + uint32 strsize = MAXALIGN(sizeof(SerializedParallelCopyState)); > > > > + > > > > + strsize += EstimateStringSize(cstate->null_print); > > > > + strsize += EstimateStringSize(cstate->delim); > > > > + strsize += EstimateStringSize(cstate->quote); > > > > + strsize += EstimateStringSize(cstate->escape); > > > > > > > > > > > > It use function EstimateStringSize to get the strlen of null_print, delim, quote and escape. > > > > But the length of null_print seems has been stored in null_print_len. > > > > And delim/quote/escape must be 1 byte, so I think call strlen again seems unnecessary. 
> > > > > > > > How about " strsize += sizeof(uint32) + cstate->null_print_len + 1" > > > > > > > > > > +1. This seems like a good suggestion but add comments for > > > delim/quote/escape to indicate that we are considering one-byte for > > > each. I think this will obviate the need of function > > > EstimateStringSize. Another thing in this regard is that we normally > > > use add_size function to compute the size but I don't see that being > > > used in this and nearby computation. That helps us to detect overflow > > > of addition if any. > > > > > > EstimateCstateSize() > > > { > > > .. > > > + > > > + strsize++; > > > .. > > > } > > > > > > Why do we need this additional one-byte increment? Does it make sense > > > to add a small comment for the same? > > > > > > > Changed it to handle null_print, delim, quote & escape accordingly in > > the attached patch, the one byte increment is not required, I have > > removed it. > > > > Regards, > > Vignesh > > EnterpriseDB: http://www.enterprisedb.com
Thanks for the comments, please find my thoughts below. On Wed, Oct 21, 2020 at 3:19 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > Hi Vignesh, > > I took a look at the v8 patch set. Here are some comments: > > 1. PopulateCommonCstateInfo() -- can we use PopulateCommonCStateInfo() > or PopulateCopyStateInfo()? And also EstimateCstateSize() -- > EstimateCStateSize(), PopulateCstateCatalogInfo() -- > PopulateCStateCatalogInfo()? > Changed as suggested. > 2. Instead of mentioning numbers like 1024, 64K, 10240 in the > comments, can we represent them in terms of macros? > /* It can hold 1024 blocks of 64K data in DSM to be processed by the worker. */ > #define MAX_BLOCKS_COUNT 1024 > /* > * It can hold upto 10240 record information for worker to process. RINGSIZE > Changed as suggested. > 3. How about > " > Each worker at once will pick the WORKER_CHUNK_COUNT records from the > DSM data blocks and store them in it's local memory. > This is to make workers not contend much while getting record > information from the DSM. Read RINGSIZE comments before > changing this value. > " > instead of > /* > * Each worker will be allocated WORKER_CHUNK_COUNT of records from DSM data > * block to process to avoid lock contention. Read RINGSIZE comments before > * changing this value. > */ > Rephrased it. > 4. How about one line gap before and after for comments: "Leader > should operate in the following order:" and "Worker should operate in > the following order:" > Changed it. > 5. Can we move RAW_BUF_BYTES macro definition to the beginning of the > copy.h where all the macro are defined? > Change was done as part of another commit & we are using as it is. I preferred it to be as it is. > 6. I don't think we need the change in toast_internals.c with the > temporary hack Assert(!(IsParallelWorker() && !currentCommandIdUsed)); > in GetCurrentCommandId() > Modified it. > 7. I think > /* Can't perform copy in parallel */ > if (parallel_workers <= 0) > return NULL; > can be > /* Can't perform copy in parallel */ > if (parallel_workers == 0) > return NULL; > as parallel_workers can never be < 0 since we enter BeginParallelCopy > only if cstate->nworkers > 0 and also we are not allowed to have > negative values for max_worker_processes. > Modified it. > 8. Do we want to pfree(cstate->pcdata) in case we failed to start any > parallel workers, we would have allocated a good > else > { > /* > * Reset nworkers to -1 here. This is useful in cases where user > * specifies parallel workers, but, no worker is picked up, so go > * back to non parallel mode value of nworkers. > */ > cstate->nworkers = -1; > *processed = CopyFrom(cstate); /* copy from file to database */ > } > Added pfree. > 9. Instead of calling CopyStringToSharedMemory() for each string > variable, can't we just create a linked list of all the strings that > need to be copied into shm and call CopyStringToSharedMemory() only > once? We could avoid 5 function calls? > I feel keeping it this way makes the code more readable, and also this is not in a performance intensive tight loop. I'm retaining the change as is unless we feel this will make an impact. > 10. Similar to above comment: can we fill all the required > cstate->variables inside the function CopyNodeFromSharedMemory() and > call it only once? In each worker we could save overhead of 5 function > calls. > same as above. > 11. Looks like CopyStringFromSharedMemory() and > CopyNodeFromSharedMemory() do almost the same things except > stringToNode() and pfree(destptr);. 
Can we have a generic function > CopyFromSharedMemory() or something else and handle with flag "bool > isnode" to differentiate the two use cases? > Removed CopyStringFromSharedMemory & used CopyNodeFromSharedMemory appropriately. CopyNodeFromSharedMemory is renamed to RestoreNodeFromSharedMemory to keep the name consistent. > 12. Can we move below check to the end in IsParallelCopyAllowed()? > /* Check parallel safety of the trigger functions. */ > if (cstate->rel->trigdesc != NULL && > !CheckRelTrigFunParallelSafety(cstate->rel->trigdesc)) > return false; > Modified. > 13. CacheLineInfo(): Instead of goto empty_data_line_update; how about > having this directly inside the if block as it's being used only once? > Have removed the goto by using a macro. > 14. GetWorkerLine(): How about avoiding goto statements and replacing > the common code with a always static inline function or a macro? > Have removed the goto by using a macro. > 15. UpdateSharedLineInfo(): Below line is misaligned. > lineInfo->first_block = blk_pos; > lineInfo->start_offset = offset; > Changed it. > 16. ParallelCopyFrom(): Do we need CHECK_FOR_INTERRUPTS(); at the > start of for (;;)? > Added it. > 17. Remove extra lines after #define IsHeaderLine() > (cstate->header_line && cstate->cur_lineno == 1) in copy.h > Modified it. Attached v9 patches have the fixes for the above comments. Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
On Wed, Oct 21, 2020 at 3:50 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Oct 21, 2020 at 3:19 PM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > > > 9. Instead of calling CopyStringToSharedMemory() for each string > > variable, can't we just create a linked list of all the strings that > > need to be copied into shm and call CopyStringToSharedMemory() only > > once? We could avoid 5 function calls? > > > > If we want to avoid different function calls then can't we just store > all these strings in a local structure and use it? That might improve > the other parts of code as well where we are using these as individual > parameters. > I have made one structure SerializedListToStrCState to store all the variables. The rest of the common variables are directly copied from & into cstate. > > 10. Similar to above comment: can we fill all the required > > cstate->variables inside the function CopyNodeFromSharedMemory() and > > call it only once? In each worker we could save overhead of 5 function > > calls. > > > > Yeah, that makes sense. > I feel keeping it this way makes the code more readable, and also this is not in a performance intensive tight loop. I'm retaining the change as is unless we feel this will make an impact. This is addressed in v9 patch shared at [1]. [1] - https://www.postgresql.org/message-id/CALDaNm1cAONkFDN6K72DSiRpgqNGvwxQL7TjEiHZ58opnp9VoA@mail.gmail.com Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
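For reference, a hedged guess at what that grouping structure might look like, based on the string parameters that EstimateCstateSize() takes earlier in this thread; the actual fields in the patch may differ:

/*
 * Hypothetical layout: one struct holding all the serialized node/list
 * strings, passed around as a unit instead of as five or six parameters.
 */
typedef struct SerializedListToStrCState
{
    char       *whereClauseStr;     /* serialized WHERE clause */
    char       *rangeTableStr;      /* serialized range table */
    char       *attnameListStr;     /* serialized attribute name list */
    char       *notnullListStr;     /* serialized FORCE_NOT_NULL list */
    char       *nullListStr;        /* serialized FORCE_NULL list */
    char       *convertListStr;     /* serialized convert-select list */
} SerializedListToStrCState;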
On Wed, Oct 21, 2020 at 4:20 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Wed, Oct 21, 2020 at 3:18 PM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > 17. Remove extra lines after #define IsHeaderLine() > > (cstate->header_line && cstate->cur_lineno == 1) in copy.h > > > > I missed one comment: > > 18. I think we need to treat the number of parallel workers as an > integer similar to the parallel option in vacuum. > > postgres=# copy t1 from stdin with(parallel '1'); <<<<< - we > should not allow this. > Enter data to be copied followed by a newline. > > postgres=# vacuum (parallel '1') t1; > ERROR: parallel requires an integer value > I have made the behavior the same as vacuum. This is addressed in v9 patch shared at [1]. [1] - https://www.postgresql.org/message-id/CALDaNm1cAONkFDN6K72DSiRpgqNGvwxQL7TjEiHZ58opnp9VoA@mail.gmail.com Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
Thanks Heikki for reviewing and providing your comments. Please find my thoughts below. On Fri, Oct 23, 2020 at 2:01 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > I had a brief look at at this patch. Important work! A couple of first > impressions: > > 1. The split between patches > 0002-Framework-for-leader-worker-in-parallel-copy.patch and > 0003-Allow-copy-from-command-to-process-data-from-file.patch is quite > artificial. All the stuff introduced in the first is unused until the > second patch is applied. The first patch introduces a forward > declaration for ParallelCopyData(), but the function only comes in the > second patch. The comments in the first patch talk about > LINE_LEADER_POPULATING and LINE_LEADER_POPULATED, but the enum only > comes in the second patch. I think these have to merged into one. If you > want to split it somehow, I'd suggest having a separate patch just to > move CopyStateData from copy.c to copy.h. The subsequent patch would > then be easier to read as you could see more easily what's being added > to CopyStateData. Actually I think it would be better to have a new > header file, copy_internal.h, to hold CopyStateData and the other > structs, and keep copy.h as it is. > I have merged 0002 & 0003 patch, I have moved few things like creation of copy_internal.h, moving of CopyStateData from copy.c into copy_internal.h into 0001 patch. > 2. This desperately needs some kind of a high-level overview of how it > works. What is a leader, what is a worker? Which process does each step > of COPY processing, like reading from the file/socket, splitting the > input into lines, handling escapes, calling input functions, and > updating the heap and indexes? What data structures are used for the > communication? How does is the work synchronized between the processes? > There are comments on those individual aspects scattered in the patch, > but if you're not already familiar with it, you don't know where to > start. There's some of that in the commit message, but it needs to be > somewhere in the source code, maybe in a long comment at the top of > copyparallel.c. > Added it in copyparallel.c > 3. I'm surprised there's a separate ParallelCopyLineBoundary struct for > every input line. Doesn't that incur a lot of synchronization overhead? > I haven't done any testing, this is just my gut feeling, but I assumed > you'd work in batches of, say, 100 or 1000 lines each. > Data read from the file will be stored in DSM which is of size 64k * 1024. Leader will parse and identify the line boundary like which line starts from which data block, what is the starting offset in the data block, what is the line size, this information will be present in ParallelCopyLineBoundary. Like you said, each worker processes WORKER_CHUNK_COUNT 64 lines at a time. Performance test results run for parallel copy are available at [1]. This is addressed in v9 patch shared at [2]. [1] https://www.postgresql.org/message-id/CALj2ACWeQVd-xoQZHGT01_33St4xPoZQibWz46o7jW1PE3XOqQ%40mail.gmail.com [2] - https://www.postgresql.org/message-id/CALDaNm1cAONkFDN6K72DSiRpgqNGvwxQL7TjEiHZ58opnp9VoA@mail.gmail.com Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
Thanks Ashutosh for reviewing and providing your comments. On Fri, Oct 23, 2020 at 5:43 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > Hi Vignesh, > > Thanks for the updated patches. Here are some more comments that I can > find after reviewing your latest patches: > > +/* > + * This structure helps in storing the common data from CopyStateData that are > + * required by the workers. This information will then be allocated and stored > + * into the DSM for the worker to retrieve and copy it to CopyStateData. > + */ > +typedef struct SerializedParallelCopyState > +{ > + /* low-level state data */ > + CopyDest copy_dest; /* type of copy source/destination */ > + int file_encoding; /* file or remote side's character encoding */ > + bool need_transcoding; /* file encoding diff from server? */ > + bool encoding_embeds_ascii; /* ASCII can be non-first byte? */ > + > ... > ... > + > + /* Working state for COPY FROM */ > + AttrNumber num_defaults; > + Oid relid; > +} SerializedParallelCopyState; > > Can the above structure not be part of the CopyStateData structure? I > am just asking this question because all the fields present in the > above structure are also present in the CopyStateData structure. So, > including it in the CopyStateData structure will reduce the code > duplication and will also make CopyStateData a bit shorter. > I have removed the common members from the structure, now there are no common members between CopyStateData & the new structure. I'm using CopyStateData to copy to/from directly in the new patch. > -- > > + pcxt = BeginParallelCopy(cstate->nworkers, cstate, stmt->attlist, > + relid); > > Do we need to pass cstate->nworkers and relid to BeginParallelCopy() > function when we are already passing cstate structure, using which > both of these information can be retrieved ? > nworkers need not be passed as you have suggested but relid need to be passed as we will be setting it to pcdata, modified nworkers as suggested. > -- > > +/* DSM keys for parallel copy. */ > +#define PARALLEL_COPY_KEY_SHARED_INFO 1 > +#define PARALLEL_COPY_KEY_CSTATE 2 > +#define PARALLEL_COPY_WAL_USAGE 3 > +#define PARALLEL_COPY_BUFFER_USAGE 4 > > DSM key names do not appear to be consistent. For shared info and > cstate structures, the key name is prefixed with "PARALLEL_COPY_KEY", > but for WalUsage and BufferUsage structures, it is prefixed with > "PARALLEL_COPY". I think it would be better to make them consistent. > Modified as suggested > -- > > if (resultRelInfo->ri_TrigDesc != NULL && > (resultRelInfo->ri_TrigDesc->trig_insert_before_row || > resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) > { > /* > * Can't support multi-inserts when there are any BEFORE/INSTEAD OF > * triggers on the table. Such triggers might query the table we're > * inserting into and act differently if the tuples that have already > * been processed and prepared for insertion are not there. > */ > insertMethod = CIM_SINGLE; > } > else if (proute != NULL && resultRelInfo->ri_TrigDesc != NULL && > resultRelInfo->ri_TrigDesc->trig_insert_new_table) > { > /* > * For partitioned tables we can't support multi-inserts when there > * are any statement level insert triggers. It might be possible to > * allow partitioned tables with such triggers in the future, but for > * now, CopyMultiInsertInfoFlush expects that any before row insert > * and statement level insert triggers are on the same relation. 
> */ > insertMethod = CIM_SINGLE; > } > else if (resultRelInfo->ri_FdwRoutine != NULL || > cstate->volatile_defexprs) > { > ... > ... > > I think, if possible, all these if-else checks in CopyFrom() can be > moved to a single function which can probably be named as > IdentifyCopyInsertMethod() and this function can be called in > IsParallelCopyAllowed(). This will ensure that in case of Parallel > Copy when the leader has performed all these checks, the worker won't > do it again. I also feel that it will make the code look a bit > cleaner. > In the recent patch posted we have changed it to simplify the check for parallel copy, it is not an exact match. I feel this comment is not applicable on the latest patch > -- > > +void > +ParallelCopyMain(dsm_segment *seg, shm_toc *toc) > +{ > ... > ... > + InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], > + &walusage[ParallelWorkerNumber]); > + > + MemoryContextSwitchTo(oldcontext); > + pfree(cstate); > + return; > +} > > It seems like you also need to delete the memory context > (cstate->copycontext) here. > Added it. > -- > > +void > +ExecBeforeStmtTrigger(CopyState cstate) > +{ > + EState *estate = CreateExecutorState(); > + ResultRelInfo *resultRelInfo; > > This function has a lot of comments which have been copied as it is > from the CopyFrom function, I think it would be good to remove those > comments from here and mention that this code changes done in this > function has been taken from the CopyFrom function. If any queries > people may refer to the CopyFrom function. This will again avoid the > unnecessary code in the patch. > Changed as suggested. > -- > > As Heikki rightly pointed out in his previous email, we need some high > level description of how Parallel Copy works somewhere in > copyparallel.c file. For reference, please see how a brief description > about parallel vacuum has been added in the vacuumlazy.c file. > > * Lazy vacuum supports parallel execution with parallel worker processes. In > * a parallel vacuum, we perform both index vacuum and index cleanup with > * parallel worker processes. Individual indexes are processed by one vacuum > ... Added it in copyparallel.c This is addressed in v9 patch shared at [1]. [1] - https://www.postgresql.org/message-id/CALDaNm1cAONkFDN6K72DSiRpgqNGvwxQL7TjEiHZ58opnp9VoA@mail.gmail.com Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
On Fri, Oct 23, 2020 at 6:58 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote: > > > > > I think, if possible, all these if-else checks in CopyFrom() can be > > moved to a single function which can probably be named as > > IdentifyCopyInsertMethod() and this function can be called in > > IsParallelCopyAllowed(). This will ensure that in case of Parallel > > Copy when the leader has performed all these checks, the worker won't > > do it again. I also feel that it will make the code look a bit > > cleaner. > > > > Just rewriting above comment to make it a bit more clear: > > I think, if possible, all these if-else checks in CopyFrom() should be > moved to a separate function which can probably be named as > IdentifyCopyInsertMethod() and this function called from > IsParallelCopyAllowed() and CopyFrom() functions. It will only be > called from CopyFrom() when IsParallelCopy() returns false. This will > ensure that in case of Parallel Copy if the leader has performed all > these checks, the worker won't do it again. I also feel that having a > separate function containing all these checks will make the code look > a bit cleaner. > In the recent patch posted we have changed it to simplify the check for parallel copy, it is not an exact match. I feel this comment is not applicable on the latest patch Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
Hi,

I found some issues in v9-0002.

1.
+ elog(DEBUG1, "[Worker] Processing - line position:%d, block:%d, unprocessed lines:%d, offset:%d, line size:%d",
+      write_pos, lineInfo->first_block,
+      pg_atomic_read_u32(&data_blk_ptr->unprocessed_line_parts),
+      offset, pg_atomic_read_u32(&lineInfo->line_size));

write_pos and the other variables printed here are of type uint32, so I think it's better to use '%u' in the elog message.

2.
+ * line_size will be set. Read the line_size again to be sure if it is
+ * completed or partial block.
+ */
+ dataSize = pg_atomic_read_u32(&lineInfo->line_size);
+ if (dataSize)

This uses dataSize (type int) to hold a uint32, which seems a little dangerous. Would it be better to declare dataSize as uint32 here?

3. Since functions with 'Cstate' in the name have been changed to 'CState', I think we can change the function PopulateCommonCstateInfo as well.

4.
+ if (pcdata->worker_line_buf_count)

I think checks like the above can be written as 'if (xxx > 0)', which seems easier to understand.

Best regards,
houzj
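For points 1 and 2, the corrected code might look roughly like this (a sketch reusing the variables from the quoted patch, so treat it as illustrative only):

/* Point 2: dataSize declared as uint32 to match pg_atomic_read_u32(). */
uint32      dataSize = pg_atomic_read_u32(&lineInfo->line_size);

/* Point 1: %u, not %d, for uint32 values. */
elog(DEBUG1, "[Worker] Processing - line position:%u, block:%u, unprocessed lines:%u, offset:%u, line size:%u",
     write_pos, lineInfo->first_block,
     pg_atomic_read_u32(&data_blk_ptr->unprocessed_line_parts),
     offset, dataSize);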
On Tue, Oct 27, 2020 at 7:06 PM vignesh C <vignesh21@gmail.com> wrote: > [latest version] I think the parallel-safety checks in this patch (v9-0002-Allow-copy-from-command-to-process-data-from-file) are incomplete and wrong. See below comments. 1. +static pg_attribute_always_inline bool +CheckExprParallelSafety(CopyState cstate) +{ + if (contain_volatile_functions(cstate->whereClause)) + { + if (max_parallel_hazard((Query *) cstate->whereClause) != PROPARALLEL_SAFE) + return false; + } I don't understand the above check. Why do we only need to check where clause for parallel-safety when it contains volatile functions? It should be checked otherwise as well, no? The similar comment applies to other checks in this function. Also, I don't think there is a need to make this function inline. 2. +/* + * IsParallelCopyAllowed + * + * Check if parallel copy can be allowed. + */ +bool +IsParallelCopyAllowed(CopyState cstate) { .. + * When there are BEFORE/AFTER/INSTEAD OF row triggers on the table. We do + * not allow parallelism in such cases because such triggers might query + * the table we are inserting into and act differently if the tuples that + * have already been processed and prepared for insertion are not there. + * Now, if we allow parallelism with such triggers the behaviour would + * depend on if the parallel worker has already inserted or not that + * particular tuples. + */ + if (cstate->rel->trigdesc != NULL && + (cstate->rel->trigdesc->trig_insert_after_statement || + cstate->rel->trigdesc->trig_insert_new_table || + cstate->rel->trigdesc->trig_insert_before_row || + cstate->rel->trigdesc->trig_insert_after_row || + cstate->rel->trigdesc->trig_insert_instead_row)) + return false; .. Why do we need to disable parallelism for before/after row triggers unless they have parallel-unsafe functions? I see a few lines down in this function you are checking parallel-safety of trigger functions, what is the use of the same if you are already disabling parallelism with the above check. 3. What about if the index on table has expressions that are parallel-unsafe? What is your strategy to check parallel-safety for partitioned tables? I suggest checking Greg's patch for parallel-safety of Inserts [1]. I think you will find that most of those checks are required here as well and see how we can use that patch (at least what is common). I feel the first patch should be just to have parallel-safety checks and we can test that by trying to enable Copy with force_parallel_mode. We can build the rest of the patch atop of it or in other words, let's move all parallel-safety work into a separate patch. Few assorted comments: ======================== 1. +/* + * ESTIMATE_NODE_SIZE - Estimate the size required for node type in shared + * memory. + */ +#define ESTIMATE_NODE_SIZE(list, listStr, strsize) \ +{ \ + uint32 estsize = sizeof(uint32); \ + if ((List *)list != NIL) \ + { \ + listStr = nodeToString(list); \ + estsize += strlen(listStr) + 1; \ + } \ + \ + strsize = add_size(strsize, estsize); \ +} This can be probably a function instead of a macro. 2. +/* + * ESTIMATE_1BYTE_STR_SIZE - Estimate the size required for 1Byte strings in + * shared memory. + */ +#define ESTIMATE_1BYTE_STR_SIZE(src, strsize) \ +{ \ + strsize = add_size(strsize, sizeof(uint8)); \ + strsize = add_size(strsize, (src) ? 1 : 0); \ +} This could be an inline function. 3. +/* + * SERIALIZE_1BYTE_STR - Copy 1Byte strings to shared memory. + */ +#define SERIALIZE_1BYTE_STR(dest, src, copiedsize) \ +{ \ + uint8 len = (src) ? 
1 : 0; \ + memcpy(dest + copiedsize, (uint8 *) &len, sizeof(uint8)); \ + copiedsize += sizeof(uint8); \ + if (src) \ + dest[copiedsize++] = src[0]; \ +} Similarly, this could be a function. I think keeping such things as macros in-between code makes it difficult to read. Please see if you can make these and similar macros as functions unless they are doing few memory instructions. Having functions makes it easier to debug the code as well. [1] - https://www.postgresql.org/message-id/CAJcOf-cgfjj0NfYPrNFGmQJxsnNW102LTXbzqxQJuziar1EKfQ%40mail.gmail.com -- With Regards, Amit Kapila.
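As an illustration of that last suggestion, a hedged sketch of the 1-byte-string macros rewritten as inline functions; the names and signatures are mine, not the patch's (add_size() is the overflow-checked addition mentioned above, declared in storage/shmem.h):

static inline Size
EstimateOneByteStrSize(const char *src, Size strsize)
{
    /* One length byte, plus the single character if the string exists. */
    strsize = add_size(strsize, sizeof(uint8));
    strsize = add_size(strsize, (src != NULL) ? 1 : 0);
    return strsize;
}

static inline Size
SerializeOneByteStr(char *dest, const char *src, Size copiedsize)
{
    uint8       len = (src != NULL) ? 1 : 0;

    memcpy(dest + copiedsize, &len, sizeof(uint8));
    copiedsize += sizeof(uint8);
    if (src != NULL)
        dest[copiedsize++] = src[0];
    return copiedsize;
}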
On 27/10/2020 15:36, vignesh C wrote:
> Attached v9 patches have the fixes for the above comments.

I did some testing:

/tmp/longdata.pl:
--------
#!/usr/bin/perl
#
# Generate three rows:
# foo
# longdatalongdatalongdata...
# bar
#
# The length of the middle row is given as command line arg.
#

my $bytes = $ARGV[0];

print "foo\n";
for(my $i = 0; $i < $bytes; $i+=8){
    print "longdata";
}
print "\n";
print "bar\n";
--------

postgres=# copy longdata from program 'perl /tmp/longdata.pl 100000000' with (parallel 2);

This gets stuck forever (or at least I didn't have the patience to wait for it to finish). Both worker processes are consuming 100% of CPU.

- Heikki
On 27/10/2020 15:36, vignesh C wrote:
>> Attached v9 patches have the fixes for the above comments.
> I did some testing:

I did some testing as well and have a cosmetic remark:

postgres=# copy t1 from '/var/tmp/aa.txt' with (parallel 1000000000);
ERROR: value 1000000000 out of bounds for option "parallel"
DETAIL: Valid values are between "1" and "1024".
postgres=# copy t1 from '/var/tmp/aa.txt' with (parallel 100000000000);
ERROR: parallel requires an integer value
postgres=#

Wouldn't it make more sense to only have one error message? The first one seems to be the better message.

Regards
Daniel
On Thu, Oct 29, 2020 at 11:45 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Oct 27, 2020 at 7:06 PM vignesh C <vignesh21@gmail.com> wrote: > > > [latest version] > > I think the parallel-safety checks in this patch > (v9-0002-Allow-copy-from-command-to-process-data-from-file) are > incomplete and wrong. > One more point, I have noticed that some time back [1], I have given one suggestion related to the way workers process the set of lines (aka chunk). I think you can try by increasing the chunk size to say 100, 500, 1000 and use some shared counter to remember the number of chunks processed. [1] - https://www.postgresql.org/message-id/CAA4eK1L-Xgw1zZEbGePmhBBWmEmLFL6rCaiOMDPnq2GNMVz-sg%40mail.gmail.com -- With Regards, Amit Kapila.
On 27/10/2020 15:36, vignesh C wrote: > Attached v9 patches have the fixes for the above comments. I find this design to be very complicated. Why does the line-boundary information need to be in shared memory? I think this would be much simpler if each worker grabbed a fixed-size block of raw data, and processed that. In your patch, the leader process scans the input to find out where one line ends and another begins, and because of that decision, the leader needs to make the line boundaries available in shared memory, for the worker processes. If we moved that responsibility to the worker processes, you wouldn't need to keep the line boundaries in shared memory. A worker would only need to pass enough state to the next worker to tell it where to start scanning the next block. Whether the leader process finds the EOLs or the worker processes, it's pretty clear that it needs to be done ASAP, for a chunk at a time, because that cannot be done in parallel. I think some refactoring in CopyReadLine() and friends would be in order. It probably would be faster, or at least not slower, to find all the EOLs in a block in one tight loop, even when parallel copy is not used. - Heikki
On 30/10/2020 18:36, Heikki Linnakangas wrote:
> I find this design to be very complicated. Why does the line-boundary
> information need to be in shared memory? I think this would be much
> simpler if each worker grabbed a fixed-size block of raw data, and
> processed that.
>
> In your patch, the leader process scans the input to find out where one
> line ends and another begins, and because of that decision, the leader
> needs to make the line boundaries available in shared memory, for the
> worker processes. If we moved that responsibility to the worker
> processes, you wouldn't need to keep the line boundaries in shared
> memory. A worker would only need to pass enough state to the next worker
> to tell it where to start scanning the next block.

Here's a high-level sketch of how I'm imagining this to work:

The shared memory structure consists of a queue of blocks, arranged as a ring buffer. Each block is of fixed size, and contains 64 kB of data, and a few fields for coordination:

typedef struct
{
    /* Current state of the block */
    pg_atomic_uint32 state;

    /* starting offset of first line within the block */
    int startpos;

    char data[64 kB];
} ParallelCopyDataBlock;

Where state is one of:

enum {
    FREE,        /* buffer is empty */
    FILLED,      /* leader has filled the buffer with raw data */
    READY,       /* start pos has been filled in, but no worker process has claimed the block yet */
    PROCESSING,  /* worker has claimed the block, and is processing it */
}

State changes FREE -> FILLED -> READY -> PROCESSING -> FREE. As the COPY progresses, the ring of blocks will always look something like this:

blk 0 startpos  0: PROCESSING [worker 1]
blk 1 startpos 12: PROCESSING [worker 2]
blk 2 startpos 10: READY
blk 3 startpos  -: FILLED
blk 4 startpos  -: FILLED
blk 5 startpos  -: FILLED
blk 6 startpos  -: FREE
blk 7 startpos  -: FREE

Typically, each worker process is busy processing a block. After the blocks being processed, there is one block in READY state, and after that, blocks in FILLED state.

Leader process:

The leader process is simple. It picks the next FREE buffer, fills it with raw data from the file, and marks it as FILLED. If no buffers are FREE, wait.

Worker process:

1. Claim next READY block from queue, by changing its state to PROCESSING. If the next block is not READY yet, wait until it is.

2. Start scanning the block from 'startpos', finding end-of-line markers. (in CSV mode, need to track when we're in-quotes).

3. When you reach the end of the block, if the last line continues to next block, wait for the next block to become FILLED. Peek into the next block, and copy the remaining part of the split line to a local buffer, and set the 'startpos' on the next block to point to the end of the split line. Mark the next block as READY.

4. Process all the lines in the block, call input functions, insert rows.

5. Mark the block as DONE.

In this design, you don't need to keep line boundaries in shared memory, because each worker process is responsible for finding the line boundaries of its own block.

There's a point of serialization here, in that the next block cannot be processed, until the worker working on the previous block has finished scanning the EOLs, and set the starting position on the next block, putting it in READY state. That's not very different from your patch, where you had a similar point of serialization because the leader scanned the EOLs, but I think the coordination between processes is simpler here.

- Heikki
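To make step 1 concrete, a minimal sketch of how a worker could claim blocks using PostgreSQL's atomics. The shared struct, the ring size, and the ticket scheme are additions for illustration only, and "marking the block as DONE" in step 5 corresponds to setting it back to FREE, per the state-change line above:

#define NBLOCKS 1024                /* assumed ring size */

typedef struct
{
    pg_atomic_uint64 next_block;    /* next block index to hand out */
    ParallelCopyDataBlock blocks[NBLOCKS];  /* the ring described above */
} ParallelCopyShared;

static ParallelCopyDataBlock *
ClaimNextReadyBlock(ParallelCopyShared *shared)
{
    /* Take a ticket so that each worker claims a distinct block. */
    uint64      idx = pg_atomic_fetch_add_u64(&shared->next_block, 1);
    ParallelCopyDataBlock *blk = &shared->blocks[idx % NBLOCKS];

    /* Wait for the block to become READY, then flip it to PROCESSING. */
    for (;;)
    {
        uint32      expected = READY;

        if (pg_atomic_compare_exchange_u32(&blk->state, &expected, PROCESSING))
            return blk;

        /*
         * Not READY yet.  A real implementation would sleep on a condition
         * variable here rather than busy-wait.
         */
        CHECK_FOR_INTERRUPTS();
    }
}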
On 30/10/2020 18:36, Heikki Linnakangas wrote:
> Whether the leader process finds the EOLs or the worker processes, it's
> pretty clear that it needs to be done ASAP, for a chunk at a time,
> because that cannot be done in parallel. I think some refactoring in
> CopyReadLine() and friends would be in order. It probably would be
> faster, or at least not slower, to find all the EOLs in a block in one
> tight loop, even when parallel copy is not used.

Something like the attached. It passes the regression tests, but it's quite incomplete. It's missing handling of "\." as an end-of-file marker, and I haven't tested encoding conversions at all, for starters.

Quick testing suggests that this is a little bit faster than the current code, but the difference is small; I had to use a "WHERE false" option to really see the difference.

The crucial thing here is that there's a new function, ParseLinesText(), to find all end-of-line characters in a buffer in one go. In this patch, it's used against 'raw_buf', but with parallel copy, you could point it at a block in shared memory instead.

- Heikki
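The attached patch itself isn't reproduced here, but the core idea can be sketched as follows; only the function name comes from the description above, while the signature and the offset-array output are assumptions:

/*
 * Record the offset of every newline in 'buf' in one tight pass, using
 * memchr().  Real COPY text mode also has to handle backslash escapes,
 * \r\n and encoding conversion, all ignored in this sketch.
 */
static int
ParseLinesText(const char *buf, int len, int *eol_offsets, int max_eols)
{
    int         neols = 0;
    const char *p = buf;
    const char *end = buf + len;

    while (neols < max_eols)
    {
        const char *nl = memchr(p, '\n', end - p);

        if (nl == NULL)
            break;
        eol_offsets[neols++] = (int) (nl - buf);
        p = nl + 1;
    }
    return neols;
}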
Hi,

I've done a bit more testing today, and I think the parsing is busted in some way. Consider this:

test=# create extension random;
CREATE EXTENSION
test=# create table t (a text);
CREATE TABLE
test=# insert into t select random_string(random_int(10, 256*1024)) from generate_series(1,10000);
INSERT 0 10000
test=# copy t to '/mnt/data/t.csv';
COPY 10000
test=# truncate t;
TRUNCATE TABLE
test=# copy t from '/mnt/data/t.csv';
COPY 10000
test=# truncate t;
TRUNCATE TABLE
test=# copy t from '/mnt/data/t.csv' with (parallel 2);
ERROR: invalid byte sequence for encoding "UTF8": 0x00
CONTEXT: COPY t, line 485: "m&\nh%_a"%r]>qtCl:Q5ltvF~;2oS6@HB>F>og,bD$Lw'nZY\tYl#BH\t{(j~ryoZ08"SGU~.}8CcTRk1\ts$@U3szCC+U1U3i@P..."
parallel worker

The functions come from an extension I use to generate random data, I've pushed it to github [1]. The random_string() function generates a random string with ASCII characters, symbols and a couple of special characters (\r\n\t). The intent was to try loading data where a field may span multiple 64kB blocks and may contain newlines etc.

The non-parallel copy works fine, the parallel one fails. I haven't investigated the details, but I guess it gets confused about where a string starts/ends, or something like that.

[1] https://github.com/tvondra/random

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Oct 30, 2020 at 06:41:41PM +0200, Heikki Linnakangas wrote: >On 30/10/2020 18:36, Heikki Linnakangas wrote: >>I find this design to be very complicated. Why does the line-boundary >>information need to be in shared memory? I think this would be much >>simpler if each worker grabbed a fixed-size block of raw data, and >>processed that. >> >>In your patch, the leader process scans the input to find out where one >>line ends and another begins, and because of that decision, the leader >>needs to make the line boundaries available in shared memory, for the >>worker processes. If we moved that responsibility to the worker >>processes, you wouldn't need to keep the line boundaries in shared >>memory. A worker would only need to pass enough state to the next worker >>to tell it where to start scanning the next block. > >Here's a high-level sketch of how I'm imagining this to work: > >The shared memory structure consists of a queue of blocks, arranged as >a ring buffer. Each block is of fixed size, and contains 64 kB of >data, and a few fields for coordination: > >typedef struct >{ > /* Current state of the block */ > pg_atomic_uint32 state; > > /* starting offset of first line within the block */ > int startpos; > > char data[64 kB]; >} ParallelCopyDataBlock; > >Where state is one of: > >enum { > FREE, /* buffer is empty */ > FILLED, /* leader has filled the buffer with raw data */ > READY, /* start pos has been filled in, but no worker process >has claimed the block yet */ > PROCESSING, /* worker has claimed the block, and is processing it */ >} > >State changes FREE -> FILLED -> READY -> PROCESSING -> FREE. As the >COPY progresses, the ring of blocks will always look something like >this: > >blk 0 startpos 0: PROCESSING [worker 1] >blk 1 startpos 12: PROCESSING [worker 2] >blk 2 startpos 10: READY >blk 3 starptos -: FILLED >blk 4 startpos -: FILLED >blk 5 starptos -: FILLED >blk 6 startpos -: FREE >blk 7 startpos -: FREE > >Typically, each worker process is busy processing a block. After the >blocks being processed, there is one block in READY state, and after >that, blocks in FILLED state. > >Leader process: > >The leader process is simple. It picks the next FREE buffer, fills it >with raw data from the file, and marks it as FILLED. If no buffers are >FREE, wait. > >Worker process: > >1. Claim next READY block from queue, by changing its state to > PROCESSING. If the next block is not READY yet, wait until it is. > >2. Start scanning the block from 'startpos', finding end-of-line > markers. (in CSV mode, need to track when we're in-quotes). > >3. When you reach the end of the block, if the last line continues to > next block, wait for the next block to become FILLED. Peek into the > next block, and copy the remaining part of the split line to a local > buffer, and set the 'startpos' on the next block to point to the end > of the split line. Mark the next block as READY. > >4. Process all the lines in the block, call input functions, insert > rows. > >5. Mark the block as DONE. > >In this design, you don't need to keep line boundaries in shared >memory, because each worker process is responsible for finding the >line boundaries of its own block. > >There's a point of serialization here, in that the next block cannot >be processed, until the worker working on the previous block has >finished scanning the EOLs, and set the starting position on the next >block, putting it in READY state. 
That's not very different from your >patch, where you had a similar point of serialization because the >leader scanned the EOLs, but I think the coordination between >processes is simpler here. > I agree this design looks simpler. I'm a bit worried about serializing the parsing like this, though. It's true the current approach (where the first phase of parsing happens in the leader) has a similar issue, but I think it would be easier to improve that in that design. My plan was to parallelize the parsing roughly like this: 1) split the input buffer into smaller chunks 2) let workers scan the buffers and record positions of interesting characters (delimiters, quotes, ...) and pass it back to the leader 3) use the information to actually parse the input data (we only need to look at the interesting characters, skipping large parts of data) 4) pass the parsed chunks to workers, just like in the current patch But maybe something like that would be possible even with the approach you propose - we could have a special parse phase for processing each buffer, where any worker could look for the special characters, record the positions in a bitmap next to the buffer. So the whole sequence of states would look something like this: EMPTY FILLED PARSED READY PROCESSING Of course, the question is whether parsing really is sufficiently expensive for this to be worth it. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 30/10/2020 22:56, Tomas Vondra wrote: > I agree this design looks simpler. I'm a bit worried about serializing > the parsing like this, though. It's true the current approach (where the > first phase of parsing happens in the leader) has a similar issue, but I > think it would be easier to improve that in that design. > > My plan was to parallelize the parsing roughly like this: > > 1) split the input buffer into smaller chunks > > 2) let workers scan the buffers and record positions of interesting > characters (delimiters, quotes, ...) and pass it back to the leader > > 3) use the information to actually parse the input data (we only need to > look at the interesting characters, skipping large parts of data) > > 4) pass the parsed chunks to workers, just like in the current patch > > > But maybe something like that would be possible even with the approach > you propose - we could have a special parse phase for processing each > buffer, where any worker could look for the special characters, record > the positions in a bitmap next to the buffer. So the whole sequence of > states would look something like this: > > EMPTY > FILLED > PARSED > READY > PROCESSING I think it's even simpler than that. You don't need to communicate the "interesting positions" between processes, if the same worker takes care of the chunk through all states from FILLED to DONE. You can build the bitmap of interesting positions immediately in FILLED state, independently of all previous blocks. Once you've built the bitmap, you need to wait for the information on where the first line starts, but presumably finding the interesting positions is the expensive part. > Of course, the question is whether parsing really is sufficiently > expensive for this to be worth it. Yeah, I don't think it's worth it. Splitting the lines is pretty fast, I think we have many years to come before that becomes a bottleneck. But if it turns out I'm wrong and we need to implement that, the path is pretty straightforward. - Heikki
On Sat, Oct 31, 2020 at 12:09:32AM +0200, Heikki Linnakangas wrote: >On 30/10/2020 22:56, Tomas Vondra wrote: >>I agree this design looks simpler. I'm a bit worried about serializing >>the parsing like this, though. It's true the current approach (where the >>first phase of parsing happens in the leader) has a similar issue, but I >>think it would be easier to improve that in that design. >> >>My plan was to parallelize the parsing roughly like this: >> >>1) split the input buffer into smaller chunks >> >>2) let workers scan the buffers and record positions of interesting >>characters (delimiters, quotes, ...) and pass it back to the leader >> >>3) use the information to actually parse the input data (we only need to >>look at the interesting characters, skipping large parts of data) >> >>4) pass the parsed chunks to workers, just like in the current patch >> >> >>But maybe something like that would be possible even with the approach >>you propose - we could have a special parse phase for processing each >>buffer, where any worker could look for the special characters, record >>the positions in a bitmap next to the buffer. So the whole sequence of >>states would look something like this: >> >> EMPTY >> FILLED >> PARSED >> READY >> PROCESSING >I think it's even simpler than that. You don't need to communicate the >"interesting positions" between processes, if the same worker takes >care of the chunk through all states from FILLED to DONE. >You can build the bitmap of interesting positions immediately in >FILLED state, independently of all previous blocks. Once you've built >the bitmap, you need to wait for the information on where the first >line starts, but presumably finding the interesting positions is the >expensive part. > I don't think it's that simple. For example, the previous block may contain a very long value (say, 1MB), so a bunch of blocks have to be processed by the same worker. That probably makes the state transitions a bit more complex, and it also means the bitmap would need to be passed to the worker that actually processes the block. Or we might just ignore this, on the grounds that it's not a very common situation. >>Of course, the question is whether parsing really is sufficiently >>expensive for this to be worth it. >Yeah, I don't think it's worth it. Splitting the lines is pretty fast, >I think we have many years to come before that becomes a bottleneck. >But if it turns out I'm wrong and we need to implement that, the path >is pretty straightforward. > OK. I agree the parsing is relatively cheap, and I don't recall seeing CSV parsing as a bottleneck in production. I suspect that might be simply because we're hitting other bottlenecks first, though. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Oct 30, 2020 at 10:11 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> Leader process:
>
> The leader process is simple. It picks the next FREE buffer, fills it
> with raw data from the file, and marks it as FILLED. If no buffers are
> FREE, wait.
>
> Worker process:
>
> 1. Claim next READY block from queue, by changing its state to
> PROCESSING. If the next block is not READY yet, wait until it is.
>
> 2. Start scanning the block from 'startpos', finding end-of-line
> markers. (in CSV mode, need to track when we're in-quotes).
>
> 3. When you reach the end of the block, if the last line continues to
> next block, wait for the next block to become FILLED. Peek into the
> next block, and copy the remaining part of the split line to a local
> buffer, and set the 'startpos' on the next block to point to the end
> of the split line. Mark the next block as READY.
>
> 4. Process all the lines in the block, call input functions, insert
> rows.
>
> 5. Mark the block as DONE.
>
> In this design, you don't need to keep line boundaries in shared memory,
> because each worker process is responsible for finding the line
> boundaries of its own block.
>
> There's a point of serialization here, in that the next block cannot be
> processed, until the worker working on the previous block has finished
> scanning the EOLs, and set the starting position on the next block,
> putting it in READY state. That's not very different from your patch,
> where you had a similar point of serialization because the leader
> scanned the EOLs,
>

But in the design (single producer multiple consumer) used by the patch,
the worker doesn't need to wait till the complete block is processed; it
can start processing the lines already found. This will also allow
workers to start much earlier to process the data, as they don't need to
wait for all the offsets corresponding to the 64K block to be ready.
However, in the design where each worker is processing the 64K block, it
can lead to much longer waits. I think this will impact the Copy STDIN
case more, where in most cases (200-300 byte tuples) we receive data
line-by-line from the client and the line-endings are found by the
leader. If the leader doesn't find the line-endings, the workers need to
wait till the leader fills the entire 64K chunk; OTOH, with the current
approach the worker can start as soon as the leader is able to populate
some minimum number of line-endings.

The other point is that the leader backend won't be used completely as
it is only doing a very small part (primarily reading the file) of the
overall work.

We have discussed both these approaches (a) single producer multiple
consumer, and (b) all workers doing the processing as you are saying in
the beginning and concluded that (a) is better, see some of the relevant
emails [1][2][3].

[1] - https://www.postgresql.org/message-id/20200413201633.cki4nsptynq7blhg%40alap3.anarazel.de
[2] - https://www.postgresql.org/message-id/20200415181913.4gjqcnuzxfzbbzxa%40alap3.anarazel.de
[3] - https://www.postgresql.org/message-id/78C0107E-62F2-4F76-BFD8-34C73B716944%40anarazel.de

--
With Regards,
Amit Kapila.
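A standalone model of the single-producer-multiple-consumer hand-off
described above: the leader publishes each line ending as soon as it is
found, and workers claim lines without waiting for the whole 64K block
(C11 atomics; all names are illustrative, not from the patch):

#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 1024

typedef struct
{
    _Atomic uint32_t n_published;   /* line offsets made visible by the leader */
    _Atomic uint32_t n_claimed;     /* line offsets handed out to workers */
    uint32_t offsets[RING_SIZE];    /* start offset of each line in the raw data */
} LineRing;

/* Leader: publish one line ending as soon as it is found.  (A real version
 * must also avoid publishing more than RING_SIZE lines ahead of the workers;
 * that back-pressure check is elided here.) */
static void
publish_line(LineRing *r, uint32_t off)
{
    uint32_t n = atomic_load_explicit(&r->n_published, memory_order_relaxed);

    r->offsets[n % RING_SIZE] = off;
    atomic_store_explicit(&r->n_published, n + 1, memory_order_release);
}

/* Worker: claim the next unclaimed line, or return -1 if none is published
 * yet and the caller should retry; there is no need to wait for a full block. */
static int64_t
claim_line(LineRing *r)
{
    uint32_t c = atomic_load_explicit(&r->n_claimed, memory_order_relaxed);

    for (;;)
    {
        if (c >= atomic_load_explicit(&r->n_published, memory_order_acquire))
            return -1;
        if (atomic_compare_exchange_weak_explicit(&r->n_claimed, &c, c + 1,
                                                  memory_order_acq_rel,
                                                  memory_order_relaxed))
            return (int64_t) r->offsets[c % RING_SIZE];
    }
}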
On 02/11/2020 08:14, Amit Kapila wrote:
> On Fri, Oct 30, 2020 at 10:11 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>
>> Leader process:
>>
>> The leader process is simple. It picks the next FREE buffer, fills it
>> with raw data from the file, and marks it as FILLED. If no buffers are
>> FREE, wait.
>>
>> Worker process:
>>
>> 1. Claim next READY block from queue, by changing its state to
>> PROCESSING. If the next block is not READY yet, wait until it is.
>>
>> 2. Start scanning the block from 'startpos', finding end-of-line
>> markers. (in CSV mode, need to track when we're in-quotes).
>>
>> 3. When you reach the end of the block, if the last line continues to
>> next block, wait for the next block to become FILLED. Peek into the
>> next block, and copy the remaining part of the split line to a local
>> buffer, and set the 'startpos' on the next block to point to the end
>> of the split line. Mark the next block as READY.
>>
>> 4. Process all the lines in the block, call input functions, insert
>> rows.
>>
>> 5. Mark the block as DONE.
>>
>> In this design, you don't need to keep line boundaries in shared memory,
>> because each worker process is responsible for finding the line
>> boundaries of its own block.
>>
>> There's a point of serialization here, in that the next block cannot be
>> processed, until the worker working on the previous block has finished
>> scanning the EOLs, and set the starting position on the next block,
>> putting it in READY state. That's not very different from your patch,
>> where you had a similar point of serialization because the leader
>> scanned the EOLs,
>
> But in the design (single producer multiple consumer) used by the
> patch, the worker doesn't need to wait till the complete block is
> processed; it can start processing the lines already found. This will
> also allow workers to start much earlier to process the data, as they
> don't need to wait for all the offsets corresponding to the 64K block
> to be ready. However, in the design where each worker is processing
> the 64K block, it can lead to much longer waits. I think this will
> impact the Copy STDIN case more, where in most cases (200-300 byte
> tuples) we receive data line-by-line from the client and the
> line-endings are found by the leader. If the leader doesn't find the
> line-endings, the workers need to wait till the leader fills the
> entire 64K chunk; OTOH, with the current approach the worker can start
> as soon as the leader is able to populate some minimum number of
> line-endings.

You can use a smaller block size. However, the point of parallel copy is
to maximize bandwidth. If the workers ever have to sit idle, it means
that the bottleneck is in receiving data from the client, i.e. the
backend is fast enough, and you can't make the overall COPY finish any
faster no matter how you do it.

> The other point is that the leader backend won't be used completely as
> it is only doing a very small part (primarily reading the file) of the
> overall work.

An idle process doesn't cost anything. If you have free CPU resources,
use more workers.

> We have discussed both these approaches (a) single producer multiple
> consumer, and (b) all workers doing the processing as you are saying
> in the beginning and concluded that (a) is better, see some of the
> relevant emails [1][2][3].
>
> [1] - https://www.postgresql.org/message-id/20200413201633.cki4nsptynq7blhg%40alap3.anarazel.de
> [2] - https://www.postgresql.org/message-id/20200415181913.4gjqcnuzxfzbbzxa%40alap3.anarazel.de
> [3] - https://www.postgresql.org/message-id/78C0107E-62F2-4F76-BFD8-34C73B716944%40anarazel.de

Sorry I'm late to the party. I don't think the design I proposed was
discussed in those threads. The alternative that's discussed in those
threads seems to be something much more fine-grained, where processes
claim individual lines. I'm not sure though, I didn't fully understand
the alternative designs.

I want to throw out one more idea. It's an interim step, not the final
solution we want, but a useful step in getting there:

Have the leader process scan the input for line-endings. Split the input
data into blocks of slightly under 64 kB in size, so that a line never
crosses a block. Put the blocks in shared memory.

A worker process claims a block from shared memory, processes it from
beginning to end. It *also* has to parse the input to split it into
lines.

In this design, the line-splitting is done twice. That's clearly not
optimal, and we want to avoid that in the final patch, but I think it
would be a useful milestone. After that patch is done, write another
patch to either a) implement the design I sketched, where blocks are
fixed-size and a worker notifies the next worker on where the first line
in next block begins, or b) have the leader process report the
line-ending positions in shared memory, so that workers don't need to
scan them again.

Even if we apply the patches together, I think splitting them like that
would make for easier review.

- Heikki
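The leader-side packing loop for this interim design could look roughly
like this (a standalone sketch with a stubbed hand-off; in the real
thing the blocks would live in shared memory):

#include <string.h>

#define BLOCK_SIZE 65536

static char block[BLOCK_SIZE];
static int  used;

/* Hand a finished block to a worker; stubbed out in this sketch. */
static void
emit_block(const char *data, int len)
{
    (void) data;
    (void) len;
}

/* Leader: append one whole line (len includes the newline); flush the current
 * block first if the line wouldn't fit, so no line ever crosses a block. */
static void
add_line(const char *line, int len)
{
    if (len > BLOCK_SIZE)
        return;                 /* an over-long line needs a separate fallback */

    if (used + len > BLOCK_SIZE)
    {
        emit_block(block, used);    /* block ends exactly on a line boundary */
        used = 0;
    }
    memcpy(block + used, line, len);
    used += len;
}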
On 02/11/2020 09:10, Heikki Linnakangas wrote:
> On 02/11/2020 08:14, Amit Kapila wrote:
>> We have discussed both these approaches (a) single producer multiple
>> consumer, and (b) all workers doing the processing as you are saying
>> in the beginning and concluded that (a) is better, see some of the
>> relevant emails [1][2][3].
>>
>> [1] - https://www.postgresql.org/message-id/20200413201633.cki4nsptynq7blhg%40alap3.anarazel.de
>> [2] - https://www.postgresql.org/message-id/20200415181913.4gjqcnuzxfzbbzxa%40alap3.anarazel.de
>> [3] - https://www.postgresql.org/message-id/78C0107E-62F2-4F76-BFD8-34C73B716944%40anarazel.de
>
> Sorry I'm late to the party. I don't think the design I proposed was
> discussed in those threads. The alternative that's discussed in those
> threads seems to be something much more fine-grained, where processes
> claim individual lines. I'm not sure though, I didn't fully understand
> the alternative designs.

I read the thread more carefully, and I think Robert had basically the
right idea here
(https://www.postgresql.org/message-id/CA%2BTgmoZMU4az9MmdJtg04pjRa0wmWQtmoMxttdxNrupYJNcR3w%40mail.gmail.com):

> I really think we don't want a single worker in charge of finding
> tuple boundaries for everybody. That adds a lot of unnecessary
> inter-process communication and synchronization. Each process should
> just get the next tuple starting after where the last one ended, and
> then advance the end pointer so that the next process can do the same
> thing. [...]

And here
(https://www.postgresql.org/message-id/CA%2BTgmoZw%2BF3y%2BoaxEsHEZBxdL1x1KAJ7pRMNgCqX0WjmjGNLrA%40mail.gmail.com):

> On Thu, Apr 9, 2020 at 2:55 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>> I'm fairly certain that we do *not* want to distribute input data
>> between processes on a single tuple basis. Probably not even below
>> a few hundred kb. If there's any sort of natural clustering in the
>> loaded data - extremely common, think timestamps - splitting on a
>> granular basis will make indexing much more expensive. And have a lot
>> more contention.
>
> That's a fair point. I think the solution ought to be that once any
> process starts finding line endings, it continues until it's grabbed
> at least a certain amount of data for itself. Then it stops and lets
> some other process grab a chunk of data.

Yes! That's pretty close to the design I sketched. I imagined that the
leader would divide the input into 64 kB blocks, and each block would
have a few metadata fields, notably the starting position of the first
line in the block. I think Robert envisioned having a single "next
starting position" field in shared memory. That works too, and is even
simpler, so +1 for that.

For some reason, the discussion took a different turn from there, to
discuss how the line-endings (called "chunks" in the discussion) should
be represented in shared memory. But none of that is necessary with
Robert's design.

- Heikki
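A standalone sketch of Robert's single shared "next starting position"
(all names illustrative; a flat in-memory input stands in for the stream
of raw-data blocks, and CSV quoting is ignored):

#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define MIN_GRAB 65536          /* keep naturally-clustered rows together */

static _Atomic size_t next_start;   /* shared: first unclaimed input byte */

/* Claim the next run of whole lines, at least MIN_GRAB bytes if available.
 * Only one process may run this at a time (a real version would serialize it
 * with a lock); the updated next_start then lets the next process continue
 * from exactly where this one stopped. */
static size_t
claim_chunk(const char *input, size_t input_len, size_t *chunk_start)
{
    size_t start = atomic_load(&next_start);
    size_t end = start;

    if (start >= input_len)
        return 0;               /* input exhausted */

    while (end < input_len && end - start < MIN_GRAB)
    {
        const char *nl = memchr(input + end, '\n', input_len - end);

        if (nl == NULL)
        {
            end = input_len;    /* last (unterminated) line */
            break;
        }
        end = (size_t) (nl - input) + 1;
    }

    atomic_store(&next_start, end);
    *chunk_start = start;
    return end - start;
}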
On Mon, Nov 2, 2020 at 12:40 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 02/11/2020 08:14, Amit Kapila wrote:
> > On Fri, Oct 30, 2020 at 10:11 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >>
> >> In this design, you don't need to keep line boundaries in shared memory,
> >> because each worker process is responsible for finding the line
> >> boundaries of its own block.
> >>
> >> There's a point of serialization here, in that the next block cannot be
> >> processed, until the worker working on the previous block has finished
> >> scanning the EOLs, and set the starting position on the next block,
> >> putting it in READY state. That's not very different from your patch,
> >> where you had a similar point of serialization because the leader
> >> scanned the EOLs,
> >
> > But in the design (single producer multiple consumer) used by the
> > patch, the worker doesn't need to wait till the complete block is
> > processed; it can start processing the lines already found. This will
> > also allow workers to start much earlier to process the data, as they
> > don't need to wait for all the offsets corresponding to the 64K block
> > to be ready. However, in the design where each worker is processing
> > the 64K block, it can lead to much longer waits. I think this will
> > impact the Copy STDIN case more, where in most cases (200-300 byte
> > tuples) we receive data line-by-line from the client and the
> > line-endings are found by the leader. If the leader doesn't find the
> > line-endings, the workers need to wait till the leader fills the
> > entire 64K chunk; OTOH, with the current approach the worker can
> > start as soon as the leader is able to populate some minimum number
> > of line-endings.
>
> You can use a smaller block size.
>

Sure, but the same problem can happen if the last line in that block is
too long and we need to peek into the next block. And then there could
be cases where a single line could be greater than 64K.

> However, the point of parallel copy is
> to maximize bandwidth.
>

Okay, but this first-phase (finding the line boundaries) anyway cannot
be done in parallel, and we have seen in some of the initial
benchmarking that this initial phase is a small part of the work,
especially when the table has indexes, constraints, etc. So, I think it
won't matter much whether this splitting is done in a single process or
in multiple processes.

> If the workers ever have to sit idle, it means
> that the bottleneck is in receiving data from the client, i.e. the
> backend is fast enough, and you can't make the overall COPY finish any
> faster no matter how you do it.
>
> > The other point is that the leader backend won't be used completely as
> > it is only doing a very small part (primarily reading the file) of the
> > overall work.
>
> An idle process doesn't cost anything. If you have free CPU resources,
> use more workers.
>
> > We have discussed both these approaches (a) single producer multiple
> > consumer, and (b) all workers doing the processing as you are saying
> > in the beginning and concluded that (a) is better, see some of the
> > relevant emails [1][2][3].
> >
> > [1] - https://www.postgresql.org/message-id/20200413201633.cki4nsptynq7blhg%40alap3.anarazel.de
> > [2] - https://www.postgresql.org/message-id/20200415181913.4gjqcnuzxfzbbzxa%40alap3.anarazel.de
> > [3] - https://www.postgresql.org/message-id/78C0107E-62F2-4F76-BFD8-34C73B716944%40anarazel.de
>
> Sorry I'm late to the party. I don't think the design I proposed was
> discussed in those threads.
>

I think something close to that is discussed, as you have noticed, in
your next email, but IIRC, because many people (Andres, Ants, myself and
the author) favoured the current approach (single reader and multiple
consumers) we decided to go with that. I feel this patch is very much in
the POC stage due to which the code doesn't look good, and as we move
forward we need to see what is the better way to improve it; maybe one
of the ways is to split it as you are suggesting so that it can be
easier to review.

I think the other important thing which this patch has not addressed
properly is the parallel-safety checks, as pointed out by me earlier.
There are two things to solve there: (a) the lower-level code (like
heap_* APIs, CommandCounterIncrement, xact.c APIs, etc.) has checks
which don't allow any writes; we need to see which of those we can open
now (or do some additional work to avoid hitting those checks) after
some of the work done for parallel-writes in PG-13 [1][2], and (b) in
which cases parallel-writes (parallel copy) are allowed; for example, we
need to identify whether the table or one of its partitions has any
constraint/expression which is parallel-unsafe.

[1] 85f6b49 Allow relation extension lock to conflict among parallel group members
[2] 3ba59cc Allow page lock to conflict among parallel group members

> I want to throw out one more idea. It's an interim step, not the final
> solution we want, but a useful step in getting there:
>
> Have the leader process scan the input for line-endings. Split the input
> data into blocks of slightly under 64 kB in size, so that a line never
> crosses a block. Put the blocks in shared memory.
>
> A worker process claims a block from shared memory, processes it from
> beginning to end. It *also* has to parse the input to split it into lines.
>
> In this design, the line-splitting is done twice. That's clearly not
> optimal, and we want to avoid that in the final patch, but I think it
> would be a useful milestone. After that patch is done, write another
> patch to either a) implement the design I sketched, where blocks are
> fixed-size and a worker notifies the next worker on where the first line
> in next block begins, or b) have the leader process report the
> line-ending positions in shared memory, so that workers don't need to
> scan them again.
>
> Even if we apply the patches together, I think splitting them like that
> would make for easier review.
>

I think this is worth exploring, especially if it makes the patch easier
to review.

--
With Regards,
Amit Kapila.
On 03/11/2020 10:59, Amit Kapila wrote:
> On Mon, Nov 2, 2020 at 12:40 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> However, the point of parallel copy is to maximize bandwidth.
>
> Okay, but this first-phase (finding the line boundaries) anyway cannot
> be done in parallel, and we have seen in some of the initial
> benchmarking that this initial phase is a small part of the work,
> especially when the table has indexes, constraints, etc. So, I think
> it won't matter much whether this splitting is done in a single
> process or in multiple processes.

Right, it won't matter performance-wise. That's not my point. The
difference is in the complexity. If you don't store the line boundaries
in shared memory, you get away with much simpler shared memory
structures.

> I think something close to that is discussed as you have noticed in
> your next email but IIRC, because many people (Andres, Ants, myself
> and the author) favoured the current approach (single reader and
> multiple consumers) we decided to go with that. I feel this patch is
> very much in the POC stage due to which the code doesn't look good,
> and as we move forward we need to see what is the better way to
> improve it; maybe one of the ways is to split it as you are suggesting
> so that it can be easier to review.

Sure. I think the roadmap here is:

1. Split copy.c [1]. Not strictly necessary, but I think it'd make this
nicer to review and work with.

2. Refactor CopyReadLine(), so that finding the line-endings and the
rest of the line-parsing are separated into separate functions.

3. Implement parallel copy.

> I think the other important thing which this patch has not addressed
> properly is the parallel-safety checks, as pointed out by me earlier.
> There are two things to solve there: (a) the lower-level code (like
> heap_* APIs, CommandCounterIncrement, xact.c APIs, etc.) has checks
> which don't allow any writes; we need to see which of those we can
> open now (or do some additional work to avoid hitting those checks)
> after some of the work done for parallel-writes in PG-13, and (b) in
> which cases parallel-writes (parallel copy) are allowed; for example,
> we need to identify whether the table or one of its partitions has any
> constraint/expression which is parallel-unsafe.

Agreed, that needs to be solved. I haven't given it any thought myself.

- Heikki

[1] https://www.postgresql.org/message-id/8e15b560-f387-7acc-ac90-763986617bfb%40iki.fi
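For step 2 of that roadmap, the split could have roughly this shape
(hypothetical signatures sketched against today's copy.c; CopyReadLine()
itself exists, the two helpers do not):

/* Scan the raw buffer for the next end-of-line, tracking CSV quoting state;
 * no field splitting or per-field work happens here. */
static bool CopyScanLineBoundary(CopyState cstate, int *line_start, int *line_end);

/* Everything else CopyReadLine() does once the boundary is known: copy the
 * line into cstate->line_buf and deal with encoding conversion. */
static bool CopyProcessLine(CopyState cstate, int line_start, int line_end);

/* CopyReadLine() itself would then become a thin wrapper: */
static bool
CopyReadLine(CopyState cstate)
{
    int     line_start;
    int     line_end;

    if (CopyScanLineBoundary(cstate, &line_start, &line_end))
        return true;            /* reached EOF, as today */
    return CopyProcessLine(cstate, line_start, line_end);
}

With that split, a parallel leader (or worker) can call
CopyScanLineBoundary() alone, which is exactly the part the competing
designs want to place differently.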
Hi

>
> my $bytes = $ARGV[0];
> for(my $i = 0; $i < $bytes; $i+=8){
> print "longdata";
> }
> print "\n";
> --------
>
> postgres=# copy longdata from program 'perl /tmp/longdata.pl 100000000'
> with (parallel 2);
>
> This gets stuck forever (or at least I didn't have the patience to wait
> it finish). Both worker processes are consuming 100% of CPU.

I had a look over this problem.

The ParallelCopyDataBlock has a size limit:

    uint8   skip_bytes;
    char    data[DATA_BLOCK_SIZE];  /* data read from file */

It seems the input line is so long that the leader process runs out of
the shared memory among parallel copy workers, and the leader process
keeps waiting for a free block.

The worker process waits until line_state becomes
LINE_LEADER_POPULATED, but the leader process won't set the line_state
unless it has read the whole line.

So it is stuck forever. Maybe we should reconsider this situation.

The stacks are as follows:

Leader stack:
#3  0x000000000075f7a1 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=41, timeout=timeout@entry=1, wait_event_info=wait_event_info@entry=150994945) at latch.c:411
#4  0x00000000005a9245 in WaitGetFreeCopyBlock (pcshared_info=pcshared_info@entry=0x7f26d2ed3580) at copyparallel.c:1546
#5  0x00000000005a98ce in SetRawBufForLoad (cstate=cstate@entry=0x2978a88, line_size=67108864, copy_buf_len=copy_buf_len@entry=65536, raw_buf_ptr=raw_buf_ptr@entry=65536, copy_raw_buf=copy_raw_buf@entry=0x7fff4cdc0e18) at copyparallel.c:1572
#6  0x00000000005a1963 in CopyReadLineText (cstate=cstate@entry=0x2978a88) at copy.c:4058
#7  0x00000000005a4e76 in CopyReadLine (cstate=cstate@entry=0x2978a88) at copy.c:3863

Worker stack:
#0  GetLinePosition (cstate=cstate@entry=0x29e1f28) at copyparallel.c:1474
#1  0x00000000005a8aa4 in CacheLineInfo (cstate=cstate@entry=0x29e1f28, buff_count=buff_count@entry=0) at copyparallel.c:711
#2  0x00000000005a8e46 in GetWorkerLine (cstate=cstate@entry=0x29e1f28) at copyparallel.c:885
#3  0x00000000005a4f2e in NextCopyFromRawFields (cstate=cstate@entry=0x29e1f28, fields=fields@entry=0x7fff4cdc0b48, nfields=nfields@entry=0x7fff4cdc0b44) at copy.c:3615
#4  0x00000000005a50af in NextCopyFrom (cstate=cstate@entry=0x29e1f28, econtext=econtext@entry=0x2a358d8, values=0x2a42068, nulls=0x2a42070) at copy.c:3696
#5  0x00000000005a5b90 in CopyFrom (cstate=cstate@entry=0x29e1f28) at copy.c:2985

Best regards,
houzj
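In outline, the wait cycle above is: the leader, inside
CopyReadLineText(), blocks in WaitGetFreeCopyBlock() until some data
block is freed, while every worker spins in GetLinePosition() until the
line reaches LINE_LEADER_POPULATED; no block can be freed before the
line completes, and the line cannot complete before a block is freed.
One illustrative guard, not the actual fix that later went into v10,
would be to detect up front that a line cannot fit in the shared pool
and let the caller fall back to a serial path (MAX_BLOCKS is a
hypothetical pool size):

#include <stdbool.h>
#include <stdint.h>

#define DATA_BLOCK_SIZE 65536
#define MAX_BLOCKS      1024    /* hypothetical size of the shared block pool */

/* The deadlock requires a single line to be larger than everything the
 * leader can ever get back from the workers, so refuse to publish such a
 * line through shared memory in the first place. */
static bool
line_fits_in_shared_pool(uint64_t line_size)
{
    /* leave one block of slack for the partial line already written */
    return line_size <= (uint64_t) DATA_BLOCK_SIZE * (MAX_BLOCKS - 1);
}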
On Thu, Nov 5, 2020 at 6:33 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote:
>
> Hi
>
> >
> > my $bytes = $ARGV[0];
> > for(my $i = 0; $i < $bytes; $i+=8){
> > print "longdata";
> > }
> > print "\n";
> > --------
> >
> > postgres=# copy longdata from program 'perl /tmp/longdata.pl 100000000'
> > with (parallel 2);
> >
> > This gets stuck forever (or at least I didn't have the patience to wait
> > it finish). Both worker processes are consuming 100% of CPU.
>
> I had a look over this problem.
>
> The ParallelCopyDataBlock has a size limit:
>
>     uint8   skip_bytes;
>     char    data[DATA_BLOCK_SIZE];  /* data read from file */
>
> It seems the input line is so long that the leader process runs out of
> the shared memory among parallel copy workers, and the leader process
> keeps waiting for a free block.
>
> The worker process waits until line_state becomes
> LINE_LEADER_POPULATED, but the leader process won't set the line_state
> unless it has read the whole line.
>
> So it is stuck forever. Maybe we should reconsider this situation.
>
> The stacks are as follows:
>
> Leader stack:
> #3  0x000000000075f7a1 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=41, timeout=timeout@entry=1, wait_event_info=wait_event_info@entry=150994945) at latch.c:411
> #4  0x00000000005a9245 in WaitGetFreeCopyBlock (pcshared_info=pcshared_info@entry=0x7f26d2ed3580) at copyparallel.c:1546
> #5  0x00000000005a98ce in SetRawBufForLoad (cstate=cstate@entry=0x2978a88, line_size=67108864, copy_buf_len=copy_buf_len@entry=65536, raw_buf_ptr=raw_buf_ptr@entry=65536, copy_raw_buf=copy_raw_buf@entry=0x7fff4cdc0e18) at copyparallel.c:1572
> #6  0x00000000005a1963 in CopyReadLineText (cstate=cstate@entry=0x2978a88) at copy.c:4058
> #7  0x00000000005a4e76 in CopyReadLine (cstate=cstate@entry=0x2978a88) at copy.c:3863
>
> Worker stack:
> #0  GetLinePosition (cstate=cstate@entry=0x29e1f28) at copyparallel.c:1474
> #1  0x00000000005a8aa4 in CacheLineInfo (cstate=cstate@entry=0x29e1f28, buff_count=buff_count@entry=0) at copyparallel.c:711
> #2  0x00000000005a8e46 in GetWorkerLine (cstate=cstate@entry=0x29e1f28) at copyparallel.c:885
> #3  0x00000000005a4f2e in NextCopyFromRawFields (cstate=cstate@entry=0x29e1f28, fields=fields@entry=0x7fff4cdc0b48, nfields=nfields@entry=0x7fff4cdc0b44) at copy.c:3615
> #4  0x00000000005a50af in NextCopyFrom (cstate=cstate@entry=0x29e1f28, econtext=econtext@entry=0x2a358d8, values=0x2a42068, nulls=0x2a42070) at copy.c:3696
> #5  0x00000000005a5b90 in CopyFrom (cstate=cstate@entry=0x29e1f28) at copy.c:2985
>

Thanks for providing your thoughts. I have analyzed this issue and I'm
working on the fix for this; I will be posting a patch for it shortly.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Tue, Nov 3, 2020 at 2:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Nov 2, 2020 at 12:40 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >
> > On 02/11/2020 08:14, Amit Kapila wrote:
> > > On Fri, Oct 30, 2020 at 10:11 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > >>
> > >> In this design, you don't need to keep line boundaries in shared memory,
> > >> because each worker process is responsible for finding the line
> > >> boundaries of its own block.
> > >>
> > >> There's a point of serialization here, in that the next block cannot be
> > >> processed, until the worker working on the previous block has finished
> > >> scanning the EOLs, and set the starting position on the next block,
> > >> putting it in READY state. That's not very different from your patch,
> > >> where you had a similar point of serialization because the leader
> > >> scanned the EOLs,
> > >
> > > But in the design (single producer multiple consumer) used by the
> > > patch, the worker doesn't need to wait till the complete block is
> > > processed; it can start processing the lines already found. This
> > > will also allow workers to start much earlier to process the data,
> > > as they don't need to wait for all the offsets corresponding to the
> > > 64K block to be ready. However, in the design where each worker is
> > > processing the 64K block, it can lead to much longer waits. I think
> > > this will impact the Copy STDIN case more, where in most cases
> > > (200-300 byte tuples) we receive data line-by-line from the client
> > > and the line-endings are found by the leader. If the leader doesn't
> > > find the line-endings, the workers need to wait till the leader
> > > fills the entire 64K chunk; OTOH, with the current approach the
> > > worker can start as soon as the leader is able to populate some
> > > minimum number of line-endings.
> >
> > You can use a smaller block size.
> >
>
> Sure, but the same problem can happen if the last line in that block
> is too long and we need to peek into the next block. And then there
> could be cases where a single line could be greater than 64K.
>
> > However, the point of parallel copy is
> > to maximize bandwidth.
> >
>
> Okay, but this first-phase (finding the line boundaries) anyway cannot
> be done in parallel, and we have seen in some of the initial
> benchmarking that this initial phase is a small part of the work,
> especially when the table has indexes, constraints, etc. So, I think
> it won't matter much whether this splitting is done in a single
> process or in multiple processes.
>
> > If the workers ever have to sit idle, it means
> > that the bottleneck is in receiving data from the client, i.e. the
> > backend is fast enough, and you can't make the overall COPY finish any
> > faster no matter how you do it.
> >
> > > The other point is that the leader backend won't be used completely as
> > > it is only doing a very small part (primarily reading the file) of the
> > > overall work.
> >
> > An idle process doesn't cost anything. If you have free CPU resources,
> > use more workers.
> >
> > > We have discussed both these approaches (a) single producer multiple
> > > consumer, and (b) all workers doing the processing as you are saying
> > > in the beginning and concluded that (a) is better, see some of the
> > > relevant emails [1][2][3].
> > >
> > > [1] - https://www.postgresql.org/message-id/20200413201633.cki4nsptynq7blhg%40alap3.anarazel.de
> > > [2] - https://www.postgresql.org/message-id/20200415181913.4gjqcnuzxfzbbzxa%40alap3.anarazel.de
> > > [3] - https://www.postgresql.org/message-id/78C0107E-62F2-4F76-BFD8-34C73B716944%40anarazel.de
> >
> > Sorry I'm late to the party. I don't think the design I proposed was
> > discussed in those threads.
>
> I think something close to that is discussed, as you have noticed, in
> your next email, but IIRC, because many people (Andres, Ants, myself
> and the author) favoured the current approach (single reader and
> multiple consumers) we decided to go with that. I feel this patch is
> very much in the POC stage due to which the code doesn't look good,
> and as we move forward we need to see what is the better way to
> improve it; maybe one of the ways is to split it as you are suggesting
> so that it can be easier to review. I think the other important thing
> which this patch has not addressed properly is the parallel-safety
> checks, as pointed out by me earlier. There are two things to solve
> there: (a) the lower-level code (like heap_* APIs,
> CommandCounterIncrement, xact.c APIs, etc.) has checks which don't
> allow any writes; we need to see which of those we can open now (or do
> some additional work to avoid hitting those checks) after some of the
> work done for parallel-writes in PG-13, and (b) in which cases
> parallel-writes (parallel copy) are allowed; for example, we need to
> identify whether the table or one of its partitions has any
> constraint/expression which is parallel-unsafe.
>

I have worked to provide a patch for the parallel safety checks. It
checks whether parallel copy can be performed. Parallel copy cannot be
performed in the following cases:
a) if the relation is a temporary table,
b) if the relation is a foreign table,
c) if the relation has non-parallel-safe index expressions,
d) if the relation has triggers of any type other than a before
statement trigger,
e) if the relation has check constraints which are not parallel safe,
f) if the relation is partitioned and any partition falls under the
above types.
This patch has the checks for it and will be used by the parallel copy
implementation. Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Attachment
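Condensed into pseudo-C, using the field names quoted earlier in this
thread, the check has roughly this shape (a sketch of the shape only,
not the patch's code; the index-expression, check-constraint and
partition walks are elided):

static bool
IsParallelCopyAllowed(CopyState cstate)
{
    Relation    rel = cstate->rel;

    /* (a) temporary tables are backend-local, so workers cannot see them */
    if (rel->rd_rel->relpersistence == RELPERSISTENCE_TEMP)
        return false;

    /* (b) foreign tables */
    if (rel->rd_rel->relkind == RELKIND_FOREIGN_TABLE)
        return false;

    /* (d) any trigger other than a BEFORE STATEMENT trigger */
    if (rel->trigdesc != NULL &&
        (rel->trigdesc->trig_insert_after_statement ||
         rel->trigdesc->trig_insert_new_table ||
         rel->trigdesc->trig_insert_before_row ||
         rel->trigdesc->trig_insert_after_row ||
         rel->trigdesc->trig_insert_instead_row))
        return false;

    /*
     * (c) index expressions and (e) check constraints must be parallel
     * safe, and (f) every partition must pass all of the above; those
     * walks are elided in this sketch.
     */
    return true;
}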
On Tue, Nov 10, 2020 at 7:12 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Tue, Nov 3, 2020 at 2:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
>
> I have worked to provide a patch for the parallel safety checks. It
> checks whether parallel copy can be performed. Parallel copy cannot be
> performed in the following cases: a) if the relation is a temporary
> table, b) if the relation is a foreign table, c) if the relation has
> non-parallel-safe index expressions, d) if the relation has triggers
> of any type other than a before statement trigger, e) if the relation
> has check constraints which are not parallel safe, f) if the relation
> is partitioned and any partition falls under the above types. This
> patch has the checks for it and will be used by the parallel copy
> implementation.

How did you ensure that this is sufficient? For the parallel-insert
patch we have enabled parallel-mode for Inserts and ran the tests with
force_parallel_mode to see if we are not missing anything. Also, it
seems there are many common things here w.r.t. the parallel-insert
patch; is it possible to prepare this atop that patch, or do you have
any reason to keep this separate?

--
With Regards,
Amit Kapila.
On Tue, Nov 10, 2020 at 7:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Nov 10, 2020 at 7:12 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > I have worked to provide a patch for the parallel safety checks. It
> > checks whether parallel copy can be performed. Parallel copy cannot
> > be performed in the following cases: a) if the relation is a
> > temporary table, b) if the relation is a foreign table, c) if the
> > relation has non-parallel-safe index expressions, d) if the relation
> > has triggers of any type other than a before statement trigger, e)
> > if the relation has check constraints which are not parallel safe,
> > f) if the relation is partitioned and any partition falls under the
> > above types. This patch has the checks for it and will be used by
> > the parallel copy implementation.
>
> How did you ensure that this is sufficient? For the parallel-insert
> patch we have enabled parallel-mode for Inserts and ran the tests with
> force_parallel_mode to see if we are not missing anything. Also, it
> seems there are many common things here w.r.t. the parallel-insert
> patch; is it possible to prepare this atop that patch, or do you have
> any reason to keep this separate?
>

I have done similar testing for copy too: I set force_parallel_mode to
regress, hardcoded the code to pick parallel workers for the copy
operation, and ran make installcheck-world to verify. Many checks in
this patch are common between both patches, but I was not sure how to
handle it as both the projects are in progress and are being updated
based on the reviewers' opinions. How to handle this? Thoughts?

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Wed, Nov 11, 2020 at 10:42 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Tue, Nov 10, 2020 at 7:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > How did you ensure that this is sufficient? For the parallel-insert
> > patch we have enabled parallel-mode for Inserts and ran the tests
> > with force_parallel_mode to see if we are not missing anything.
> > Also, it seems there are many common things here w.r.t. the
> > parallel-insert patch; is it possible to prepare this atop that
> > patch, or do you have any reason to keep this separate?
> >
>
> I have done similar testing for copy too: I set force_parallel_mode
> to regress, hardcoded the code to pick parallel workers for the copy
> operation, and ran make installcheck-world to verify. Many checks in
> this patch are common between both patches, but I was not sure how to
> handle it as both the projects are in progress and are being updated
> based on the reviewers' opinions. How to handle this? Thoughts?
>

I have not studied the differences in detail, but if it is possible to
prepare it on top of that patch then there shouldn't be a problem. To
avoid confusion, if you want, you can always either post the latest
version of that patch with your patch or point to it.

--
With Regards,
Amit Kapila.
On Thu, Oct 29, 2020 at 2:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> 4) Worker has to hop through all the processed chunks before getting
> the chunk which it can process.
>
> One more point, I have noticed that some time back [1], I have given
> one suggestion related to the way workers process the set of lines
> (aka chunk). I think you can try by increasing the chunk size to say
> 100, 500, 1000 and use some shared counter to remember the number of
> chunks processed.
>
Hi, I did some analysis on using a spinlock-protected worker write position (i.e. each worker acquires a spinlock on a shared write position to choose the next available chunk) vs each worker hopping to get the next available chunk position:
Use Case: 10mn rows, 5.6GB data, 2 indexes on integer columns, 1 index on text column. Results are of the form (no of workers, total exec time in sec, index insertion time in sec, worker write pos get time in sec, buffer contention event count):
With spinlock:
(1,1126.443,1060.067,0.478,0), (2,669.343,630.769,0.306,26), (4,346.297,326.950,0.161,89), (8,209.600,196.417,0.088,291), (16,166.113,157.086,0.065,1468), (20,173.884,166.013,0.067,2700), (30,173.087,1166.565,0.0065,5346)
Without spinlock:
(1,1119.695,1054.586,0.496,0), (2,645.733,608.313,1.5,8), (4,340.620,320.344,1.6,58), (8,203.985,189.644,1.3,222), (16,142.997,133.045,1,813), (20,132.621,122.527,1.1,1215), (30,135.737,126.716,1.5,2901)
With the spinlock, each worker gets the required write position quickly and proceeds further till the index insertion (which becomes a single point of contention), where we observed more buffer lock contention. The reason is that all the workers reach the index insertion point at around the same time.
Without the spinlock, each worker spends some time hopping to get the write position while the other workers are inserting into the indexes. So basically, the workers do not all reach the index insertion point at the same time, and hence there is less buffer lock contention.
The same behaviour (explained above) is observed with different worker chunk counts (default 64, 128, 512 and 1024), i.e. the number of tuples each worker caches into its local memory before inserting into the table.
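To model the two claiming schemes side by side (a standalone C11 sketch, with an atomic counter standing in for the spinlock-protected position; names are illustrative):

#include <stdatomic.h>
#include <stdint.h>

/* Scheme 1: a single shared counter; each worker gets its chunk in one
 * atomic instruction instead of scanning past chunks taken by others. */
static _Atomic uint32_t next_chunk;

static uint32_t
claim_chunk_counter(void)
{
    return atomic_fetch_add(&next_chunk, 1);
}

/* Scheme 2: "hopping"; scan the chunk array for the first entry that is
 * still unclaimed, retrying entries lost to races. */
static _Atomic uint8_t chunk_taken[1024];   /* hypothetical chunk count */

static int32_t
claim_chunk_hopping(uint32_t nchunks)
{
    for (uint32_t i = 0; i < nchunks; i++)
    {
        uint8_t expected = 0;

        if (atomic_compare_exchange_strong(&chunk_taken[i], &expected, 1))
            return (int32_t) i;
    }
    return -1;                  /* all chunks claimed */
}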
In summary: with the spinlock, it looks like we are able to avoid workers waiting to get the next chunk, which also means that we are not creating any contention point inside the parallel copy code. However, this causes another choking point, i.e. index insertion if indexes are present on the table, which is out of scope of the parallel copy code. We think that it would be good to use a spinlock-protected worker write position, or an atomic variable for the worker write position (as it performs equal to the spinlock, or a little better on some platforms). Thoughts?
With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Oct 29, 2020 at 11:45 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Tue, Oct 27, 2020 at 7:06 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> [latest version]
>
> I think the parallel-safety checks in this patch
> (v9-0002-Allow-copy-from-command-to-process-data-from-file) are
> incomplete and wrong. See below comments.
> 1.
> +static pg_attribute_always_inline bool
> +CheckExprParallelSafety(CopyState cstate)
> +{
> +   if (contain_volatile_functions(cstate->whereClause))
> +   {
> +       if (max_parallel_hazard((Query *) cstate->whereClause) != PROPARALLEL_SAFE)
> +           return false;
> +   }
>
> I don't understand the above check. Why do we only need to check where
> clause for parallel-safety when it contains volatile functions? It
> should be checked otherwise as well, no? The similar comment applies
> to other checks in this function. Also, I don't think there is a need
> to make this function inline.
>

I felt we should check whether the where clause is parallel safe and
also check that it does not contain volatile functions; this is to
avoid cases where expressions may query the table we're inserting into.
Modified it accordingly.

> 2.
> +/*
> + * IsParallelCopyAllowed
> + *
> + * Check if parallel copy can be allowed.
> + */
> +bool
> +IsParallelCopyAllowed(CopyState cstate)
> {
> ..
> + * When there are BEFORE/AFTER/INSTEAD OF row triggers on the table. We do
> + * not allow parallelism in such cases because such triggers might query
> + * the table we are inserting into and act differently if the tuples that
> + * have already been processed and prepared for insertion are not there.
> + * Now, if we allow parallelism with such triggers the behaviour would
> + * depend on if the parallel worker has already inserted or not that
> + * particular tuples.
> + */
> + if (cstate->rel->trigdesc != NULL &&
> + (cstate->rel->trigdesc->trig_insert_after_statement ||
> + cstate->rel->trigdesc->trig_insert_new_table ||
> + cstate->rel->trigdesc->trig_insert_before_row ||
> + cstate->rel->trigdesc->trig_insert_after_row ||
> + cstate->rel->trigdesc->trig_insert_instead_row))
> + return false;
> ..
>
> Why do we need to disable parallelism for before/after row triggers
> unless they have parallel-unsafe functions? I see a few lines down in
> this function you are checking parallel-safety of trigger functions,
> what is the use of the same if you are already disabling parallelism
> with the above check.
>

Currently only the before statement trigger is supported; the rest of
the triggers are not supported, and comments for the same are mentioned
atop the checks. Removed the parallel-safe check which was not required.

> 3. What about if the index on table has expressions that are
> parallel-unsafe? What is your strategy to check parallel-safety for
> partitioned tables?
>
> I suggest checking Greg's patch for parallel-safety of Inserts [1]. I
> think you will find that most of those checks are required here as
> well and see how we can use that patch (at least what is common). I
> feel the first patch should be just to have parallel-safety checks and
> we can test that by trying to enable Copy with force_parallel_mode. We
> can build the rest of the patch atop of it or in other words, let's
> move all parallel-safety work into a separate patch.
>

I have made this a separate patch as of now. I will work on seeing if I
can use Greg's changes as they are; if required, I will provide a few
review comments on top of Greg's patch so that it is usable for
parallel copy too, and later post a separate patch with the changes on
top of it. I will retain it as a separate patch till that time.

> Few assorted comments:
> ========================
> 1.
> +/*
> + * ESTIMATE_NODE_SIZE - Estimate the size required for node type in shared
> + * memory.
> + */
> +#define ESTIMATE_NODE_SIZE(list, listStr, strsize) \
> +{ \
> +   uint32 estsize = sizeof(uint32); \
> +   if ((List *)list != NIL) \
> +   { \
> +       listStr = nodeToString(list); \
> +       estsize += strlen(listStr) + 1; \
> +   } \
> + \
> +   strsize = add_size(strsize, estsize); \
> +}
>
> This can probably be a function instead of a macro.
>

Changed it to a function.

> 2.
> +/*
> + * ESTIMATE_1BYTE_STR_SIZE - Estimate the size required for 1Byte strings in
> + * shared memory.
> + */
> +#define ESTIMATE_1BYTE_STR_SIZE(src, strsize) \
> +{ \
> +   strsize = add_size(strsize, sizeof(uint8)); \
> +   strsize = add_size(strsize, (src) ? 1 : 0); \
> +}
>
> This could be an inline function.
>

Changed it to an inline function.

> 3.
> +/*
> + * SERIALIZE_1BYTE_STR - Copy 1Byte strings to shared memory.
> + */
> +#define SERIALIZE_1BYTE_STR(dest, src, copiedsize) \
> +{ \
> +   uint8 len = (src) ? 1 : 0; \
> +   memcpy(dest + copiedsize, (uint8 *) &len, sizeof(uint8)); \
> +   copiedsize += sizeof(uint8); \
> +   if (src) \
> +       dest[copiedsize++] = src[0]; \
> +}
>
> Similarly, this could be a function. I think keeping such things as
> macros in-between code makes it difficult to read. Please see if you
> can make these and similar macros as functions unless they are doing
> few memory instructions. Having functions makes it easier to debug the
> code as well.
>

Changed it to a function.

Attached v10 patch has the fixes for the same.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Attachment
- v10-0001-Copy-code-readjustment-to-support-parallel-copy.patch
- v10-0002-Check-if-parallel-copy-can-be-performed.patch
- v10-0003-Allow-copy-from-command-to-process-data-from-fil.patch
- v10-0004-Documentation-for-parallel-copy.patch
- v10-0005-Parallel-Copy-For-Binary-Format-Files.patch
- v10-0006-Tests-for-parallel-copy.patch
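For example, the ESTIMATE_NODE_SIZE macro quoted above could become a
function of roughly this shape (a guess at the v10 form, not a quote
from it; the caller shown is hypothetical):

/* Estimate the size required to serialize a node list into shared memory. */
static uint32
EstimateNodeSize(List *list, char **listStr)
{
    uint32      estsize = sizeof(uint32);

    if (list != NIL)
    {
        *listStr = nodeToString(list);
        estsize += strlen(*listStr) + 1;
    }
    return estsize;
}

/* Hypothetical call site, replacing the old macro invocation: */
strsize = add_size(strsize, EstimateNodeSize(cstate->attnumlist, &attnumlist_str));

Unlike the macro, the function has an explicit output parameter for the
serialized string instead of assigning to a captured name, which makes
the data flow visible at the call site and lets a debugger step into it.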
On Thu, Oct 29, 2020 at 2:20 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 27/10/2020 15:36, vignesh C wrote:
> > Attached v9 patches have the fixes for the above comments.
>
> I did some testing:
>
> /tmp/longdata.pl:
> --------
> #!/usr/bin/perl
> #
> # Generate three rows:
> # foo
> # longdatalongdatalongdata...
> # bar
> #
> # The length of the middle row is given as command line arg.
> #
>
> my $bytes = $ARGV[0];
>
> print "foo\n";
> for(my $i = 0; $i < $bytes; $i+=8){
> print "longdata";
> }
> print "\n";
> print "bar\n";
> --------
>
> postgres=# copy longdata from program 'perl /tmp/longdata.pl 100000000'
> with (parallel 2);
>
> This gets stuck forever (or at least I didn't have the patience to wait
> it finish). Both worker processes are consuming 100% of CPU.
>
Thanks for identifying this issue; it is fixed in the v10 patch posted at [1]
[1] https://www.postgresql.org/message-id/CALDaNm05FnA-ePvYV_t2%2BWE_tXJymbfPwnm%2Bkc9y1iMkR%2BNbUg%40mail.gmail.com
On Wed, Oct 28, 2020 at 5:36 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote:
>
> Hi
>
> I found some issue in v9-0002
>
> 1.
> +
> + elog(DEBUG1, "[Worker] Processing - line position:%d, block:%d, unprocessed lines:%d, offset:%d, line size:%d",
> + write_pos, lineInfo->first_block,
> + pg_atomic_read_u32(&data_blk_ptr->unprocessed_line_parts),
> + offset, pg_atomic_read_u32(&lineInfo->line_size));
> +
>
> write_pos and the other variables printed here are of type uint32; I think it's better to use '%u' in the elog msg.
>
Modified it.
> 2.
> + * line_size will be set. Read the line_size again to be sure if it is
> + * completed or partial block.
> + */
> + dataSize = pg_atomic_read_u32(&lineInfo->line_size);
> + if (dataSize)
>
> It uses dataSize (type int) to receive a uint32, which seems a little dangerous.
> Is it better to define dataSize as uint32 here?
>
Modified it.
> 3.
> Since functions with 'Cstate' in the name have been changed to 'CState',
> I think we can change the function PopulateCommonCstateInfo as well.
>
Modified it.
> 4.
> + if (pcdata->worker_line_buf_count)
>
> I think some check like the above can be 'if (xxx > 0)', which seems easier to understand.
Modified it.
Thanks for the comments; these issues are fixed in the v10 patch posted at [1]
[1] https://www.postgresql.org/message-id/CALDaNm05FnA-ePvYV_t2%2BWE_tXJymbfPwnm%2Bkc9y1iMkR%2BNbUg%40mail.gmail.com
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Thu, Oct 29, 2020 at 2:26 PM Daniel Westermann (DWE) <daniel.westermann@dbi-services.com> wrote:
>
> On 27/10/2020 15:36, vignesh C wrote:
> >> Attached v9 patches have the fixes for the above comments.
>
> >I did some testing:
>
> I did some testing as well and have a cosmetic remark:
>
> postgres=# copy t1 from '/var/tmp/aa.txt' with (parallel 1000000000);
> ERROR:  value 1000000000 out of bounds for option "parallel"
> DETAIL:  Valid values are between "1" and "1024".
> postgres=# copy t1 from '/var/tmp/aa.txt' with (parallel 100000000000);
> ERROR:  parallel requires an integer value
> postgres=#
>
> Wouldn't it make more sense to only have one error message? The first
> one seems to be the better message.
>

I had seen similar behavior in other places too:

postgres=# vacuum (parallel 1000000000) t1;
ERROR:  parallel vacuum degree must be between 0 and 1024
LINE 1: vacuum (parallel 1000000000) t1;
                ^
postgres=# vacuum (parallel 100000000000) t1;
ERROR:  parallel requires an integer value

I'm not sure if we should fix this.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Fri, Nov 13, 2020 at 2:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Nov 11, 2020 at 10:42 PM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Tue, Nov 10, 2020 at 7:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > How did you ensure that this is sufficient? For the parallel-insert
> > > patch we have enabled parallel-mode for Inserts and ran the tests
> > > with force_parallel_mode to see if we are not missing anything.
> > > Also, it seems there are many common things here w.r.t. the
> > > parallel-insert patch; is it possible to prepare this atop that
> > > patch, or do you have any reason to keep this separate?
> > >
> >
> > I have done similar testing for copy too: I set force_parallel_mode
> > to regress, hardcoded the code to pick parallel workers for the copy
> > operation, and ran make installcheck-world to verify. Many checks in
> > this patch are common between both patches, but I was not sure how
> > to handle it as both the projects are in progress and are being
> > updated based on the reviewers' opinions. How to handle this?
> > Thoughts?
> >
>
> I have not studied the differences in detail, but if it is possible to
> prepare it on top of that patch then there shouldn't be a problem. To
> avoid confusion, if you want, you can always either post the latest
> version of that patch with your patch or point to it.
>

I have made this a separate patch as of now. I will work on seeing if I
can use Greg's changes as they are; if required, I will provide a few
review comments on top of Greg's patch so that it is usable for parallel
copy too, and later post a separate patch with the changes on top of it.
I will retain it as a separate patch till that time.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Sat, Oct 31, 2020 at 2:07 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> Hi,
>
> I've done a bit more testing today, and I think the parsing is busted in
> some way. Consider this:
>
> test=# create extension random;
> CREATE EXTENSION
>
> test=# create table t (a text);
> CREATE TABLE
>
> test=# insert into t select random_string(random_int(10, 256*1024)) from generate_series(1,10000);
> INSERT 0 10000
>
> test=# copy t to '/mnt/data/t.csv';
> COPY 10000
>
> test=# truncate t;
> TRUNCATE TABLE
>
> test=# copy t from '/mnt/data/t.csv';
> COPY 10000
>
> test=# truncate t;
> TRUNCATE TABLE
>
> test=# copy t from '/mnt/data/t.csv' with (parallel 2);
> ERROR: invalid byte sequence for encoding "UTF8": 0x00
> CONTEXT: COPY t, line 485: "m&\nh%_a"%r]>qtCl:Q5ltvF~;2oS6@HB>F>og,bD$Lw'nZY\tYl#BH\t{(j~ryoZ08"SGU~.}8CcTRk1\ts$@U3szCC+U1U3i@P..."
> parallel worker
>
>
> The functions come from an extension I use to generate random data, I've
> pushed it to github [1]. The random_string() generates a random string
> with ASCII characters, symbols and a couple special characters (\r\n\t).
> The intent was to try loading data where a fields may span multiple 64kB
> blocks and may contain newlines etc.
>
> The non-parallel copy works fine, the parallel one fails. I haven't
> investigated the details, but I guess it gets confused about where a
> string starts/end, or something like that.
>
Thanks for identifying this issue; it is fixed in the v10 patch posted at [1]
[1] https://www.postgresql.org/message-id/CALDaNm05FnA-ePvYV_t2%2BWE_tXJymbfPwnm%2Bkc9y1iMkR%2BNbUg%40mail.gmail.com
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Sat, Nov 7, 2020 at 7:01 PM vignesh C <vignesh21@gmail.com> wrote:
>
> On Thu, Nov 5, 2020 at 6:33 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote:
> >
> > Hi
> >
> > >
> > > my $bytes = $ARGV[0];
> > > for(my $i = 0; $i < $bytes; $i+=8){
> > > print "longdata";
> > > }
> > > print "\n";
> > > --------
> > >
> > > postgres=# copy longdata from program 'perl /tmp/longdata.pl 100000000'
> > > with (parallel 2);
> > >
> > > This gets stuck forever (or at least I didn't have the patience to wait
> > > it finish). Both worker processes are consuming 100% of CPU.
> >
> > I had a look over this problem.
> >
> > the ParallelCopyDataBlock has size limit:
> > uint8 skip_bytes;
> > char data[DATA_BLOCK_SIZE]; /* data read from file */
> >
> > It seems the input line is so long that the leader process runs out of the shared memory among parallel copy workers,
> > and the leader process keeps waiting for a free block.
> >
> > The worker process waits until line_state becomes LINE_LEADER_POPULATED,
> > but the leader process won't set the line_state unless it has read the whole line.
> >
> > So it is stuck forever.
> > Maybe we should reconsider this situation.
> >
> > The stack is as follows:
> >
> > Leader stack:
> > #3 0x000000000075f7a1 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents@entry=41, timeout=timeout@entry=1, wait_event_info=wait_event_info@entry=150994945) at latch.c:411
> > #4 0x00000000005a9245 in WaitGetFreeCopyBlock (pcshared_info=pcshared_info@entry=0x7f26d2ed3580) at copyparallel.c:1546
> > #5 0x00000000005a98ce in SetRawBufForLoad (cstate=cstate@entry=0x2978a88, line_size=67108864, copy_buf_len=copy_buf_len@entry=65536, raw_buf_ptr=raw_buf_ptr@entry=65536,
> > copy_raw_buf=copy_raw_buf@entry=0x7fff4cdc0e18) at copyparallel.c:1572
> > #6 0x00000000005a1963 in CopyReadLineText (cstate=cstate@entry=0x2978a88) at copy.c:4058
> > #7 0x00000000005a4e76 in CopyReadLine (cstate=cstate@entry=0x2978a88) at copy.c:3863
> >
> > Worker stack:
> > #0 GetLinePosition (cstate=cstate@entry=0x29e1f28) at copyparallel.c:1474
> > #1 0x00000000005a8aa4 in CacheLineInfo (cstate=cstate@entry=0x29e1f28, buff_count=buff_count@entry=0) at copyparallel.c:711
> > #2 0x00000000005a8e46 in GetWorkerLine (cstate=cstate@entry=0x29e1f28) at copyparallel.c:885
> > #3 0x00000000005a4f2e in NextCopyFromRawFields (cstate=cstate@entry=0x29e1f28, fields=fields@entry=0x7fff4cdc0b48, nfields=nfields@entry=0x7fff4cdc0b44) at copy.c:3615
> > #4 0x00000000005a50af in NextCopyFrom (cstate=cstate@entry=0x29e1f28, econtext=econtext@entry=0x2a358d8, values=0x2a42068, nulls=0x2a42070) at copy.c:3696
> > #5 0x00000000005a5b90 in CopyFrom (cstate=cstate@entry=0x29e1f28) at copy.c:2985
> >
>
> Thanks for providing your thoughts. I have analyzed this issue and am
> working on a fix; I will post a patch shortly.
>
I have fixed this and provided a patch at [1].
[1] https://www.postgresql.org/message-id/CALDaNm05FnA-ePvYV_t2%2BWE_tXJymbfPwnm%2Bkc9y1iMkR%2BNbUg%40mail.gmail.com
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
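To make the failure mode concrete: the quoted struct bounds each shared
block at DATA_BLOCK_SIZE bytes, and the shared area holds a fixed pool
of such blocks. A sketch of the arithmetic (the pool size N and the
struct layout beyond the quoted fields are assumptions for
illustration):

    typedef struct ParallelCopyDataBlock
    {
        uint8 skip_bytes;
        char  data[DATA_BLOCK_SIZE];    /* data read from file */
    } ParallelCopyDataBlock;

    /*
     * With N blocks in the shared pool, a single input line longer than
     * N * DATA_BLOCK_SIZE fills every block before its terminator is
     * seen.  The leader then waits in WaitGetFreeCopyBlock() for a
     * block to be freed, while every worker waits for line_state to
     * become LINE_LEADER_POPULATED, which the leader only sets once the
     * whole line is in shared memory: neither side can proceed.
     */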
Hi Vignesh,

I took a look at the v10 patch set. Here are some comments:

1.
+/*
+ * CheckExprParallelSafety
+ *
+ * Determine if where cluase and default expressions are parallel safe & do not
+ * have volatile expressions, return true if condition satisfies else return
+ * false.
+ */

'cluase' seems a typo.

2.
+       /*
+        * Make sure that no worker has consumed this element, if this
+        * line is spread across multiple data blocks, worker would have
+        * started processing, no need to change the state to
+        * LINE_LEADER_POPULATING in this case.
+        */
+       (void) pg_atomic_compare_exchange_u32(&lineInfo->line_state,
+                                             &current_line_state,
+                                             LINE_LEADER_POPULATED);

About the comment:

+        * started processing, no need to change the state to
+        * LINE_LEADER_POPULATING in this case.

Does it mean no need to change the state to LINE_LEADER_POPULATED here?

3.
+ * 3) only one worker should choose one line for processing, this is handled by
+ *    using pg_atomic_compare_exchange_u32, worker will change the state to
+ *    LINE_WORKER_PROCESSING only if line_state is LINE_LEADER_POPULATED.

In the latest patch, it will set the state to LINE_WORKER_PROCESSING if
line_state is LINE_LEADER_POPULATED or LINE_LEADER_POPULATING, so the
comment here seems wrong.

4.
A suggestion for CacheLineInfo.

It uses appendBinaryStringXXX to store the line in memory.
appendBinaryStringXXX doubles the str memory when there is not enough space.

How about calling enlargeStringInfo in advance, if we already know the
whole line size? It can avoid some memory waste and may improve
performance a little.

Best regards,
houzj
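As context for comments 2 and 3: the pattern under discussion is a
compare-and-swap handoff between leader and workers. A minimal sketch
of the worker side (pg_atomic_compare_exchange_u32 is the real
PostgreSQL API; the surrounding names follow the quoted hunks, and
ProcessLine is a hypothetical placeholder):

    uint32 expected = LINE_LEADER_POPULATED;

    /*
     * Exactly one process can move the line from LINE_LEADER_POPULATED
     * to LINE_WORKER_PROCESSING: the CAS swaps only if line_state still
     * holds the expected value, and returns true on success.
     */
    if (pg_atomic_compare_exchange_u32(&lineInfo->line_state,
                                       &expected,
                                       LINE_WORKER_PROCESSING))
        ProcessLine(lineInfo);  /* hypothetical: this worker owns the line */
    /* On failure, 'expected' holds whatever state another process set. */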
Thanks for the comments.

> 1.
> +/*
> + * CheckExprParallelSafety
> + *
> + * Determine if where cluase and default expressions are parallel safe & do not
> + * have volatile expressions, return true if condition satisfies else return
> + * false.
> + */
>
> 'cluase' seems a typo.

Changed.

> 2.
> +       /*
> +        * Make sure that no worker has consumed this element, if this
> +        * line is spread across multiple data blocks, worker would have
> +        * started processing, no need to change the state to
> +        * LINE_LEADER_POPULATING in this case.
> +        */
> +       (void) pg_atomic_compare_exchange_u32(&lineInfo->line_state,
> +                                             &current_line_state,
> +                                             LINE_LEADER_POPULATED);
>
> About the comment:
>
> +        * started processing, no need to change the state to
> +        * LINE_LEADER_POPULATING in this case.
>
> Does it mean no need to change the state to LINE_LEADER_POPULATED here?

Yes, it is LINE_LEADER_POPULATED; changed accordingly.

> 3.
> + * 3) only one worker should choose one line for processing, this is handled by
> + *    using pg_atomic_compare_exchange_u32, worker will change the state to
> + *    LINE_WORKER_PROCESSING only if line_state is LINE_LEADER_POPULATED.
>
> In the latest patch, it will set the state to LINE_WORKER_PROCESSING if
> line_state is LINE_LEADER_POPULATED or LINE_LEADER_POPULATING, so the
> comment here seems wrong.

Updated the comments.

> 4.
> A suggestion for CacheLineInfo.
>
> It uses appendBinaryStringXXX to store the line in memory.
> appendBinaryStringXXX doubles the str memory when there is not enough space.
>
> How about calling enlargeStringInfo in advance, if we already know the
> whole line size? It can avoid some memory waste and may improve
> performance a little.

Here we will not know the size beforehand; in some cases we start
processing the data as soon as the current block is populated and keep
processing block by block, so we only come to know the size at the end.
We cannot use enlargeStringInfo because of this.

The attached v11 patch has the fix for this; it also includes the
changes to rebase on top of head.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Attachment
- v11-0001-Copy-code-readjustment-to-support-parallel-copy.patch
- v11-0002-Check-if-parallel-copy-can-be-performed.patch
- v11-0003-Allow-copy-from-command-to-process-data-from-fil.patch
- v11-0004-Documentation-for-parallel-copy.patch
- v11-0005-Parallel-Copy-For-Binary-Format-Files.patch
- v11-0006-Tests-for-parallel-copy.patch
> > 4.
> > A suggestion for CacheLineInfo.
> >
> > It uses appendBinaryStringXXX to store the line in memory.
> > appendBinaryStringXXX doubles the str memory when there is not enough space.
> >
> > How about calling enlargeStringInfo in advance, if we already know the
> > whole line size? It can avoid some memory waste and may improve
> > performance a little.
>
> Here we will not know the size beforehand; in some cases we start
> processing the data as soon as the current block is populated and keep
> processing block by block, so we only come to know the size at the end.
> We cannot use enlargeStringInfo because of this.
>
> The attached v11 patch has the fix for this; it also includes the
> changes to rebase on top of head.

Thanks for the explanation.

I think there is still a chance we can know the size.

+    * line_size will be set. Read the line_size again to be sure if it is
+    * completed or partial block.
+    */
+   dataSize = pg_atomic_read_u32(&lineInfo->line_size);
+   if (dataSize != -1)
+   {

If I am not wrong, this seems to be the branch that processes the
populated block. I think we can check copiedSize here: if copiedSize == 0,
that means dataSize is the size of the whole line, and in this case we
can do the enlarge.

Best regards,
houzj
On Mon, Dec 7, 2020 at 3:00 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote:
>
> > The attached v11 patch has the fix for this; it also includes the
> > changes to rebase on top of head.
>
> Thanks for the explanation.
>
> I think there is still a chance we can know the size.
>
> +    * line_size will be set. Read the line_size again to be sure if it is
> +    * completed or partial block.
> +    */
> +   dataSize = pg_atomic_read_u32(&lineInfo->line_size);
> +   if (dataSize != -1)
> +   {
>
> If I am not wrong, this seems to be the branch that processes the
> populated block. I think we can check copiedSize here: if copiedSize == 0,
> that means dataSize is the size of the whole line, and in this case we
> can do the enlarge.

Yes, this optimization can be done; I will handle it in the next patch set.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
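A sketch of houzj's suggestion in place (variable names follow the
quoted hunk; the surrounding control flow is assumed for illustration,
while enlargeStringInfo and appendBinaryStringInfo are the real
stringinfo APIs):

    dataSize = pg_atomic_read_u32(&lineInfo->line_size);
    if (dataSize != -1)
    {
        /*
         * copiedSize == 0 means nothing of this line has been copied
         * yet, so dataSize is the full line length: reserve it once up
         * front rather than letting appendBinaryStringInfo double the
         * buffer repeatedly as the line is appended block by block.
         */
        if (copiedSize == 0)
            enlargeStringInfo(&cstate->line_buf, dataSize);

        /* ... copy the line data as before ... */
    }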
Hi

> Yes, this optimization can be done; I will handle it in the next
> patch set.

I have a suggestion for the parallel safety checks.

As designed, the leader does not participate in the insertion of data.
If the user uses (PARALLEL 1), there is only one worker process, which
will do the insertion.

IMO, we can skip some of the safety checks in this case, because the
safety checks are there to limit parallel insert (except temporary
tables or ...).

So, how about checking (PARALLEL 1) separately? Although it looks a
bit complicated, (PARALLEL 1) does give a good performance improvement.

Best regards,
houzj
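A minimal sketch of what this split could look like (nworkers and both
check functions are hypothetical names; which checks can actually be
relaxed is the open question, not something this sketch decides):

    if (nworkers == 1)
    {
        /*
         * A single worker means no concurrent inserters, so checks that
         * exist only to guard concurrent insertion could be skipped.
         * Checks that apply to running in any background worker at all
         * (e.g. temporary tables, which are invisible outside the
         * owning backend) must still run.
         */
        safe = IsWorkerSafe(rel);           /* hypothetical, reduced check */
    }
    else
        safe = IsParallelCopySafe(rel);     /* hypothetical, full check */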
On Wed, Dec 23, 2020 at 3:05 PM Hou, Zhijie <houzj.fnst@cn.fujitsu.com> wrote:
>
> Hi
>
> > Yes, this optimization can be done; I will handle it in the next
> > patch set.
>
> I have a suggestion for the parallel safety checks.
>
> As designed, the leader does not participate in the insertion of data.
> If the user uses (PARALLEL 1), there is only one worker process, which
> will do the insertion.
>
> IMO, we can skip some of the safety checks in this case, because the
> safety checks are there to limit parallel insert (except temporary
> tables or ...).
>
> So, how about checking (PARALLEL 1) separately? Although it looks a
> bit complicated, (PARALLEL 1) does give a good performance improvement.

Thanks for the comments, Hou Zhijie. I will run a few tests with 1
worker and try to include this in the next patch set.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Tue, Nov 3, 2020 at 2:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Nov 2, 2020 at 12:40 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> >
> > On 02/11/2020 08:14, Amit Kapila wrote:
> > > On Fri, Oct 30, 2020 at 10:11 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > >>
> > >> In this design, you don't need to keep line boundaries in shared memory,
> > >> because each worker process is responsible for finding the line
> > >> boundaries of its own block.
> > >>
> > >> There's a point of serialization here, in that the next block cannot be
> > >> processed until the worker working on the previous block has finished
> > >> scanning the EOLs and set the starting position on the next block,
> > >> putting it in READY state. That's not very different from your patch,
> > >> where you had a similar point of serialization because the leader
> > >> scanned the EOLs,
> > >
> > > But in the design (single producer, multiple consumer) used by the
> > > patch, a worker doesn't need to wait till the complete block is
> > > processed; it can start processing the lines already found. This will
> > > also allow workers to start processing the data much earlier, as they
> > > don't need to wait for all the offsets corresponding to the 64K block
> > > to be ready. However, the design where each worker processes a 64K
> > > block can lead to much longer waits. I think this will impact the
> > > Copy STDIN case more, where in most cases (200-300 byte tuples) we
> > > receive data line by line from the client and the leader finds the
> > > line endings. If the leader doesn't find the line endings, the
> > > workers need to wait till the leader fills the entire 64K chunk;
> > > OTOH, with the current approach a worker can start as soon as the
> > > leader has populated some minimum number of line endings.
> >
> > You can use a smaller block size.
>
> Sure, but the same problem can happen if the last line in that block
> is too long and we need to peek into the next block. And then there
> could be cases where a single line could be greater than 64K.
>
> > However, the point of parallel copy is
> > to maximize bandwidth.
>
> Okay, but this first phase (finding the line boundaries) cannot be
> done in parallel anyway, and we have seen in some of the initial
> benchmarking that this initial phase is a small part of the work,
> especially when the table has indexes, constraints, etc. So, I think
> it won't matter much whether this splitting is done in a single
> process or in multiple processes.

I wrote a patch to compare the performance of the current
implementation (the leader identifying the line boundaries) against a
design where the workers identify the line boundaries. The results, as
parallel copy times in seconds for each design at a given worker
count, are given below.
Use case 1 - 10 million rows, 5.2GB data, 3 indexes on integer columns:

  workers | leader identifies (s) | workers identify (s)
  --------+-----------------------+---------------------
        1 |               211.206 |              632.583
        2 |               165.402 |              360.152
        4 |               137.608 |              219.623
        8 |               128.003 |              206.851
       16 |               114.518 |              177.790
       20 |               109.257 |              170.058
       30 |               102.050 |              158.376

Use case 2 - 10 million rows, 5.2GB data, 2 indexes on integer columns,
1 index on a text column, csv file:

  workers | leader identifies (s) | workers identify (s)
  --------+-----------------------+---------------------
        1 |              1212.356 |             1602.118
        2 |               707.191 |              849.105
        4 |               369.620 |              441.068
        8 |               221.359 |              252.775
       16 |               167.152 |              180.207
       20 |               168.804 |              181.986
       30 |               172.320 |              194.875

Use case 3 - 10 million rows, 5.2GB data, without index:

  workers | leader identifies (s) | workers identify (s)
  --------+-----------------------+---------------------
        1 |                96.317 |              437.453
        2 |                70.730 |              240.517
        4 |                64.436 |              197.604
        8 |                67.186 |              175.630
       16 |                76.561 |              156.015
       20 |                81.025 |              150.687
       30 |                86.578 |              148.481

Use case 4 - 10000 records, 9.6GB, toast data:

  workers | leader identifies (s) | workers identify (s)
  --------+-----------------------+---------------------
        1 |               147.076 |              276.323
        2 |               101.610 |              141.893
        4 |               100.703 |              134.096
        8 |               112.583 |              134.765
       16 |               101.898 |              135.789
       20 |               109.258 |              135.625
       30 |               109.219 |              136.144

Attached is the patch that was used for this comparison; it is written
on top of the parallel copy patch. The design that Amit, Andres and I
voted for (the leader identifying the line boundaries and sharing them
in shared memory) performs better in every case above.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Mon, Dec 28, 2020 at 3:14 PM vignesh C <vignesh21@gmail.com> wrote:
>
> Attached is the patch that was used for this comparison; it is written
> on top of the parallel copy patch. The design that Amit, Andres and I
> voted for (the leader identifying the line boundaries and sharing them
> in shared memory) performs better in every case above.

Hi Hackers,

I see the following as some of the open problems with the parallel copy
feature:

1) The leader identifying the line/tuple boundaries from the file and
letting the workers pick and insert in parallel, vs. the leader reading
the file and letting the workers identify line/tuple boundaries and
insert.
2) Determining the parallel safety of partitioned tables.
3) Bulk extension of the relation while inserting, i.e. adding more
than one extra block to the relation in RelationAddExtraBlocks.

Please let me know if I'm missing anything.

For (1) - Vignesh's experiments above show that the leader identifying
the line/tuple boundaries from the file and letting the workers pick
and insert in parallel fares better.

For (2) - while it's being discussed in another thread (I'm not sure
what the status of that thread is), how about taking this feature
without support for partitioned tables, i.e. parallel copy is disabled
for partitioned tables? Once the other discussion reaches a logical
end, we can come back and enable parallel copy for partitioned tables.

For (3) - we need a way to extend or add new blocks quickly; fallocate
might help here. I'm not sure who's working on it; others can comment
better here.

Can we take the "parallel copy" feature forward, of course with some
restrictions in place? Thoughts?

Regards,
Bharath Rupireddy.
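For (3), a minimal sketch of the fallocate idea at the file level (an
illustration of the concept, not PostgreSQL's actual smgr code; 8192
is the default BLCKSZ, and the function name is hypothetical):

    #include <fcntl.h>

    /*
     * Grow an open relation segment by nblocks zero-filled blocks in
     * one filesystem call, instead of extending one 8kB page at a
     * time.  Returns 0 on success, an errno value on failure.
     */
    static int
    bulk_extend_file(int fd, off_t current_size, unsigned nblocks)
    {
        return posix_fallocate(fd, current_size, (off_t) nblocks * 8192);
    }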