Re: Parallel copy - Mailing list pgsql-hackers
From | vignesh C |
---|---|
Subject | Re: Parallel copy |
Date | |
Msg-id | CALDaNm074KxPB7PbCsE1PkaQPoJUeMtXyT5w0SPszUHmPjPxyQ@mail.gmail.com Whole thread Raw |
In response to | Re: Parallel copy (Heikki Linnakangas <hlinnaka@iki.fi>) |
List | pgsql-hackers |
Thanks Heikki for reviewing and providing your comments. Please find my thoughts below. On Fri, Oct 23, 2020 at 2:01 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote: > > I had a brief look at at this patch. Important work! A couple of first > impressions: > > 1. The split between patches > 0002-Framework-for-leader-worker-in-parallel-copy.patch and > 0003-Allow-copy-from-command-to-process-data-from-file.patch is quite > artificial. All the stuff introduced in the first is unused until the > second patch is applied. The first patch introduces a forward > declaration for ParallelCopyData(), but the function only comes in the > second patch. The comments in the first patch talk about > LINE_LEADER_POPULATING and LINE_LEADER_POPULATED, but the enum only > comes in the second patch. I think these have to merged into one. If you > want to split it somehow, I'd suggest having a separate patch just to > move CopyStateData from copy.c to copy.h. The subsequent patch would > then be easier to read as you could see more easily what's being added > to CopyStateData. Actually I think it would be better to have a new > header file, copy_internal.h, to hold CopyStateData and the other > structs, and keep copy.h as it is. > I have merged 0002 & 0003 patch, I have moved few things like creation of copy_internal.h, moving of CopyStateData from copy.c into copy_internal.h into 0001 patch. > 2. This desperately needs some kind of a high-level overview of how it > works. What is a leader, what is a worker? Which process does each step > of COPY processing, like reading from the file/socket, splitting the > input into lines, handling escapes, calling input functions, and > updating the heap and indexes? What data structures are used for the > communication? How does is the work synchronized between the processes? > There are comments on those individual aspects scattered in the patch, > but if you're not already familiar with it, you don't know where to > start. There's some of that in the commit message, but it needs to be > somewhere in the source code, maybe in a long comment at the top of > copyparallel.c. > Added it in copyparallel.c > 3. I'm surprised there's a separate ParallelCopyLineBoundary struct for > every input line. Doesn't that incur a lot of synchronization overhead? > I haven't done any testing, this is just my gut feeling, but I assumed > you'd work in batches of, say, 100 or 1000 lines each. > Data read from the file will be stored in DSM which is of size 64k * 1024. Leader will parse and identify the line boundary like which line starts from which data block, what is the starting offset in the data block, what is the line size, this information will be present in ParallelCopyLineBoundary. Like you said, each worker processes WORKER_CHUNK_COUNT 64 lines at a time. Performance test results run for parallel copy are available at [1]. This is addressed in v9 patch shared at [2]. [1] https://www.postgresql.org/message-id/CALj2ACWeQVd-xoQZHGT01_33St4xPoZQibWz46o7jW1PE3XOqQ%40mail.gmail.com [2] - https://www.postgresql.org/message-id/CALDaNm1cAONkFDN6K72DSiRpgqNGvwxQL7TjEiHZ58opnp9VoA@mail.gmail.com Regards, Vignesh EnterpriseDB: http://www.enterprisedb.com
pgsql-hackers by date: