Re: [HACKERS] Parallel Append implementation - Mailing list pgsql-hackers

From: Amit Khandekar
Subject: Re: [HACKERS] Parallel Append implementation
Msg-id: CAJ3gD9cohvXTUCBeyMVu+1JtEt3L6aUSGRWFXU2vsuzniaZwfg@mail.gmail.com
In response to: Re: [HACKERS] Parallel Append implementation (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: [HACKERS] Parallel Append implementation (Rafia Sabih <rafia.sabih@enterprisedb.com>)
List: pgsql-hackers
Thanks a lot, Robert, for the patch. I will have a look. I quickly
tried some aggregate queries with a partitioned pgbench_accounts
table, and they crash. I will get back with the fix, along with any
other review comments.

Thanks
-Amit Khandekar

On 9 November 2017 at 23:44, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Oct 28, 2017 at 5:50 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> No, because the Append node is *NOT* getting copied into shared
>> memory.  I have pushed a comment update to the existing functions; you
>> can use the same comment for this patch.
>
> I spent the last several days working on this patch, which had a
> number of problems both cosmetic and functional.  I think the attached
> is in better shape now, but it could certainly use some more review
> and testing since I only just finished modifying it, and I modified it
> pretty heavily.  Changes:
>
> - I fixed the "morerows" entries in the documentation.  If you had
> built the documentation the way you had it and loaded up in a web
> browser, you would have seen that the way you had it was not correct.
>
> - I moved T_AppendState to a different position in the switch inside
> ExecParallelReInitializeDSM, so as to keep that switch in the same
> order as all of the other switch statements in that file.
>
> - I rewrote the comment for pa_finished.  It previously began with
> "workers currently executing the subplan", which is not an accurate
> description. I suspect this was a holdover from a previous version of
> the patch in which this was an array of integers rather than an array
> of type bool.  I also fixed the comment in ExecAppendEstimate and
> added, removed, or rewrote various other comments as well.
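>
> For illustration, the shared state ends up roughly this shape
> (simplified sketch, not the exact code):
>
>     typedef struct ParallelAppendState
>     {
>         LWLock  pa_lock;        /* serializes choice of the next subplan */
>         int     pa_next_plan;   /* next subplan to hand out */
>
>         /*
>          * pa_finished[i] means no more workers should pick subplan i:
>          * it is set as soon as one worker takes a non-partial subplan,
>          * but only on completion for a partial subplan.
>          */
>         bool    pa_finished[FLEXIBLE_ARRAY_MEMBER];
>     } ParallelAppendState;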
>
> - I renamed PA_INVALID_PLAN to INVALID_SUBPLAN_INDEX, which I think is
> more clear and allows for the possibility that this sentinel value
> might someday be used for non-parallel-aware Append plans.
>
> - I largely rewrote the code for picking the next subplan.  A
> superficial problem with the way that you had it is that you had
> renamed exec_append_initialize_next to exec_append_seq_next but not
> updated the header comment to match.  Also, the logic was spread out
> all over the file.  There are three cases: not parallel aware, leader,
> worker.  You had the code for the first case at the top of the file
> and the other two cases at the bottom of the file and used multiple
> "if" statements to pick the right one in each case.  I replaced all
> that with a function pointer stored in the AppendState, moved the code
> so it's all together, and rewrote it in a way that I find easier to
> understand.  I also changed the naming convention.
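>
> Sketched out, the dispatch looks something like this (simplified;
> the real functions do more):
>
>     /* One callback per case, installed once at node initialization. */
>     struct AppendState
>     {
>         ...
>         bool    (*choose_next_subplan) (AppendState *node);
>     };
>
>     static bool choose_next_subplan_locally(AppendState *node);
>     static bool choose_next_subplan_for_leader(AppendState *node);
>     static bool choose_next_subplan_for_worker(AppendState *node);
>
>     /* ExecAppend's loop then needs no mode tests at all: */
>     if (!node->choose_next_subplan(node))
>         return ExecClearTuple(node->ps.ps_ResultTupleSlot);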
>
> - I renamed pappend_len to pstate_len and ParallelAppendDescData to
> ParallelAppendState.  I think the use of the word "descriptor" is a
> carryover from the concept of a scan descriptor.  There's nothing
> really wrong with inventing the concept of an "append descriptor", but
> it seems more clear to just refer to shared state.
>
> - I fixed ExecAppendReInitializeDSM not to reset node->as_whichplan.
> Per commit 41b0dd987d44089dc48e9c70024277e253b396b7, that's wrong;
> instead, local state should be reset in ExecReScanAppend.  I installed
> what I believe to be the correct logic in that function instead.
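>
> In other words, roughly (sketch):
>
>     void
>     ExecReScanAppend(AppendState *node)
>     {
>         ...
>         /* local state: force a fresh choice of subplan next time */
>         node->as_whichplan = INVALID_SUBPLAN_INDEX;
>     }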
>
> - I fixed list_qsort() so that it copies the type of the old list into
> the new list.  Otherwise, sorting a list of type T_IntList or
> T_OidList would turn it into just plain T_List, which is wrong.
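>
> The fix amounts to one extra line when the sorted copy is built,
> roughly:
>
>     static List *
>     list_qsort(const List *list, list_qsort_comparator cmp)
>     {
>         ...
>         newlist = (List *) palloc(sizeof(List));
>         newlist->type = list->type;   /* keep T_IntList/T_OidList */
>         ...
>     }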
>
> - I removed get_append_num_workers and integrated the logic into the
> callers.  This function was coded quite strangely: it assigned the
> return value of fls() to a double and then eventually rounded the
> result back to an integer.  But fls() returns an integer, so this
> doesn't make much sense.  On a related note, I made it use fls(# of
> subpaths) instead of fls(# of subpaths)+1.  Adding 1 doesn't make
> sense to me here because it leads to a decision to use 2 workers for a
> single, non-partial subpath.  I suspect both of these mistakes stem
> from thinking that fls() returns the base-2 logarithm, but in fact it
> doesn't, quite: log2(1) = 0.0 but fls(1) = 1.
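>
> A standalone demonstration of the off-by-one, using a portable
> stand-in for fls() since not every libc provides it:
>
>     #include <stdio.h>
>
>     /* fls(): position of the most significant set bit, from 1 */
>     static int
>     my_fls(int mask)
>     {
>         int     bit = 0;
>
>         while (mask)
>         {
>             bit++;
>             mask >>= 1;
>         }
>         return bit;
>     }
>
>     int
>     main(void)
>     {
>         for (int n = 1; n <= 8; n++)
>             printf("subpaths=%d  fls=%d\n", n, my_fls(n));
>
>         /*
>          * Prints fls(1)=1, fls(2)=2, fls(3)=2, fls(4)=3, ... so
>          * fls(n) alone already yields one worker for one subpath,
>          * and fls(n)+1 would ask for two.
>          */
>         return 0;
>     }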
>
> - In the process of making the changes described in the previous
> point, I added a couple of assertions, one of which promptly failed.
> It turns out the reason is that your patch didn't update
> accumulate_append_subpaths(), which can result in flattening
> non-partial paths from a Parallel Append into a parent Append's list
> of partial paths, which is bad.  The easiest way to fix that would be
> to just teach accumulate_append_subpaths() not to flatten a Parallel
> Append into a parent Append or MergeAppend node, but it seemed to me
> that there was a fair amount of duplication between
> accumulate_partialappend_subpath() and accumulate_append_subpaths(), so
> what I did instead was to fold all of the necessary logic directly
> into accumulate_append_subpaths().  This approach also avoids some
> assumptions that your code made, such as the assumption that we will
> never have a partial MergeAppend path.
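>
> The rule the folded logic has to enforce is easy to sketch
> (illustrative, not the exact code):
>
>     if (IsA(path, AppendPath))
>     {
>         AppendPath *apath = (AppendPath *) path;
>
>         if (!apath->path.parallel_aware)
>         {
>             /* plain Append: safe to flatten its children wholesale */
>             *subpaths = list_concat(*subpaths, list_copy(apath->subpaths));
>             return;
>         }
>
>         /*
>          * Parallel Append: flattening could dump non-partial children
>          * into the parent's list of partial paths, so keep the whole
>          * path as an opaque child instead.
>          */
>     }
>     *subpaths = lappend(*subpaths, path);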
>
> - I changed create_append_path() so that it uses the parallel_aware
> argument as the only determinant of whether the output path is flagged
> as parallel-aware. Your version also considered whether
> parallel_workers > 0, but I think that's not a good idea; the caller
> should pass the correct value for parallel_aware rather than relying
> on this function to fix it.  Possibly you indirectly encountered the
> problem mentioned in the previous point and worked around it like
> this, or maybe there was some other reason, but it doesn't seem to be
> necessary.
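>
> That is, roughly:
>
>     /* in create_append_path(): take the caller's flag at face value */
>     pathnode->path.parallel_aware = parallel_aware;
>     pathnode->path.parallel_workers = parallel_workers;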
>
> - I changed things around to enforce the rule that all partial paths
> added to an appendrel must use the same row count estimate.  (This is
> not a new coding rule, but this patch provides a new way to violate
> it.) I did that by forcing any parallel append mixing partial and
> non-partial paths to use the same row count as the partial paths
> already added. I also changed the way the row count is calculated
> in the case where the only parallel append path mixes partial and
> non-partial plans; I think this way is more consistent with what we've
> done elsewhere.  This amounts to the assumption that we're trying to
> estimate the average number of rows per worker rather than the largest
> possible number; I'm not sure what the best thing to do here is in
> theory, but one advantage of this approach is that I think it will
> produce answers closer to the value we get for an all-partial-paths
> append.  That's good, because we don't want the row-count estimate to
> change precipitously based on whether an all-partial-paths append is
> possible.
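>
> As a toy illustration of the averaging (numbers invented, and the
> divisor is a stand-in patterned on costsize.c's
> get_parallel_divisor()):
>
>     #include <stdio.h>
>
>     int
>     main(void)
>     {
>         double  total_rows = 100000.0;  /* summed over all subpaths */
>         int     workers = 2;
>
>         /* the leader helps too, but less as worker count grows */
>         double  divisor = workers;
>         double  leader = 1.0 - 0.3 * workers;
>
>         if (leader > 0)
>             divisor += leader;
>
>         /* average rows per participant, comparable to the estimate
>          * an all-partial-paths append would produce */
>         printf("rows per worker: %.0f\n", total_rows / divisor);
>         return 0;
>     }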
>
> - I fixed some whitespace problems by running pgindent on various
> files and manually breaking some long lines.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

