Thread: Improve the performance of nested loop join in the case of partitioned inner table
Improve the performance of nested loop join in the case of partitioned inner table
From
Alexandr Nikulin
Date:
Hi, hackers
I propose to slightly improve the performance of nested loop join in the case of partitioned inner table.
As I see in the code, the backend looks for the partition of the inner table each time after fetch a new row from the outer table.
These searches can take a significant amount of time.
But we can skip this step if the nested loop parameter(s) was(re) not changed since the previous row fetched from the outer table
Attachment
Re: Improve the performance of nested loop join in the case of partitioned inner table
From
David Rowley
Date:
On Thu, 23 Mar 2023 at 19:46, Alexandr Nikulin <alexandr.s.nikulin@gmail.com> wrote: > I propose to slightly improve the performance of nested loop join in the case of partitioned inner table. > As I see in the code, the backend looks for the partition of the inner table each time after fetch a new row from the outertable. > These searches can take a significant amount of time. > But we can skip this step if the nested loop parameter(s) was(re) not changed since the previous row fetched from the outertable I think if we were to do something like that, then it should be done in nodeAppend.c and nodeMergeAppend.c. That way you can limit it to only checking parameters that partition pruning needs to care about. That does mean you'd need to find somewhere to cache the parameter values, however. Doing it in nodeNestloop.c means you're doing it when the inner subplan is something that does not suffer from the additional overhead you want to avoid, e.g an Index Scan. Also, generally, if you want to get anywhere with a performance patch, you should post performance results from before and after your change. Also include your benchmark setup and relevant settings for how you got those results. For this case, you'll want a best case (parameter value stays the same) and a worst case, where the parameter value changes on each outer row. I expect you're going to add overhead to this case as your additional checks will always detect the parameter has changed as that'll always require partition pruning to be executed again. We'll want to know if that overhead is high enough for us not to want to do this. I'll be interested to see a test that as some varlena parameter of say a few hundred bytes to see how much overhead testing if that parameter has changed when the pruning is being done on a HASH partitioned table. HASH partitioning should prune quite a bit faster than both LIST and RANGE as the hashing is effectively O(1) vs O(log2 N) (N being the number of Datums in the partition bounds). I'd expect a measurable additional overhead with the patch when the parameter changes on each outer row. David
Re: Improve the performance of nested loop join in the case of partitioned inner table
From
Alexandr Nikulin
Date:
The following tests demonstrate the speedup which may be achieved with my patch:
1. Add to postgresql.conf
enable_hashjoin = OFF
enable_mergejoin = OFF
enable_material = OFF
enable_bitmapscan = OFF
enable_nestloop = ON
max_parallel_workers_per_gather = 0
enable_memoize = OFF
enable_mergejoin = OFF
enable_material = OFF
enable_bitmapscan = OFF
enable_nestloop = ON
max_parallel_workers_per_gather = 0
enable_memoize = OFF
2. create test tables:
create table test_part(id int primary key) partition by range(id);
create table test_part0 partition of test_part for values from (0) to (1000000);
create table test_part1 partition of test_part for values from (1000000) to (2000000);
insert into test_part select id from generate_series(0, 1000000-1) as id;
create table ids(id int, name varchar); create index on ids(ascii(name));
insert into ids select id, 'worst case' as name from generate_series(0, 1000000-1) as id;
insert into ids select 123456, 'best case' as name from generate_series(0, 1000000-1) as id;
create table test_part0 partition of test_part for values from (0) to (1000000);
create table test_part1 partition of test_part for values from (1000000) to (2000000);
insert into test_part select id from generate_series(0, 1000000-1) as id;
create table ids(id int, name varchar); create index on ids(ascii(name));
insert into ids select id, 'worst case' as name from generate_series(0, 1000000-1) as id;
insert into ids select 123456, 'best case' as name from generate_series(0, 1000000-1) as id;
3. run tests:
explain analyze select * from ids join test_part on ids.id=test_part.id where ascii(ids.name)=ascii('best case');
explain analyze select * from ids join test_part on ids.id=test_part.id where ascii(ids.name)=ascii('worst case');
The average results on my machine are as follows:
| vanila postgres | patched postgres
best case | 2286 ms | 1924 ms
worst case | 2278 ms | 2360 ms
So far I haven't refactored the patch as per David's advice. I just want to understand if we need such an optimization?
чт, 23 мар. 2023 г. в 17:05, David Rowley <dgrowleyml@gmail.com>:
On Thu, 23 Mar 2023 at 19:46, Alexandr Nikulin
<alexandr.s.nikulin@gmail.com> wrote:
> I propose to slightly improve the performance of nested loop join in the case of partitioned inner table.
> As I see in the code, the backend looks for the partition of the inner table each time after fetch a new row from the outer table.
> These searches can take a significant amount of time.
> But we can skip this step if the nested loop parameter(s) was(re) not changed since the previous row fetched from the outer table
I think if we were to do something like that, then it should be done
in nodeAppend.c and nodeMergeAppend.c. That way you can limit it to
only checking parameters that partition pruning needs to care about.
That does mean you'd need to find somewhere to cache the parameter
values, however. Doing it in nodeNestloop.c means you're doing it when
the inner subplan is something that does not suffer from the
additional overhead you want to avoid, e.g an Index Scan.
Also, generally, if you want to get anywhere with a performance patch,
you should post performance results from before and after your change.
Also include your benchmark setup and relevant settings for how you
got those results. For this case, you'll want a best case (parameter
value stays the same) and a worst case, where the parameter value
changes on each outer row. I expect you're going to add overhead to
this case as your additional checks will always detect the parameter
has changed as that'll always require partition pruning to be executed
again. We'll want to know if that overhead is high enough for us not
to want to do this.
I'll be interested to see a test that as some varlena parameter of say
a few hundred bytes to see how much overhead testing if that parameter
has changed when the pruning is being done on a HASH partitioned
table. HASH partitioning should prune quite a bit faster than both
LIST and RANGE as the hashing is effectively O(1) vs O(log2 N) (N
being the number of Datums in the partition bounds). I'd expect a
measurable additional overhead with the patch when the parameter
changes on each outer row.
David
Re: Improve the performance of nested loop join in the case of partitioned inner table
From
David Rowley
Date:
On Thu, 13 Apr 2023 at 03:00, Alexandr Nikulin <alexandr.s.nikulin@gmail.com> wrote: > explain analyze select * from ids join test_part on ids.id=test_part.id where ascii(ids.name)=ascii('best case'); > explain analyze select * from ids join test_part on ids.id=test_part.id where ascii(ids.name)=ascii('worst case'); > > The average results on my machine are as follows: > > | vanila postgres | patched postgres > best case | 2286 ms | 1924 ms > worst case | 2278 ms | 2360 ms > > So far I haven't refactored the patch as per David's advice. I just want to understand if we need such an optimization? My thoughts are that the worst-case numbers are not exactly great. I very much imagine that with average cases, it's much more likely than not that the parameter values from the nested loop's next outer row will be different than it is that it'll be the same. Let's say, roughly your numbers show a 20% speedup for the best case, and a 4% slowdown for the worst case, for us to break even with this patch as it is, the parameter value would have to be the same around 1 out of 5 times. That does not seem like good odds to bet on given we're likely working with data types that allow billions of distinct values. I think if you really wanted to make this work, then you'd need to get the planner on board with making the decision on if this should be done or not based on the n_distinct estimates from the outer side of the join. Either that or some heuristic in the executor that tries for a while and gives up if the parameter value changes too often. Some code was added in 3592e0ff9 that uses a heuristics approach to solving this problem by only enabling the optimisation if we hit the same partition at least 16 times and switches it off again as soon as the datum no longer matches the cached partition. I'm not quite sure how the same could be made to work here as with 3592e0ff9. A tuple only belongs to a single partition and we can very cheaply check if this partition is the same as the last one by checking if the partition index matches. With this case, since we're running a query, many partitions can remain after partition pruning runs, and checking that some large number of partitions match some other large number of partitions is not going to be as cheap as checking just two partitions match. Bitmapsets can help here, but they'll just never be as fast as checking two ints match. In short, I think you're going to have to come up with something very crafty here to reduce the overhead worst case. Whatever it is will need to be neat and self-contained, perhaps in execPartition.c. It just does not seem good to have logic related to partition pruning inside nodeNestloop.c. I'm going to mark this as waiting on author in the CF app. It might be better if you withdraw it and resubmit when you have a patch that addresses the worst-case regression issue. David
Re: Improve the performance of nested loop join in the case of partitioned inner table
From
Daniel Gustafsson
Date:
> On 4 Jul 2023, at 14:02, David Rowley <dgrowleyml@gmail.com> wrote: > I'm going to mark this as waiting on author in the CF app. It might be > better if you withdraw it and resubmit when you have a patch that > addresses the worst-case regression issue. Since there hasn't been any updates to this thread I am marking this returned with feedback. Please feel free to resubmit to the next CF when the comments from David have been addressed. -- Daniel Gustafsson