RE: speeding up planning with partitions - Mailing list pgsql-hackers

From Kato, Sho
Subject RE: speeding up planning with partitions
Date
Msg-id 25C1C6B2E7BE044889E4FE8643A58BA963AF581E@G01JPEXMBKW03
Whole thread Raw
In response to speeding up planning with partitions  (Amit Langote <Langote_Amit_f8@lab.ntt.co.jp>)
Responses Re: speeding up planning with partitions
List pgsql-hackers
Hi, Amit

Great!
With the total number of records being 6400, I benchmarked while increasing the number of partitions from 100 to 6400.
Applying three all patches, 20% performance improved for 100 partitions.

I have the same number of records for each partition, do you do the same?

Also, in my case, performance was better when not prepare.
I think these patches do not improve execute case, so we need faster runtime pruning patch[1], right?

Details of measurement conditions and results are as follows.

- base source
  master(@777e6ddf17) + Speeding up Insert v8 patch[1]

- table definition(e.g. 100 partition)
  create table test.accounts(aid serial, abalance int) partition by range(aid);
  create table test.accounts_history(id serial, aid int, delta int, mtime timestamp without time zone) partition by
range(aid);
  
  create table test.account_part_1 partition of test.accounts for values from (1) to (65);
  create table test.account_part_2 partition of test.accounts for values from (65) to (129);
  ...
  create table test.account_part_100 partition of test.accounts for values from (6337) to (6400);
  
  create table test.ah_part_1 partition of test.accounts_history for values from (1) to (65);
  create table test.ah_part_2 partition of test.accounts_history for values from (65) to (129);
  ...
  create table test.ah_part_100 partition of test.accounts_history for values from (6337) to (6400);

- benchmark script
  \set aid random(1, 6400)
  \set delta random(-5000, 5000)
  BEGIN;
  UPDATE test.accounts SET abalance = abalance + :delta WHERE aid = :aid;
  SELECT abalance FROM test.accounts WHERE aid = :aid;
  INSERT INTO test.accounts_history (aid, delta, mtime) VALUES (:aid, :delta, CURRENT_TIMESTAMP);
  END;

- command option
  pgbench -d testdb -f benchmark.pgbench -T 180 -r -n -M prepare
  pgbench -d testdb -f benchmark.pgbench -T 180 -r -n
  
-results
  base source no prepared
   part_num |   tps_ex   | update_latency | select_latency | insert_latency 
  ----------+------------+----------------+----------------+----------------
        100 | 662.414805 |          0.357 |          0.265 |          0.421
        200 | 494.478431 |          0.439 |          0.349 |          0.579
        400 | 307.982089 |          0.651 |          0.558 |          0.927
        800 | 191.360676 |          0.979 |          0.876 |          1.548
       1600 |  75.344947 |          2.253 |          2.003 |          4.301
       3200 |  30.643902 |          5.716 |          4.955 |         10.118
       6400 |  16.726056 |         12.512 |          8.582 |         18.054

  0001 no prepared
   part_num |   tps_ex   | update_latency | select_latency | insert_latency 
  ----------+------------+----------------+----------------+----------------
        100 | 429.816329 |          0.811 |           0.75 |          0.365
        200 | 275.211531 |          1.333 |          1.248 |          0.501
        400 | 160.499833 |          2.384 |          2.252 |          0.754
        800 |  79.387776 |          4.935 |          4.698 |          1.468
       1600 |  24.787377 |         16.593 |         15.954 |          4.302
       3200 |   9.846421 |          42.96 |         42.139 |          8.848
       6400 |   4.919772 |          87.43 |         83.109 |          16.56

  0001 prepared
   part_num |   tps_ex   | update_latency | select_latency | insert_latency 
  ----------+------------+----------------+----------------+----------------
        100 | 245.100776 |          2.728 |          0.374 |          0.476
        200 | 140.249283 |          5.116 |          0.603 |          0.686
        400 |  67.608559 |         11.383 |          1.055 |          1.179
        800 |  23.085806 |         35.781 |          2.585 |          2.677
       1600 |   6.211247 |        141.093 |          7.096 |          6.785
       3200 |   1.808214 |        508.045 |         15.741 |         13.243
       6400 |   0.495635 |       1919.362 |         37.691 |         28.177

  0001 + 0002 no prepared
   part_num |   tps_ex   | update_latency | select_latency | insert_latency 
  ----------+------------+----------------+----------------+----------------
        100 |  682.53091 |          0.388 |           0.35 |           0.35
        200 | 469.906601 |          0.543 |          0.496 |           0.51
        400 | 321.915349 |           0.78 |          0.721 |          0.752
        800 | 201.620975 |          1.246 |          1.156 |          1.236
       1600 |  94.438204 |          2.612 |          2.335 |          2.745
       3200 |  38.292922 |          6.657 |          5.579 |          6.808
       6400 |   21.48462 |         11.989 |         10.104 |         12.601

  0001 + 0002 prepared
   part_num |   tps_ex   | update_latency | select_latency | insert_latency 
  ----------+------------+----------------+----------------+----------------
        100 |  591.10863 |          0.433 |          0.342 |          0.422
        200 | 393.223638 |          0.625 |          0.522 |          0.614
        400 | 253.672736 |          0.946 |          0.828 |          0.928
        800 | 146.840262 |          1.615 |          1.448 |          1.604
       1600 |  52.805593 |          4.656 |          3.811 |          4.473
       3200 |  21.461606 |          11.48 |           9.56 |         10.661
       6400 |  11.888232 |         22.869 |         16.841 |         18.871
       
  0001 + 0002 + 0003 no prepared
   part_num |   tps_ex   | update_latency | select_latency | insert_latency 
  ----------+------------+----------------+----------------+----------------
        100 | 798.962029 |          0.304 |          0.267 |          0.339
        200 | 577.893396 |          0.384 |          0.346 |          0.487
        400 | 426.542177 |          0.472 |          0.435 |          0.717
        800 | 288.616213 |           0.63 |          0.591 |          1.162
       1600 | 154.247034 |          1.056 |          0.987 |          2.384
       3200 |  59.711446 |          2.416 |          2.233 |          6.514
       6400 |  37.109761 |          3.387 |          3.099 |         11.762
       
  0001 + 0002 + 0003 prepared
   part_num |   tps_ex   | update_latency | select_latency | insert_latency 
  ----------+------------+----------------+----------------+----------------
        100 | 662.414805 |          0.357 |          0.265 |          0.421
        200 | 494.478431 |          0.439 |          0.349 |          0.579
        400 | 307.982089 |          0.651 |          0.558 |          0.927
        800 | 191.360676 |          0.979 |          0.876 |          1.548
       1600 |  75.344947 |          2.253 |          2.003 |          4.301
       3200 |  30.643902 |          5.716 |          4.955 |         10.118
       6400 |  16.726056 |         12.512 |          8.582 |         18.054
       
       
  Although it may not be related to this, when measured with pg11 beta2, somehow the performance was better.

  11beta2 + v1-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch[3] prepared
   part_num |   tps_ex    | update_latency | select_latency | insert_latency 
  ----------+-------------+----------------+----------------+----------------
     100    | 756.07228   |          0.942 |          0.091 |          0.123
   
[1] https://www.postgresql.org/message-id/CAKJS1f_QN-nmf6jCQ4gRU_8ab0zrd0ms-U%3D_Dj0KUARJiuGpOA%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAKJS1f9T_32Xpb-p8cWwo5ezSfVhXviUW8QTWncP8ksPHDRK8g%40mail.gmail.com
[3] https://www.postgresql.org/message-id/CAKJS1f_1RJyFquuCKRFHTdcXqoPX-PYqAd7nz%3DGVBwvGh4a6xA%40mail.gmail.com

regards,
Sho Kato
-----Original Message-----
From: Amit Langote [mailto:Langote_Amit_f8@lab.ntt.co.jp] 
Sent: Wednesday, August 29, 2018 9:06 PM
To: Pg Hackers <pgsql-hackers@postgresql.org>
Subject: speeding up planning with partitions

It is more or less well known that the planner doesn't perform well with more than a few hundred partitions even when
onlya handful of partitions are ultimately included in the plan.  Situation has improved a bit in PG
 
11 where we replaced the older method of pruning partitions one-by-one using constraint exclusion with a much faster
methodthat finds relevant partitions by using partitioning metadata.  However, we could only use it for SELECT queries,
becauseUPDATE/DELETE are handled by a completely different code path, whose structure doesn't allow it to call the new
pruningmodule's functionality.  Actually, not being able to use the new pruning is not the only problem for
UPDATE/DELETE,more on which further below.
 

While situation improved with new pruning where it could, there are still overheads in the way planner handles
partitions. As things stand today, it will spend cycles and allocate memory for partitions even before pruning is
performed,meaning most of that effort could be for partitions that were better left untouched.  Currently, planner will
lock,heap_open
 
*all* partitions, create range table entries and AppendRelInfos  for them, and finally initialize RelOptInfos for them,
eventouching the disk file of each partition in the process, in an earlier phase of planning.  All of that processing
isvain for partitions that are pruned, because they won't be included in the final plan.  This problem grows worse as
thenumber of partitions grows beyond thousands, because range table grows too big.
 

That could be fixed by delaying all that per-partition activity to a point where pruning has already been performed, so
thatwe know the partitions to open and create planning data structures for, such as somewhere downstream to
query_planner. But before we can do that we must do something about the fact that UPDATE/DELETE won't be able to cope
withthat because the code path that currently handles the planning of UPDATE/DELETE on partitioned tables
(inheritance_plannercalled from
 
subquery_planner) relies on AppendRelInfos for all partitions having been initialized by an earlier planning phase.
Delayingit to query_planner would be too late, because inheritance_planner calls query_planner for each partition, not
forthe parent.  That is, if query_planner, which is downstream to inheritance_planner, was in the charge of determining
whichpartitions to open, the latter wouldn't know which partitions to call the former for. :)
 

That would be fixed if there is no longer this ordering dependency, which is what I propose to do with the attached
patch0001.  I've tried to describe how the patch manages to do that in its commit message, but I'll summarize here.  As
thingsstand today, inheritance_planner modifies the query for each leaf partition to make the partition act as the
query'sresult relation instead of the original partitioned table and calls grouping_planner on the query.  That means
anythingthat's joined to partitioned table looks to instead be joined to the partition and join paths are generated
likewise. Also, the resulting path's targetlist is adjusted to be suitable for the result partition.  Upon studying how
thisworks, I concluded that the same result can be achieved if we call grouping_planner only once and repeat the
portionsof query_planner's and grouping_planner's processing that generate the join paths and appropriate target list,
respectively,for each partition.  That way, we can rely on query_planner determining result partitions for us, which in
turnrelies on the faster partprune.c based method of pruning.  That speeds things up in two ways.  Faster pruning and
weno longer repeat common processing for each partition.
 


With 0001 in place, there is nothing that requires that partitions be opened by an earlier planning phase, so, I
proposepatch 0002, which refactors the opening and creation of planner data structures for partitions such that it is
nowperformed after pruning. However, it doesn't do anything about the fact that partitions are all still locked in the
earlierphase.
 

With various overheads gone thanks to 0001 and 0002, locking of all partitions via find_all_inheritos can be seen as
thesingle largest bottleneck, which 0003 tries to address.  I've kept it a separate patch, because I'll need to think a
bitmore to say that it's actually to safe to defer locking to late planning, due mainly to the concern about the change
inthe order of locking from the current method.  I'm attaching it here, because I also want to show the performance
improvementwe can expect with it.
 


I measured the gain in performance due to each patch on a modest virtual machine.  Details of the measurement and
resultsfollow.
 

* Benchmark scripts

update.sql
update ht set a = 0 where b = 1;

select.sql
select * from ht where b = 1;

* Table:

create table ht (a int, b int) partition by hash (b) create table ht_1 partition of ht for values with (modulus N,
remainder0) ..
 
create table ht_N partition of ht for values with (modulus N, remainder N-1)

* Rounded tps with update.sql and select.sql against regular table (nparts = 0) and partitioned table with various
partitioncounts:
 

pgbench -n -T 60 -f update.sql

nparts  master    0001   0002   0003
======  ======    ====   ====   ====
0         2856    2893   2862   2816
8          507    1115   1447   1872
16         260     765   1173   1892
32         119     483    922   1884
64          59     282    615   1881
128         29     153    378   1835
256         14      79    210   1803
512          5      40    113   1728
1024         2      17     57   1616
2048         0*      9     30   1471
4096         0+      4     15   1236
8192         0=      2      7    975

* 0.46
+ 0.0064
= 0 (OOM on a virtual machine with 4GB RAM)

As can be seen here, 0001 is a big help for update queries.

pgbench -n -T 60 -f select.sql

For a select query that doesn't contain join and needs to scan only one
partition:

nparts  master    0001   0002   0003
======  ======    ====   ====   ====
0         2290    2329   2319   2268
8         1058    1077   1414   1788
16         711     729   1124   1789
32         450     475    879   1773
64         265     272    603   1765
128        146     149    371   1685
256        76      77    214   1678
512        39      39    112   1636
1024        16      17     59   1525
2048         8       9     29   1416
4096         4       4     15   1195
8192         2       2      7    932

Actually, here we get almost same numbers with 0001 as with master, because 0001 changes nothing for SELECT queries.
Westart seeing improvement with 0002, the patch to delay opening partitions.
 

Thanks,
Amit


pgsql-hackers by date:

Previous
From: Maksim Milyutin
Date:
Subject: Re: Hint to set owner for tablespace directory
Next
From: Alexander Korotkov
Date:
Subject: Re: Startup cost of sequential scan